•Can't learn optimal policy π* directly – no <s,a> training pairs
•However, agent can seek to learn V* – the value function of the optimal policy:
–π*(s) = argmax_a [r(s,a) + γV*(δ(s,a))]
–recall V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …
–can only learn V* if you have perfect knowledge of r and δ (which the agent doesn't necessarily have)
–let Q(s,a) = r(s,a) + γV*(δ(s,a))
–Then π*(s) = argmax_a Q(s,a)
•Learn Q → learn π* without knowledge of r and δ!
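
The contrast between the two routes can be sketched in a few lines of Python. This is a minimal illustration, not the method from the slides: the 4-state chain world, the reward of 100 for entering the goal, γ = 0.9, and all function names are assumptions made up for the example. The learner uses the standard deterministic-world update Q(s,a) ← r + γ max_a' Q(s',a'), which the bullets above do not spell out. The first route computes V* and needs r and δ as callable functions; the second only sees sampled (s, a, r, s') transitions.

```python
import random

# Hypothetical toy world (an assumption for illustration):
# states 0..3 in a chain, state 3 is an absorbing goal.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GOAL = 3
GAMMA = 0.9


def delta(s, a):
    """Deterministic transition function δ(s, a)."""
    if s == GOAL:
        return s  # absorbing state
    return max(s - 1, 0) if a == "left" else s + 1


def reward(s, a):
    """Reward r(s, a): 100 for entering the goal, 0 otherwise."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


def value_iteration(sweeps=100):
    """Route 1: compute V*. Requires full knowledge of r and δ."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: max(reward(s, a) + GAMMA * V[delta(s, a)] for a in ACTIONS)
             for s in STATES}
    return V


def policy_from_v(V, s):
    """π*(s) = argmax_a [r(s,a) + γV*(δ(s,a))]; again needs r and δ."""
    return max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * V[delta(s, a)])


def q_learning(episodes=500, seed=0):
    """Route 2: learn Q from experience only.

    reward() and delta() are called here solely to simulate the
    environment's response; the update itself uses just the observed
    reward and next state.
    """
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice(STATES[:-1])  # start in a random non-goal state
        while s != GOAL:
            a = rng.choice(ACTIONS)  # purely random exploration
            r, s_next = reward(s, a), delta(s, a)  # observed transition
            # Deterministic-world update: Q(s,a) <- r + γ max_a' Q(s',a')
            Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
            s = s_next
    return Q


def policy_from_q(Q, s):
    """π*(s) = argmax_a Q(s,a); no r or δ needed."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


if __name__ == "__main__":
    V = value_iteration()
    Q = q_learning()
    for s in STATES[:-1]:
        print(s, policy_from_v(V, s), policy_from_q(Q, s))
```

In this toy world both routes should pick the same action in every non-goal state, but only `policy_from_q` can be evaluated by an agent that has no access to `reward` and `delta`.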