Q Learning
Can't learn optimal policy π* directly – no <s,a>
training pairs
However, the agent can seek to learn V*, the value
function of the optimal policy:
π*(s) = argmax_a [r(s,a) + γV*(δ(s,a))]
recall V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …
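The discounted sum recalled above can be checked numerically. The following is a minimal sketch, assuming a finite reward sequence; the function name `discounted_return` and the sample rewards are illustrative and do not come from the slides.

```python
def discounted_return(rewards, gamma):
    """Return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite list."""
    total = 0.0
    for i, reward in enumerate(rewards):
        # The reward received i steps in the future is weighted by gamma**i.
        total += (gamma ** i) * reward
    return total


# Hypothetical episode: no reward for two steps, then a reward of 100.
example = discounted_return([0, 0, 100], gamma=0.9)
print(example)
```

With gamma = 0.9 the reward of 100 is weighted by 0.9² = 0.81, so a later reward counts for less than the same reward received immediately.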
V* only yields π* if you have perfect knowledge of r
and δ (the agent doesn't necessarily)
Let Q(s,a) = r(s,a) + γV*(δ(s,a))
Then π*(s) = argmax_a Q(s,a)
Learn Q → learn π* without knowledge of r and δ!
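To make the definition of Q concrete, here is a small sketch on a hypothetical four-state deterministic chain world (states 0 to 3, goal at state 3, reward 100 for stepping into the goal). The world, the value-iteration helper, and all names are assumptions for illustration. The sketch builds Q from r, δ and V*, then shows that action selection needs only the Q table. It does not show how Q would be learned from experience.

```python
GAMMA = 0.9
GOAL = 3
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]


def delta(s, a):
    """Deterministic transition function of the assumed toy world."""
    if s == GOAL:
        return GOAL  # the goal is absorbing
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)


def r(s, a):
    """Reward 100 for stepping into the goal, 0 otherwise."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


def value_iteration(sweeps=50):
    """Approximate V* by repeated Bellman backups (uses r and delta)."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(r(s, a) + GAMMA * v[delta(s, a)] for a in ACTIONS)
             for s in STATES}
    return v


v_star = value_iteration()

# Q(s,a) = r(s,a) + gamma * V*(delta(s,a)), exactly as defined on the slide.
Q = {(s, a): r(s, a) + GAMMA * v_star[delta(s, a)]
     for s in STATES for a in ACTIONS}


def greedy_action(q, s):
    """pi*(s) = argmax_a Q(s,a); note that r and delta are not consulted."""
    return max(ACTIONS, key=lambda a: q[(s, a)])


policy = {s: greedy_action(Q, s) for s in STATES if s != GOAL}
print(policy)
```

Once the table Q is available, `greedy_action` never calls `r` or `delta`. This is the point of the last line of the slide: an agent that learns Q can act optimally without a model of the rewards or transitions.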