• Can't learn the optimal policy π* directly – no <s,a> training pairs
• However, the agent can seek to learn V* – the value function of the optimal policy:
  – π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
  – recall Vπ(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …
  – can only learn V* this way if you have perfect knowledge of r and δ (which you don't necessarily have)
  – let Q(s,a) = r(s,a) + γ V*(δ(s,a))
  – then π*(s) = argmax_a Q(s,a)
• Learn Q → learn π* without knowledge of r and δ! (see the sketch below)
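
To make the last bullet concrete, here is a minimal tabular Q-learning sketch in Python. The environment (a 4x4 grid world with a +10 goal reward) and every name in it (GRID_SIZE, step, greedy_action, the choice of γ = 0.9, the purely random exploration policy) are illustrative assumptions, not part of the slides. It demonstrates exactly the point above: the learner never evaluates r or δ as known functions; it only sees sampled transitions (s, a, r, s'), pushes Q(s,a) toward r + γ·max_a' Q(s',a'), and then reads off π*(s) as the action maximizing Q(s,a).

import random

GRID_SIZE = 4                         # illustrative 4x4 grid; states are (row, col) tuples
GOAL = (3, 3)                         # moving into the goal yields reward +10
ACTIONS = ["up", "down", "left", "right"]
GAMMA = 0.9                           # discount factor (the slides' gamma)
EPISODES = 500

def step(state, action):
    """The environment's delta(s,a) and r(s,a). The learner treats this as a
    black box: it only ever sees the (next_state, reward) samples it returns."""
    row, col = state
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, GRID_SIZE - 1)
    elif action == "left":
        col = max(col - 1, 0)
    else:                             # "right"
        col = min(col + 1, GRID_SIZE - 1)
    next_state = (row, col)
    reward = 10.0 if next_state == GOAL else 0.0
    return next_state, reward

# Q-table: Q[(s, a)] estimates r(s,a) + gamma * V*(delta(s,a))
Q = {((r, c), a): 0.0
     for r in range(GRID_SIZE) for c in range(GRID_SIZE) for a in ACTIONS}

def greedy_action(state):
    """pi*(s) = the action a maximizing Q(s,a)."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for _ in range(EPISODES):
    state = (0, 0)
    for _ in range(1000):                          # cap episode length
        if state == GOAL:
            break                                  # goal is absorbing; end the episode
        action = random.choice(ACTIONS)            # explore; Q-learning is off-policy
        next_state, reward = step(state, action)   # sampled experience only
        # Deterministic-world update: Q(s,a) <- r + gamma * max_a' Q(s',a')
        Q[(state, action)] = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        state = next_state

# The greedy policy w.r.t. the learned Q recovers pi* with no model of r or delta.
print("pi*((0,0)) =", greedy_action((0, 0)))                  # "down" (ties with "right")
print("Q((3,2), right) =", round(Q[((3, 2), "right")], 2))    # 10.0

Because Q-learning is off-policy, even the purely random behavior policy used above is enough: once every state-action pair has been visited often enough, the greedy policy extracted from the learned Q is optimal.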