• $Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$
• Our previous update rule fails to converge.
  – Suppose we start with the correct Q function.
  – Even then, nondeterminism will keep changing Q forever.
• Need to slow the change to Q over time:
• Let $\alpha_n = 1/(1 + \mathrm{visits}_n(s,a))$ (including current visit)
• $Q_n(s,a) \leftarrow (1-\alpha_n)(\text{old estimate}) + \alpha_n(\text{new estimate})$
• $Q_n(s,a) \leftarrow (1-\alpha_n)\,Q_{n-1}(s,a) + \alpha_n \left(r + \gamma \max_{a'} Q_{n-1}(s',a')\right)$
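The two rules can be compared on a toy problem. The sketch below is illustrative only and everything in it is an assumption, not part of the original slide: a single state `"s"` with a single action `"a"`, a reward of 0 or 1 with equal probability, a discount of 0.5, and the hypothetical helper names `naive_update`, `decayed_update`, and `run`. Under these assumptions the correct value is Q = 0.5 / (1 - 0.5) = 1, so both runs start from the correct Q function.

```python
import random
from collections import defaultdict

GAMMA = 0.5  # discount factor (assumed value for this toy problem)


def naive_update(Q, s, a, r, s_next, actions):
    """Undamped rule: overwrite Q(s,a) with the newest sampled estimate."""
    Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in actions)


def decayed_update(Q, visits, s, a, r, s_next, actions):
    """Blend old and new estimates with alpha_n = 1 / (1 + visits_n(s,a))."""
    visits[(s, a)] += 1  # the count includes the current visit
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)  # new estimate
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target


def run(update_kind, steps=20000, seed=0):
    """Toy problem: one state, one action, reward 0 or 1 with equal probability."""
    rng = random.Random(seed)
    actions = ["a"]
    Q = defaultdict(float)
    Q[("s", "a")] = 1.0  # start from the correct value 0.5 / (1 - GAMMA)
    visits = defaultdict(int)
    history = []
    for _ in range(steps):
        r = rng.choice([0.0, 1.0])  # nondeterministic reward
        if update_kind == "naive":
            naive_update(Q, "s", "a", r, "s", actions)
        else:
            decayed_update(Q, visits, "s", "a", r, "s", actions)
        history.append(Q[("s", "a")])
    return history


naive_hist = run("naive")
decayed_hist = run("decayed")
```

Comparing the last stretch of each history, for example `max(naive_hist[-100:]) - min(naive_hist[-100:])` against the same quantity for `decayed_hist`, should show the undamped estimate still jumping with every sampled reward while the decayed estimate barely moves and stays close to 1.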