• If:
  – rewards are bounded (as before)
  – the training rule is:

        Q_n(s,a) ← (1 − α_n) Q_{n−1}(s,a) + α_n (r + γ max_{a′} Q_{n−1}(s′,a′))

  – 0 ≤ γ < 1
  – Σ_{i=1}^∞ α_{n(i,s,a)} = ∞, where n(i,s,a) is the iteration corresponding to the ith time a is applied to s
  – Σ_{i=1}^∞ (α_{n(i,s,a)})² < ∞
• Then Q will converge correctly as n → ∞ with probability 1.
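As an illustration of the learning-rate conditions, here is a minimal sketch (not from the slide) of tabular Q-learning where the ith update of a pair (s,a) uses α = 1/i, which satisfies both conditions: Σ 1/i = ∞ and Σ 1/i² < ∞. The tiny two-state MDP, the choice γ = 0.5, and the bounded reward noise are all illustrative assumptions.

```python
import random

def q_learning(steps=100_000, gamma=0.5, seed=0):
    # Illustrative MDP (an assumption, not from the slide): two states,
    # two actions; action a moves to state a; landing in state 1 pays
    # reward 1, plus bounded noise in [-0.1, 0.1].
    rng = random.Random(seed)
    states, actions = [0, 1], [0, 1]
    Q = {(s, a): 0.0 for s in states for a in actions}
    visits = {(s, a): 0 for s in states for a in actions}

    s = 0
    for _ in range(steps):
        a = rng.choice(actions)          # exploratory policy: visits every (s, a)
        s2 = a                           # deterministic transition
        r = (1.0 if s2 == 1 else 0.0) + rng.uniform(-0.1, 0.1)

        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]     # ith visit to (s, a) gets alpha = 1/i

        # The training rule from the slide:
        # Q_n(s,a) <- (1 - a_n) Q_{n-1}(s,a) + a_n (r + g max_a' Q_{n-1}(s',a'))
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s2
    return Q
```

For this MDP the true values are Q*(s,1) = 1/(1 − γ) = 2 and Q*(s,0) = γ · 2 = 1 for both states, and the learned Q should approach them; note that with α = 1/i each Q(s,a) is effectively a running average of its targets, which is exactly why the noise in r washes out.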