– rewards are bounded (as before)
– the training rule is:
  Q_n(s,a) ← (1-α_n)Q_{n-1}(s,a) + α_n(r + γ max_{a'} Q_{n-1}(s',a'))
– Σ_{i=1}^∞ α_{n(i,s,a)} = ∞, where n(i,s,a) is the iteration corresponding to the ith time action a is applied to state s
– Σ_{i=1}^∞ (α_{n(i,s,a)})² < ∞
• Then Q_n will converge correctly as n → ∞ with probability 1.
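The update rule and both step-size conditions can be sketched in code. A per-pair learning rate α = 1/(number of visits to (s,a)) satisfies both conditions, since Σ 1/i diverges while Σ 1/i² converges. The tiny two-state MDP below is invented purely for illustration; γ = 0.5 and the reset probability are arbitrary choices, not from the source.

```python
# Sketch of the tabular Q-learning update above on a hypothetical MDP:
# states 0 and 1; actions 'stay' and 'move'; 'move' goes to state 1,
# reward 1 for landing in state 1, else 0 (rewards are bounded).
from collections import defaultdict
import random

GAMMA = 0.5                      # assumed discount factor for this sketch
ACTIONS = ['stay', 'move']
Q = defaultdict(float)           # Q[(s, a)], implicitly 0 at the start
visits = defaultdict(int)        # counts the ith time a is applied to s

def step(s, a):
    s2 = 1 if a == 'move' else s
    return (1.0 if s2 == 1 else 0.0), s2

random.seed(0)
s = 0
for _ in range(20000):
    a = random.choice(ACTIONS)              # exploratory (random) policy
    r, s2 = step(s, a)
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]            # Σα = ∞ and Σα² < ∞ hold
    target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
    # Q_n(s,a) ← (1-α_n) Q_{n-1}(s,a) + α_n (r + γ max_a' Q_{n-1}(s',a'))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    s = 0 if random.random() < 0.1 else s2  # occasional reset keeps every
                                            # (s,a) pair visited forever
```

With γ = 0.5 the true values are Q(0,'move') = 2 and Q(0,'stay') = 1, and the estimates approach them as the visit counts grow.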