Conditions for Convergence
If:
rewards are bounded (as before)
the training rule is:
$Q_n(s,a) \leftarrow (1-\alpha_n)\,Q_{n-1}(s,a) + \alpha_n\left(r + \gamma \max_{a'} Q_{n-1}(s',a')\right)$
$0 \le \gamma < 1$
$\sum_{i=1}^{\infty} \alpha_{n(i,s,a)} = \infty$, where $n(i,s,a)$ is the iteration corresponding
to the $i$th time action $a$ is applied in state $s$
$\sum_{i=1}^{\infty} \left(\alpha_{n(i,s,a)}\right)^2 < \infty$
Then $Q_n(s,a)$ converges to the true $Q(s,a)$ for every $s, a$ as $n \to \infty$, with probability 1.
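
The conditions above can be tried out on a small example. The sketch below is illustrative only: the two-state environment, its reward table, the uniform reward noise, and the names `q_learning` and `optimal_q` are all assumptions made for this example, not part of the original material. It uses the learning rate 1/k on the k-th visit to a pair (s, a), one common choice that satisfies both sum conditions, since the sum of 1/k diverges while the sum of 1/k^2 is finite.

```python
import random

GAMMA = 0.5  # discount factor, 0 <= gamma < 1

# Hypothetical toy MDP (an assumption for illustration):
# action 0 stays in the current state, action 1 switches to the other state.
NEXT_STATE = [[0, 1], [1, 0]]
# Mean reward for each (state, action); bounded noise is added when sampling.
MEAN_REWARD = [[0.0, 1.0], [2.0, 0.0]]


def q_learning(steps, seed=0):
    """Tabular Q-learning with alpha = 1 / (number of visits to (s, a))."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]
    visits = [[0, 0], [0, 0]]
    s = 0
    for _ in range(steps):
        # Uniformly random actions, so every (s, a) pair keeps being visited.
        a = rng.randrange(2)
        # Bounded, noisy reward: mean reward plus uniform noise in [-0.5, 0.5].
        r = MEAN_REWARD[s][a] + rng.uniform(-0.5, 0.5)
        s_next = NEXT_STATE[s][a]

        visits[s][a] += 1
        alpha = 1.0 / visits[s][a]  # sum(alpha) diverges, sum(alpha^2) is finite

        # The training rule from the conditions above.
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + GAMMA * max(q[s_next]))
        s = s_next
    return q


def optimal_q(sweeps=200):
    """Reference Q values for the toy MDP, computed by repeated Bellman backups."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        q = [
            [MEAN_REWARD[s][a] + GAMMA * max(q[NEXT_STATE[s][a]]) for a in range(2)]
            for s in range(2)
        ]
    return q


if __name__ == "__main__":
    print("learned:", q_learning(200_000))
    print("optimal:", optimal_q())
```

For this toy environment the reference values work out to Q(0,0) = 1.5, Q(0,1) = 3, Q(1,0) = 4, Q(1,1) = 1.5, and the learned table should approach them as the number of steps grows. A constant learning rate would violate the second condition (its squared sum diverges), so with noisy rewards the estimates would keep fluctuating rather than settling.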