CS 104: Introduction to Computer Science

Conditions for Convergence

• If:

  – rewards are bounded (as before)

  – the training rule is:

    $Q_n(s,a) \leftarrow (1-\alpha_n)\,Q_{n-1}(s,a) + \alpha_n\left(r + \gamma \max_{a'}\left[Q_{n-1}(s',a')\right]\right)$

  – $0 \le \gamma < 1$

  – $\sum_{i=1}^{\infty} \alpha_{n(i,s,a)} = \infty$, where $n(i,s,a)$ is the iteration corresponding to the $i$th time action $a$ is applied to state $s$

  – $\sum_{i=1}^{\infty} \left(\alpha_{n(i,s,a)}\right)^2 < \infty$
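As an illustration of the training rule and the two step-size conditions listed above, here is a minimal Python sketch. It assumes a tabular Q stored in a dictionary and a step size of 1 / (number of visits to the pair (s, a)), which is one common choice that satisfies both sum conditions. The names `q_update`, `visits`, and `actions`, and the one-state toy problem, are hypothetical and not part of the slide.

```python
from collections import defaultdict


def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """Apply one step of the training rule to the pair (s, a).

    Assumption: the step size is 1 / (visit count of (s, a)). The harmonic
    series diverges while the sum of 1/i^2 is finite, so this schedule meets
    both step-size conditions.
    """
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]
    # max over a' of Q_{n-1}(s', a')
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return alpha


# Hypothetical toy problem: one state, one action, reward 1 on every step.
# With gamma = 0.5 the true value is 1 / (1 - 0.5) = 2.
Q = defaultdict(float)
visits = defaultdict(int)
for _ in range(10_000):
    q_update(Q, visits, "s0", "stay", 1.0, "s0", ["stay"], gamma=0.5)
print(Q[("s0", "stay")])  # approaches 2 as the number of updates grows
```

A constant step size would break the second condition, since the sum of its squares diverges. A step size that shrinks faster than 1/i, such as 1/i^2, would break the first.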

• Then $Q$ will converge correctly as $n \to \infty$ with probability