CS 104: Introduction to Computer Science

•

Q(s,a) = E[r(s,a)] + γ Σ_s' P(s'|s,a) max_a' Q(s',a')
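As a rough illustration, the sketch below treats the right-hand side of this equation as an update and applies it repeatedly until the values stop changing. The two-state MDP, the tables P and R, the discount of 0.9, and the function names are all invented for the example and are not from the lecture.

```python
# Hypothetical two-state, two-action MDP (illustrative numbers only).
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[(s, a)] maps next state -> probability.
# R[(s, a)] is the expected immediate reward E[r(s,a)].
P = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.1, 1: 0.9},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {1: 1.0},
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}


def bellman_backup(Q):
    """Apply the right-hand side of the Q equation once to every (s, a)."""
    return {
        (s, a): R[(s, a)] + GAMMA * sum(
            p * max(Q[(s2, a2)] for a2 in ACTIONS)
            for s2, p in P[(s, a)].items()
        )
        for s in STATES
        for a in ACTIONS
    }


def solve_q(tol=1e-10, max_iters=10_000):
    """Repeat the backup until the largest change falls below tol."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(max_iters):
        new_Q = bellman_backup(Q)
        if max(abs(new_Q[k] - Q[k]) for k in Q) < tol:
            return new_Q
        Q = new_Q
    return Q


Q_star = solve_q()
```

This only works when P and R are known in advance. The learning rules discussed next are for the case where they must be estimated from sampled experience.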

•

Our previous update rule fails to converge when rewards or transitions are nondeterministic.

–

Suppose we start with the correct Q function.

–

Nondeterminism will keep changing Q on every update, forever.
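A small sketch of this failure, assuming the previous rule is the deterministic update that overwrites Q(s,a) with r plus the discounted maximum over the next state's Q values (the slide does not restate it). The single state-action pair, the coin-flip reward, and the terminal next state are made up for illustration. The correct value here is 0.5, yet every update throws it away.

```python
import random


def overwrite_updates(q_start, n_steps, seed=0):
    """Apply the assumed old rule n_steps times to one (s, a) pair.

    The next state is terminal, so the discounted max term is 0
    and the rule reduces to Q <- r.
    """
    rng = random.Random(seed)
    q = q_start
    history = []
    for _ in range(n_steps):
        r = rng.choice([0.0, 1.0])  # noisy reward with expected value 0.5
        q = r                       # the old estimate is fully overwritten
        history.append(q)
    return history


# Start from the correct value 0.5 and watch it get replaced.
history = overwrite_updates(q_start=0.5, n_steps=20)
```

Every entry of history is either 0.0 or 1.0, so the estimate never returns to the correct 0.5 no matter how long the loop runs.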

•

Need to slow change to Q over time:

•

Let α_n = 1/(1 + visits_n(s,a)) (including current visit)

•

Q_n(s,a) ← (1 - α_n)(old estimate) + α_n(new estimate)

•

Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_a' Q_{n-1}(s',a'))
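The update above can be sketched as follows. This is a minimal tabular version under stated assumptions: the class name, the discount of 0.9, the zero initial estimates, and the terminal flag are all hypothetical choices for the example, not part of the lecture.

```python
from collections import defaultdict

GAMMA = 0.9  # assumed discount factor


class DecayingQLearner:
    """Tabular Q-learning with learning rate 1 / (1 + visits(s, a))."""

    def __init__(self, actions, gamma=GAMMA):
        self.actions = list(actions)
        self.gamma = gamma
        self.q = defaultdict(float)     # Q estimates, 0.0 until updated
        self.visits = defaultdict(int)  # visit counts per (s, a)

    def update(self, s, a, r, s_next, terminal=False):
        """Blend the old estimate with the newly sampled one."""
        self.visits[(s, a)] += 1  # count includes the current visit
        alpha = 1.0 / (1 + self.visits[(s, a)])
        # `terminal` is a hypothetical flag: no future value after the episode ends.
        future = 0.0 if terminal else max(
            self.q[(s_next, a2)] for a2 in self.actions
        )
        new_estimate = r + self.gamma * future
        self.q[(s, a)] = (1 - alpha) * self.q[(s, a)] + alpha * new_estimate
        return self.q[(s, a)]


# One state-action pair with noisy rewards and a terminal next state.
learner = DecayingQLearner(actions=[0])
for reward in [1.0, 0.0, 1.0, 1.0]:
    learner.update("s", 0, reward, None, terminal=True)
```

With a terminal next state, the estimate after n updates is the average of the initial estimate and the n sampled rewards, so each new sample moves Q less than the one before it. That is the slowing-down the rule is designed to produce.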