© 2000 Todd Neller.  A.I.M.A. text figures © 1995 Prentice Hall.  Used by permission.
How to Learn Q
•How to estimate training values for Q, given a sequence of immediate rewards spread out over time?
•Iterative approximation method
•Note: V*(s) = max_a' [Q(s, a')]
•Given: Q(s, a) = r(s, a) + γ V*(δ(s, a))
•Therefore:
Q(s, a) = r(s, a) + γ max_a' [Q(δ(s, a), a')]
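The recursive definition above suggests one way to carry out the iterative approximation: start with an all-zero table of Q estimates and repeatedly overwrite each entry with the right-hand side of the equation. The sketch below is only an illustration under assumptions not found in the slide: a hypothetical four-state corridor with an absorbing goal, a reward of 100 for entering the goal, a discount of 0.9, and the names `delta`, `reward`, `learn_q`, and `v_star`, all invented for this example.

```python
# Minimal sketch of iterative Q approximation in a deterministic world.
# The corridor world, reward values, and all names are illustrative assumptions.

GAMMA = 0.9                      # discount factor (assumed value)
STATES = [0, 1, 2, 3]            # hypothetical corridor; state 3 is the goal
ACTIONS = ["left", "right"]
GOAL = 3


def delta(s, a):
    """Deterministic transition function: the state reached from s via a."""
    if s == GOAL:
        return s                 # the goal is absorbing
    return max(s - 1, 0) if a == "left" else s + 1


def reward(s, a):
    """Immediate reward r(s, a): 100 for entering the goal, otherwise 0."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


def learn_q(sweeps=50):
    """Repeatedly apply Q(s,a) <- r(s,a) + GAMMA * max_a' Q(delta(s,a), a')."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                s_next = delta(s, a)
                best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                q[(s, a)] = reward(s, a) + GAMMA * best_next
    return q


def v_star(q, s):
    """V*(s) = max_a' Q(s, a'), read off the learned table."""
    return max(q[(s, a)] for a in ACTIONS)


q = learn_q()
for s in STATES:
    print(s, {a: round(q[(s, a)], 2) for a in ACTIONS}, "V* =", round(v_star(q, s), 2))
```

Sweeping over every state-action pair is a simplification chosen for brevity; an agent that learns from experience would instead update only the pairs it actually visits, using the reward and next state it observes.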