How to Learn Q
How can we estimate training values for Q, given only a
sequence of immediate rewards spread out over time?
Iterative approximation method
Note: V*(s) = max_a' Q(s,a')
Given: Q(s,a) = r(s,a) + γ V*(δ(s,a))
Therefore:
Q(s,a) = r(s,a) + γ max_a' Q(δ(s,a),a')
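
A minimal sketch of how the recursive definition can drive an iterative approximation, writing the discount factor as GAMMA and the transition function as delta. The 4-state chain, the reward of 100 for entering the goal, the value GAMMA = 0.9, and the names delta, reward, and learn_q are all illustrative assumptions, not part of the original material. The estimate starts at zero and each entry is repeatedly replaced by the right-hand side of the recursion.

```python
# Hypothetical deterministic world: states 0..3 in a row, state 3 is an
# absorbing goal. All names and numbers here are illustrative assumptions.
GAMMA = 0.9                      # discount factor
STATES = range(4)
ACTIONS = ("left", "right")
GOAL = 3


def delta(s, a):
    """Deterministic transition function: the state reached from s via a."""
    if s == GOAL:
        return s                 # the goal is absorbing
    return max(s - 1, 0) if a == "left" else s + 1


def reward(s, a):
    """Immediate reward r(s, a): 100 for entering the goal, otherwise 0."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


def learn_q(sweeps=50):
    """Iteratively approximate Q, starting from an all-zero estimate."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                s_next = delta(s, a)
                # Q(s,a) <- r(s,a) + GAMMA * max over a' of Q(s_next, a')
                best_next = max(q[(s_next, b)] for b in ACTIONS)
                q[(s, a)] = reward(s, a) + GAMMA * best_next
    return q


q = learn_q()
for s in STATES:
    print(s, {a: round(q[(s, a)], 2) for a in ACTIONS})
```

For simplicity this sketch sweeps over every state-action pair using known delta and reward functions. An agent learning from experience would instead apply the same update only to the transitions it actually observes.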