CS 104: Introduction to Computer Science

How to Learn Q

•How can we estimate training values for Q, given only a sequence of immediate rewards spread out over time?

•Iterative approximation method

•Note: V*(s) = max_{a'} Q(s, a')

•Given: Q(s, a) = r(s, a) + γ V*(δ(s, a))

•Therefore:
Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
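
The recursive definition above can be turned into an iterative approximation by starting from a table of zeros and repeatedly replacing each entry with the right-hand side of the equation. The following is a minimal sketch, not a definitive implementation. It assumes a deterministic world in which the second term's multiplier is a discount factor (0.9 here) and the inner function is the state transition function. The four-state chain, the action names, the reward of 100, and the helper names (delta, r, learn_q, v_star) are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy world: states 0..3 in a row, state 3 is an absorbing goal.
GAMMA = 0.9                     # assumed discount factor
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GOAL = 3


def delta(s, a):
    """Deterministic transition function: the state reached from s via a."""
    if s == GOAL:
        return GOAL             # the goal is absorbing
    return max(s - 1, 0) if a == "left" else s + 1


def r(s, a):
    """Immediate reward: 100 for stepping into the goal, 0 otherwise."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


def learn_q(sweeps=50):
    """Iteratively approximate Q by applying the recursive update to every (s, a)."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                s_next = delta(s, a)
                # Q(s,a) <- r(s,a) + gamma * max over a' of Q(delta(s,a), a')
                q[(s, a)] = r(s, a) + GAMMA * max(q[(s_next, a2)] for a2 in ACTIONS)
    return q


def v_star(q, s):
    """V*(s) is the best Q value available in state s."""
    return max(q[(s, a)] for a in ACTIONS)


if __name__ == "__main__":
    q = learn_q()
    for s in STATES:
        print(s, {a: round(q[(s, a)], 2) for a in ACTIONS})
```

In this toy world the reward is only received on the final step, yet after a few sweeps the estimates for earlier states become nonzero as well, because each update copies discounted value backward from the successor state.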