CS 104: Introduction to Computer Science

How to Learn Q


•	How can we estimate training values for Q, given
	only a sequence of immediate rewards spread out
	over time?

•	Iterative approximation method

•	Note: V*(s) = max_a'[Q(s,a')]


•	Given: Q(s,a) = r(s,a) + γV*(δ(s,a))

•	Therefore:
	Q(s,a) = r(s,a) + γ max_a'[Q(δ(s,a),a')]
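
The recursive definition above can be turned into an iterative approximation by starting from Q = 0 and repeatedly applying the right-hand side as an update. The sketch below is only an illustration under stated assumptions: the three-state chain world, the reward of 100 for entering the goal, the discount factor of 0.9, and the names delta, r, v_star and learn_q are all hypothetical and not taken from the slide. It also sweeps over every state-action pair, whereas a learning agent would usually update only the pairs it actually visits.

```python
# Hypothetical deterministic world: states 0 - 1 - 2, where 2 is an absorbing goal.
GAMMA = 0.9                  # discount factor (assumed value)
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
GOAL = 2


def delta(s, a):
    """Deterministic transition function: next state after action a in state s."""
    if s == GOAL:
        return GOAL          # the goal state is absorbing
    return max(s - 1, 0) if a == "left" else s + 1


def r(s, a):
    """Immediate reward: 100 for the move that enters the goal, 0 otherwise."""
    return 100.0 if (s != GOAL and delta(s, a) == GOAL) else 0.0


def v_star(Q, s):
    """V*(s) = max over a' of Q(s, a'), using the current estimate of Q."""
    return max(Q[(s, a)] for a in ACTIONS)


def learn_q(sweeps=50):
    """Repeatedly apply Q(s,a) <- r(s,a) + GAMMA * max_a' Q(delta(s,a), a')."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                Q[(s, a)] = r(s, a) + GAMMA * v_star(Q, delta(s, a))
    return Q


Q = learn_q()
for s in STATES:
    for a in ACTIONS:
        print(s, a, round(Q[(s, a)], 2))
```

In this toy world the estimates stop changing after a few sweeps: the value of the rewarded move propagates backward one state per sweep, scaled by the discount factor each time.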