• Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
• We don't know r and δ, but this equation forms the basis for an iterative update driven by each observed action, transition, and reward
• Having taken action a in state s, and finding oneself in state s' with immediate reward r:

Q(s, a) ← r + γ max_{a'} Q(s', a')
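As an illustration (not from the original slides), here is a minimal sketch of this tabular update for a deterministic environment in Python. The environment interface (reset/step), the epsilon-greedy action choice, and the parameter names are assumptions made for the example:

from collections import defaultdict
import random

def q_learning(env, actions, gamma=0.9, episodes=1000, epsilon=0.1):
    # Q-table: Q[(s, a)] defaults to 0 for unseen state-action pairs
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                    # assumed: returns the start state
        done = False
        while not done:
            # epsilon-greedy: occasionally explore a random action
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)  # assumed interface
            # the update from the slide: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q

Note that only the single line implementing Q(s, a) ← r + γ max_{a'} Q(s', a') is the update itself; everything else is bookkeeping needed to observe transitions and rewards without knowing r and δ in advance.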