• Q(s,a) = r(s,a) + γ max_a'[Q(δ(s,a), a')]

• We don't know r and δ, but this forms the basis for an iterative update based on an observed action transition and reward.

• Having taken action a in state s, and finding oneself in state s' with immediate reward r (see the sketch below):

  Q(s,a) ← r + γ max_a'[Q(s',a')]

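A minimal sketch of this iterative update in Python, assuming a tabular setting and a hypothetical environment interface (env.reset, env.step, env.actions) that is not part of the original notes:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, gamma=0.9, epsilon=0.1):
    """Learn Q(s, a) from observed transitions (s, a, r, s')."""
    Q = defaultdict(float)  # Q[(state, action)] -> current estimate

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy choice of action a in state s.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])

            # Take action a, observe immediate reward r and next state s'.
            s_next, r, done = env.step(a)

            # Q(s,a) <- r + gamma * max_a' Q(s', a')
            if done:
                best_next = 0.0
            else:
                best_next = max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] = r + gamma * best_next

            s = s_next
    return Q
```

The direct assignment mirrors the update rule on the slide, which assumes deterministic rewards and transitions; in a stochastic environment the new estimate would instead be blended with the old one via a learning rate.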