• How to estimate training values for Q, given a sequence of immediate rewards spread out over time?
• Iterative approximation method
• Note: $V^*(s) = \max_{a'} Q(s, a')$
• Given: $Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a))$
• Therefore: $Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a')$
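The recursive definition of Q can be turned into an iterative approximation by repeatedly applying it as an update rule to a table of estimates. The sketch below is a minimal illustration, not the slide author's implementation. It assumes a hypothetical deterministic four-state chain world, a discount factor of 0.9, a reward of 100 for entering the goal state, and full sweeps over every state-action pair rather than samples from an agent's experience. The names `delta`, `reward`, and `iterate_q` are invented for this example; `delta` and `reward` stand for the transition function (written d or δ on the slide) and the immediate reward r.

```python
GAMMA = 0.9                      # discount factor (assumed value)
STATES = [0, 1, 2, 3]            # hypothetical chain world; state 3 is the goal
ACTIONS = ["left", "right"]
GOAL = 3


def delta(s, a):
    """Deterministic transition function: next state after action a in state s."""
    if s == GOAL:
        return s                 # the goal is absorbing
    return max(s - 1, 0) if a == "left" else s + 1


def reward(s, a):
    """Immediate reward r(s, a): 100 for entering the goal, 0 otherwise."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


def iterate_q(sweeps=50):
    """Repeatedly apply Q(s,a) <- r(s,a) + GAMMA * max_a' Q(delta(s,a), a')."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initial estimates
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                s_next = delta(s, a)
                best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                q[(s, a)] = reward(s, a) + GAMMA * best_next
    return q


if __name__ == "__main__":
    q = iterate_q()
    for s in STATES:
        # V*(s) is recovered from the table as the max over actions.
        print(s, max(q[(s, a)] for a in ACTIONS))
```

Under these assumptions the estimates settle at Q(2, right) = 100, Q(1, right) = 90, and Q(0, right) = 81 (up to floating-point rounding), so the reward received at the goal propagates backward through the chain, discounted by 0.9 per step.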