• Will the learned Q converge to the true value of Q (for the optimal policy)?
• Yes, under certain conditions:
  1. the MDP is deterministic
  2. immediate reward values are bounded
     • for some positive constant c, |r(s,a)| < c for all s, a
  3. actions are chosen such that every state-action pair is visited infinitely often
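The three conditions above can be checked empirically on a toy problem. The following is a minimal sketch, not a definitive implementation: the 3-state, 2-action MDP (`NEXT`, `REWARD`), the discount factor `GAMMA = 0.9`, and the helper names are all hypothetical choices made for illustration. Transitions are deterministic (condition 1), rewards are bounded (condition 2), and actions are drawn uniformly at random so that every state-action pair keeps being visited (condition 3). It assumes the deterministic-case update Q(s,a) ← r + γ max Q(s',a').

```python
import random

# Hypothetical deterministic MDP, for illustration only.
# NEXT[s][a] is the single next state reached from s under action a.
NEXT = [[0, 1], [0, 2], [0, 2]]
# REWARD[s][a] is the immediate reward; bounded, |r(s,a)| < c with c = 2.
REWARD = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
GAMMA = 0.9  # assumed discount factor, 0 <= GAMMA < 1
N_STATES, N_ACTIONS = 3, 2


def true_q(sweeps=1000):
    """Reference Q computed by repeated synchronous Bellman backups."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        q = [[REWARD[s][a] + GAMMA * max(q[NEXT[s][a]])
              for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
    return q


def q_learning(steps=20000, seed=0):
    """Q-learning with uniformly random action selection."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    s = 0
    for _ in range(steps):
        a = rng.randrange(N_ACTIONS)   # random exploration: every pair recurs
        s_next = NEXT[s][a]            # deterministic transition
        # Deterministic-case update: overwrite with the one-step backup.
        q[s][a] = REWARD[s][a] + GAMMA * max(q[s_next])
        s = s_next
    return q


def max_gap(q1, q2):
    """Largest absolute difference between two Q tables."""
    return max(abs(q1[s][a] - q2[s][a])
               for s in range(N_STATES) for a in range(N_ACTIONS))


if __name__ == "__main__":
    print("max |Q_learned - Q_true| =", max_gap(q_learning(), true_q()))
```

In this toy MDP every state can reach state 0 again, so random action selection never gets trapped; with an absorbing state, condition 3 would fail and some entries of the table would stop being updated.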