Conditions for Convergence
Will the learned estimate of Q converge to the true Q function
(for the optimal policy)?
Yes, under certain conditions:
1. deterministic MDP
2. immediate reward values are bounded
   for some positive constant c, |r(s,a)| < c for all s, a
3. actions are chosen such that every state-action pair
   is visited infinitely often
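
The three conditions can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not something taken from the slide: a hypothetical 4-state chain with deterministic transitions (condition 1), rewards of 0 or 1 (condition 2), uniformly random action selection so that every state-action pair keeps being visited (condition 3), a discount factor GAMMA = 0.9, and the standard deterministic-case update Q(s,a) <- r + GAMMA * max_a' Q(s',a'). The learned table is compared against a reference table obtained by repeatedly sweeping the Bellman optimality equation.

```python
import random

GAMMA = 0.9                  # assumed discount factor (not given in the source)
N_STATES, N_ACTIONS = 4, 2   # hypothetical toy chain MDP


def step(s, a):
    """Deterministic transition: action 0 moves left, action 1 moves right."""
    nxt = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0  # bounded: |r| < 2
    return nxt, reward


def true_q(tol=1e-12):
    """Reference Q: sweep the Bellman optimality equation until it stops changing."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    while True:
        delta = 0.0
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                nxt, r = step(s, a)
                new = r + GAMMA * max(q[nxt])
                delta = max(delta, abs(new - q[s][a]))
                q[s][a] = new
        if delta < tol:
            return q


def q_learning(n_steps=20000, seed=0):
    """Tabular Q-learning along one long trajectory with random exploration."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(N_ACTIONS)       # every pair keeps getting visited
        nxt, r = step(s, a)
        q[s][a] = r + GAMMA * max(q[nxt])  # deterministic-case update
        s = nxt
    return q


def max_error(q1, q2):
    """Largest absolute difference between two Q tables."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


if __name__ == "__main__":
    print("max |Q_learned - Q_true| =", max_error(q_learning(), true_q()))
```

Replacing the random action choice with a purely greedy one would break condition 3 in this sketch: some state-action pairs would stop being visited, and their entries could stay at their initial values.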