•Will the learned estimate of Q converge to the true value of Q (for the optimal policy)?
•Yes, under certain conditions:
2. Immediate reward values are bounded
•for some positive constant c, |r(s,a)| < c for all s, a
3. The agent chooses actions such that it visits every state-action pair infinitely often
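To make the conditions concrete, here is a minimal sketch that runs tabular Q-learning on a small toy problem and compares the result with the true Q computed by repeated Bellman backups. Everything about the toy problem is an assumption made for illustration and is not from the slides: the 4-state chain, the deterministic `step` function, the rewards in {0, 1} (so |r(s,a)| is bounded), and the discount factor 0.9. Actions are chosen uniformly at random, which is one simple way to keep visiting every state-action pair.

```python
import random

GAMMA = 0.9                   # discount factor (assumed for this sketch)
N_STATES, N_ACTIONS = 4, 2    # hypothetical toy chain: states 0..3, actions 0/1


def step(s, a):
    """Deterministic toy dynamics: action 0 moves left, action 1 moves right.

    The reward is 1 when the move lands on the last state, else 0,
    so rewards are bounded.
    """
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r


def q_learning(n_steps=20000, seed=0):
    """Tabular Q-learning with uniformly random action selection."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(N_ACTIONS)          # keeps exploring every pair
        s_next, r = step(s, a)
        # Update rule for deterministic transitions: overwrite the estimate.
        q[s][a] = r + GAMMA * max(q[s_next])
        s = s_next
    return q


def q_star(n_sweeps=1000):
    """Reference Q values from repeated synchronous Bellman backups."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_sweeps):
        new_q = []
        for s in range(N_STATES):
            row = []
            for a in range(N_ACTIONS):
                s_next, r = step(s, a)
                row.append(r + GAMMA * max(q[s_next]))
            new_q.append(row)
        q = new_q
    return q


if __name__ == "__main__":
    learned, reference = q_learning(), q_star()
    gap = max(abs(learned[s][a] - reference[s][a])
              for s in range(N_STATES) for a in range(N_ACTIONS))
    print("largest gap between learned and reference Q:", gap)
```

If the random action choice is replaced by a fixed action, some state-action pairs are never visited and their entries stay at the initial value, which illustrates why the visiting condition matters.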