Further Reading: Temporal Difference Learning
A generalization of Q-learning with nondeterminism
Basic idea: if you use a whole sequence of actions and
rewards rather than a single step, you can write a more
general learning rule that blends value estimates from
lookaheads of different depths (see the sketch below).
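One standard way to make this precise is the TD(λ) blend (Sutton, 1988). The notation below, with n-step estimates $Q^{(n)}$ built from the current approximation $\hat{Q}$, is assumed for illustration and is not given on the slide:

\[
Q^{(1)}(s_t, a_t) = r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)
\]
\[
Q^{(n)}(s_t, a_t) = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a} \hat{Q}(s_{t+n}, a)
\]
\[
Q^{\lambda}(s_t, a_t) = (1 - \lambda)\left[ Q^{(1)}(s_t, a_t) + \lambda\, Q^{(2)}(s_t, a_t) + \lambda^{2} Q^{(3)}(s_t, a_t) + \cdots \right]
\]

Setting λ = 0 recovers the ordinary one-step Q-learning target, while λ = 1 uses only the observed rewards over the whole sequence; intermediate values of λ blend lookaheads of all depths.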
Tesauro (1995) trained TD-GAMMON on 1.5
million self-generated games, reaching a level nearly
equal to the top-ranked players in international
backgammon tournaments.