•A generalization of Q-learning to nondeterministic (stochastic) environments
•Basic idea: Given a sequence of actions and rewards, you can write a more general learning rule that blends return estimates from lookaheads of different depths.
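One common way to blend lookaheads of different depths is the TD(λ) λ-return, which mixes n-step return estimates with geometric weights (1 − λ)λ^(n−1). The sketch below is an illustration of that idea, not the specific rule from these slides; the function names and signatures are assumptions for the example.

```python
# Sketch of blending lookahead estimates of different depths
# (TD(lambda)-style lambda-return; an illustrative assumption, not
# necessarily the exact rule the slide refers to).

def n_step_return(rewards, values, t, n, gamma):
    """n-step lookahead estimate of the return from time t:
    n observed rewards, then bootstrap from the value estimate."""
    T = len(rewards)
    g = 0.0
    for k in range(n):
        if t + k >= T:
            return g
        g += gamma**k * rewards[t + k]
    if t + n < len(values):
        g += gamma**n * values[t + n]  # bootstrap at depth n
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """Blend n-step estimates of all depths with weights
    (1 - lam) * lam**(n - 1); the full-depth (Monte Carlo)
    return receives the remaining weight lam**(T - t - 1)."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    total += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total
```

Setting λ = 0 recovers the one-step (ordinary TD/Q-learning) estimate, while λ = 1 recovers the full Monte Carlo return, so the blend interpolates between shallow and deep lookahead.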
•Tesauro (1995) trained TD-GAMMON on 1.5 million self-generated games, reaching a level of play nearly equal to that of the top-ranked players in international backgammon tournaments.