•A generalization of Q-learning to nondeterministic (stochastic) environments
•Basic idea: Given a sequence of actions and rewards, you can write a more general learning rule that blends return estimates from lookaheads of different depths.
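One common way to blend lookaheads of different depths is the TD(λ) λ-return, which mixes n-step return estimates with geometric weights (1 − λ)λ^(n−1). The sketch below is an illustration of that idea, not the specific rule from these slides; the function names and signatures are assumptions for the example.

```python
# Sketch of blending lookahead estimates of different depths
# (TD(lambda)-style lambda-return; an illustrative assumption, not
# necessarily the exact rule the slide refers to).

def n_step_return(rewards, values, t, n, gamma):
    """n-step lookahead estimate of the return from time t:
    n observed rewards, then bootstrap from the value estimate."""
    T = len(rewards)
    g = 0.0
    for k in range(n):
        if t + k >= T:
            return g
        g += gamma**k * rewards[t + k]
    if t + n < len(values):
        g += gamma**n * values[t + n]  # bootstrap at depth n
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """Blend n-step estimates of all depths with weights
    (1 - lam) * lam**(n - 1); the full-depth (Monte Carlo)
    return receives the remaining weight lam**(T - t - 1)."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    total += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total
```

Setting λ = 0 recovers the one-step (ordinary TD/Q-learning) estimate, while λ = 1 recovers the full Monte Carlo return, so the blend interpolates between shallow and deep lookahead.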
•Tesauro (1995) trained TD-GAMMON on 1.5 million self-generated games, reaching a level of play nearly equal to that of the top-ranked players in international backgammon tournaments.