• A generalization of Q-learning with nondeterminism
• Basic idea: If you use a sequence of actions and rewards, you can write a more general learning rule blending estimates from lookahead of different depths.
• Tesauro (1995) trained TD-GAMMON on 1.5 million self-generated games to become nearly equal to the top-ranked players of international backgammon tournaments.
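The "blending estimates from lookahead of different depths" idea can be sketched in a few lines. This is a minimal illustration, not the slides' own formulation: it assumes one recorded episode, where `rewards[k]` is the reward received after step k and `values[k]` is the current bootstrap estimate at step k (for Q-learning this could be the maximum of the estimated Q over actions in state k). The names `n_step_return`, `lambda_return`, `gamma`, and `lam` are hypothetical. The weighted blend is commonly called the λ-return in temporal-difference learning.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Estimate from step t using n observed rewards, then bootstrapping.

    Assumes the episode ends after the last reward, so the value of the
    terminal state is taken to be 0.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = 0.0
    # Discounted sum of the rewards actually observed.
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    # If the episode has not ended, fall back on the current estimate.
    if end < T:
        g += gamma ** (end - t) * values[end]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """Blend lookahead estimates of every depth with weights (1 - lam) * lam**(n - 1).

    The weight left over once the episode ends goes to the full-episode
    return, so the weights always sum to 1.
    """
    horizon = len(rewards) - t  # steps remaining from t
    total = 0.0
    for n in range(1, horizon):
        weight = (1 - lam) * lam ** (n - 1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
    # Remaining weight on the deepest (full-episode) lookahead.
    total += lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return total


# Hypothetical three-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]
shallow = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.0)
deep = lambda_return(rewards, values, t=0, gamma=0.9, lam=1.0)
blended = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.5)
```

With `lam = 0` the blend reduces to the one-step lookahead used by the ordinary Q-learning rule, and with `lam = 1` only the observed rewards are used; values in between mix the two.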