|
|
|
|
|
|
|
|
|
|
|
• |
Training
using updating sequence in reverse
|
|
|
order
speeds convergence.
|
|
|
• |
Tradeoff:
Requires more memory
|
|
|
• |
Suppose
exploration and learning cost great
|
|
|
time/expense.
|
|
|
• |
Can
retrain on same data repeatedly.
|
|
|
• |
Ratio
of old/new update sequences a matter of
|
|
|
relative
costs for problem domain.
|
|
|
• |
Tradeoff:
Requires more memory, less diversity of
|
|
state/action
pairs
|
|