• Delayed reward:
    -30K → -30K → -30K → -30K → +??K → …
• Temporal credit assignment problem:
    – e.g. played an excellent game except for one move, and lost. Which move?
• Exploration:
    – agent can generate its own training examples autonomously if it (1) has a model of the world, or (2) can continuously explore its world.
    – exploration (seeing new states) vs. exploitation (doing what looks best so far)
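
The exploration vs. exploitation trade-off can be illustrated with a small sketch. The slide does not name a specific strategy; the epsilon-greedy rule below is a common choice swapped in here for illustration only. The multi-armed bandit setting, the function names `epsilon_greedy` and `run_bandit`, the Gaussian reward noise, and all parameter values are assumptions, not taken from the slide.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Return an action index.

    With probability epsilon, explore (pick any action uniformly at random);
    otherwise exploit (pick the action whose current estimate is highest).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # exploration: try something new
    # exploitation: do what looks best so far (ties go to the lowest index)
    return max(range(len(estimates)), key=lambda a: estimates[a])


def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Hypothetical bandit: each action pays a noisy reward around its true mean.

    The agent generates its own training examples by acting, then updates a
    running-average estimate of each action's value.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)  # assumed noise model
        counts[action] += 1
        # incremental mean: new_estimate = old + (sample - old) / n
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


if __name__ == "__main__":
    estimates, counts = run_bandit([0.0, 1.0, 0.2])
    print("value estimates:", [round(v, 2) for v in estimates])
    print("times each action was chosen:", counts)
```

Setting `epsilon=0` gives a purely exploiting agent that can lock onto whichever action first looked good; setting `epsilon=1` gives a purely exploring agent that never uses what it has learned. Values in between trade one off against the other.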