–-30K → -30K → -30K → -30K → +??K → …
•Temporal credit assignment problem:
–e.g. played an excellent game except for one move, and lost. Which move was to blame?
–agent can generate its own training examples autonomously if it (1) has a model of the world, or (2) can continuously explore its world.
–exploration (seeing new states) vs. exploitation (doing what looks best so far)
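The exploration vs. exploitation trade-off above can be illustrated with an epsilon-greedy rule, one common strategy that these notes do not themselves name. The sketch below assumes a simple multi-armed bandit setting with Gaussian rewards; the function names (`epsilon_greedy`, `run_bandit`) and parameters (`true_means`, `epsilon`, `steps`, `seed`) are hypothetical and chosen only for illustration.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng=random):
    """Return an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploration: try a uniformly random action, possibly a rarely seen one.
        return rng.randrange(len(value_estimates))
    # Exploitation: take the action that looks best so far (ties go to the lowest index).
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Simulate an agent on a bandit; assumed setting, not taken from the notes."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # running average reward per action
    counts = [0] * len(true_means)       # how often each action was tried
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)  # noisy reward from the world
        counts[action] += 1
        # Incremental mean update of the estimate for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


if __name__ == "__main__":
    estimates, counts = run_bandit([0.2, 0.5, 1.0])
    print("estimated values:", [round(v, 2) for v in estimates])
    print("times each action was tried:", counts)
```

With `epsilon=0` the agent only exploits and may lock onto an early, mistaken favorite; with `epsilon=1` it only explores and never uses what it has learned. Values in between trade the two off.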