Differences from Function
Delayed reward:
-30K → -30K → -30K → -30K → +??K →
Temporal credit assignment problem:
e.g., the agent played an excellent game except for one
move, and lost. Which move was to blame?
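One common heuristic for spreading a delayed outcome back over earlier moves is the discounted return, sketched below. It is only an illustration: the function name `discounted_returns`, the discount factor `gamma = 0.9`, and the five-move reward sequence are assumptions, not part of the original slide. Note that it does not by itself identify the single bad move; it only assigns more of the final outcome to moves closer to the end.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step t (hypothetical helper)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the already-discounted future.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Assumed example: four moves with no immediate reward, then a loss (-1).
rewards = [0.0, 0.0, 0.0, 0.0, -1.0]
returns = discounted_returns(rewards, gamma=0.9)
for t, g in enumerate(returns):
    print(f"move {t}: credited return {g:.4f}")
```

Every move receives some share of the final loss, which is exactly why the question "which move?" remains hard: the signal alone cannot single out the one mistake.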
An agent can generate its own training examples
autonomously if it (1) has a model of the world, or (2)
can continuously explore its world.
exploration (seeing new states) vs. exploitation (doing
what looks best so far)
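The trade-off above can be illustrated with an epsilon-greedy rule, one simple and widely used strategy; the slide itself does not name a method. In this minimal sketch, the function name `epsilon_greedy`, the parameter `epsilon`, and the value estimates are all assumptions made for illustration.

```python
import random

def epsilon_greedy(value_estimates, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploration: try a uniformly random action to see new states.
        return rng.randrange(len(value_estimates))
    # Exploitation: take the action that looks best so far.
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

# Assumed value estimates for three actions.
estimates = [0.2, 0.8, 0.5]
greedy_action = epsilon_greedy(estimates, epsilon=0.0)   # never explores
mixed_action = epsilon_greedy(estimates, epsilon=0.1)    # explores about 10% of the time
```

With `epsilon = 0.0` the agent always exploits and can miss better actions it has never tried; with `epsilon = 1.0` it always explores and never uses what it has learned.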