The Learning Task
Markov Decision Process (MDP)
finite set of states S
finite set of actions A
st+1 = d(st, at), reward rt = r(st, at)
d and r are part of the environment and not necessarily known
d and r may be nondeterministic (we begin with assumption of
determinism)
Learn policy p : S à A optimizing some function
of reward over time for MDP