•Markov Decision Process (MDP)
–s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)
•δ and r are part of the environment and not necessarily known
•δ and r may be nondeterministic (we begin by assuming determinism)
•Learn a policy π : S → A that optimizes some function of reward over time for the MDP (a minimal sketch follows below)
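To make the pieces concrete, here is a minimal Python sketch of the deterministic case. The tiny line-world, its reward of +1 for stepping into the rightmost state, and the discount factor γ = 0.9 are illustrative assumptions, not part of the slide: delta and r stand in for the environment's δ and r, pi is one fixed policy π : S → A, and discounted cumulative reward is shown as one common choice of "function of reward over time".

    from typing import Callable

    State = int      # states 0..4 arranged in a line (assumed toy example)
    Action = str     # "left" or "right"

    def delta(s: State, a: Action) -> State:
        """Deterministic transition function: s_{t+1} = δ(s_t, a_t)."""
        return min(4, s + 1) if a == "right" else max(0, s - 1)

    def r(s: State, a: Action) -> float:
        """Reward function r(s_t, a_t): +1 for stepping into the rightmost state."""
        return 1.0 if (a == "right" and s == 3) else 0.0

    def pi(s: State) -> Action:
        """A hand-coded policy π : S → A; learning would choose this mapping."""
        return "right"

    def discounted_return(s0: State, policy: Callable[[State], Action],
                          gamma: float = 0.9, horizon: int = 10) -> float:
        """One common objective over time: sum_t gamma^t * r_t along the trajectory."""
        s, total = s0, 0.0
        for t in range(horizon):
            a = policy(s)
            total += (gamma ** t) * r(s, a)
            s = delta(s, a)
        return total

    print(discounted_return(0, pi))  # value of the always-right policy from state 0

In a learning setting the agent would not be handed delta and r; it would only observe the successor states and rewards they produce while interacting, and would adjust π to increase the discounted return.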