© 2000 Todd Neller.  A.I.M.A. text figures © 1995 Prentice Hall.  Used by permission.
The Learning Task
•Markov Decision Process (MDP)
–finite set of states S
–finite set of actions A
–s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)
•δ and r are part of the environment and not necessarily known
•δ and r may be nondeterministic (we begin with assumption of determinism)
•Learn policy π : S → A optimizing some function of reward over time for MDP
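
The definitions above can be made concrete with a minimal sketch of a deterministic MDP. Everything specific here is an assumption made for illustration and is not from the slides: the three-state chain, the action names, the reward of 100, and the choice of a discounted sum (with gamma = 0.9) as the "function of reward over time". The learner would normally not have direct access to delta and reward; they are written out only to show what the environment computes.

```python
from typing import Dict, List, Tuple

State = str
Action = str

# Hypothetical finite state and action sets (S and A).
STATES: List[State] = ["s0", "s1", "goal"]
ACTIONS: List[Action] = ["left", "right"]


def delta(s: State, a: Action) -> State:
    """Deterministic transition function: s_{t+1} = delta(s_t, a_t)."""
    if s == "goal":
        return "goal"  # assumed absorbing state
    i = STATES.index(s)
    if a == "right":
        return STATES[i + 1]
    return STATES[max(i - 1, 0)]


def reward(s: State, a: Action) -> float:
    """Deterministic reward function: r_t = r(s_t, a_t)."""
    # Assumed: only the move that enters the goal is rewarded.
    return 100.0 if (s == "s1" and a == "right") else 0.0


def rollout(policy: Dict[State, Action], s: State,
            steps: int) -> List[Tuple[State, Action, float]]:
    """Follow a policy pi : S -> A for a fixed number of steps."""
    trajectory = []
    for _ in range(steps):
        a = policy[s]
        trajectory.append((s, a, reward(s, a)))
        s = delta(s, a)
    return trajectory


def discounted_return(trajectory: List[Tuple[State, Action, float]],
                      gamma: float = 0.9) -> float:
    """One possible objective: sum over t of gamma^t * r_t."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))


# A policy is just a mapping from states to actions.
policy: Dict[State, Action] = {"s0": "right", "s1": "right", "goal": "right"}
traj = rollout(policy, "s0", steps=3)
print(traj)
print(discounted_return(traj))
```

Representing the policy as a plain dictionary keeps the mapping π : S → A explicit; a learning algorithm would search over such mappings using only observed (state, action, reward, next state) experience.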