CS 104: Introduction to Computer Science

The Learning Task

•Markov Decision Process (MDP)

–finite set of states S

–finite set of actions A

–s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)

•δ and r are part of the environment and not necessarily known

•δ and r may be nondeterministic (we begin with assumption of determinism)

•Learn policy π : S → A optimizing some function of reward over time for MDP
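
The definitions above can be made concrete with a small sketch. Everything specific here is an assumption for illustration, not part of the slide: the three-state world, the action names, the reward values, and the choice of discounted cumulative reward (discount factor gamma) as the "function of reward over time". The sketch assumes a deterministic transition function and a deterministic reward function, and it evaluates a fixed policy rather than learning one.

```python
from typing import Dict, List, Tuple

State = str
Action = str

# Hypothetical toy MDP; names and numbers are illustrative only.
STATES: List[State] = ["s0", "s1", "goal"]   # finite set of states S
ACTIONS: List[Action] = ["left", "right"]    # finite set of actions A

# Deterministic transition function: s_{t+1} = delta(s_t, a_t).
DELTA: Dict[Tuple[State, Action], State] = {
    ("s0", "left"): "s0",
    ("s0", "right"): "s1",
    ("s1", "left"): "s0",
    ("s1", "right"): "goal",
    ("goal", "left"): "goal",    # "goal" is absorbing in this toy example
    ("goal", "right"): "goal",
}

# Deterministic reward function: r_t = r(s_t, a_t).
# Only the step that enters "goal" is rewarded; all other pairs give 0.
REWARD: Dict[Tuple[State, Action], float] = {("s1", "right"): 100.0}


def delta(s: State, a: Action) -> State:
    """Return the next state for taking action a in state s."""
    return DELTA[(s, a)]


def reward(s: State, a: Action) -> float:
    """Return the immediate reward for taking action a in state s."""
    return REWARD.get((s, a), 0.0)


def discounted_return(policy: Dict[State, Action], start: State,
                      gamma: float = 0.9, steps: int = 10) -> float:
    """Follow a policy pi : S -> A for a fixed number of steps and
    sum the rewards, weighting the reward at step t by gamma ** t."""
    s = start
    total = 0.0
    for t in range(steps):
        a = policy[s]                      # a_t = pi(s_t)
        total += (gamma ** t) * reward(s, a)
        s = delta(s, a)                    # s_{t+1} = delta(s_t, a_t)
    return total


# Two candidate policies, each a mapping from every state to an action.
always_right: Dict[State, Action] = {s: "right" for s in STATES}
always_left: Dict[State, Action] = {s: "left" for s in STATES}

if __name__ == "__main__":
    print(discounted_return(always_right, "s0"))
    print(discounted_return(always_left, "s0"))
```

In this toy world the policy that always moves right reaches the rewarded transition on its second step, while the policy that always moves left never does, so the first scores higher under the discounted measure. A learner would have to discover this without being given DELTA or REWARD directly.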