CS 104: Introduction to Computer Science

The Learning Task

• Markov Decision Process (MDP)
  – finite set of states S
  – finite set of actions A
  – at each step t: s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)
• δ and r are part of the environment and are not necessarily known to the agent
• δ and r may be nondeterministic (we begin by assuming determinism)

• Learn a policy π : S → A that optimizes some function of reward over time for the MDP
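The learning task above can be made concrete with a minimal Python sketch of the deterministic case. Everything in it is a hypothetical illustration, not part of the lecture: the four-state corridor, the reward of 100 for entering the goal, the horizon, the names CorridorEnv, rollout, discounted_return and best_policy, and the choice of discounted cumulative reward with γ = 0.9 as the "function of reward over time". The learner reaches δ and r only through step(), mirroring the point that they belong to the environment. It then tries all |A|^|S| deterministic policies by executing each one. This works only for tiny MDPs and stands in for a real learning algorithm rather than being one.

```python
from itertools import product

# Hypothetical toy environment (an assumption for illustration):
# a 1-D corridor with states 0..3; entering state 3 (the goal) pays 100.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GOAL = 3


class CorridorEnv:
    """Deterministic MDP whose delta and r are hidden from the learner."""

    def _delta(self, s, a):
        # Transition function delta(s, a): move one cell, clipped to [0, 3];
        # the goal state is absorbing.
        if s == GOAL:
            return GOAL
        step = 1 if a == "right" else -1
        return min(max(s + step, 0), GOAL)

    def _reward(self, s, a):
        # Reward function r(s, a): 100 for the action that enters the goal.
        if s != GOAL and self._delta(s, a) == GOAL:
            return 100
        return 0

    def step(self, s, a):
        # The only interface the learner uses: act, then observe (s', r).
        return self._delta(s, a), self._reward(s, a)


def rollout(env, policy, s0, horizon):
    """Follow policy (a dict S -> A) from s0 for `horizon` steps."""
    rewards, s = [], s0
    for _ in range(horizon):
        s, r = env.step(s, policy[s])
        rewards.append(r)
    return rewards


def discounted_return(rewards, gamma=0.9):
    # Assumed objective: sum over i of gamma^i * r_{t+i}.
    return sum(gamma ** i * r for i, r in enumerate(rewards))


def best_policy(env, s0=0, horizon=10, gamma=0.9):
    # Brute force over all |A|^|S| deterministic policies pi : S -> A,
    # scoring each one by actually executing it in the environment.
    best, best_value = None, float("-inf")
    for choice in product(ACTIONS, repeat=len(STATES)):
        policy = dict(zip(STATES, choice))
        value = discounted_return(rollout(env, policy, s0, horizon), gamma)
        if value > best_value:
            best, best_value = policy, value
    return best, best_value
```

Calling best_policy(CorridorEnv()) should select "right" in states 0, 1 and 2, giving a discounted return of 0.9² · 100 ≈ 81. The action chosen in the absorbing goal state does not affect the return. Discounting is only one possible objective. It makes reaching the goal sooner worth more than reaching it later.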