• Markov Decision Process (MDP)
  – finite set of states S
  – finite set of actions A
  – s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)  (a concrete sketch follows this list)
• δ and r are part of the environment and not necessarily known
• δ and r may be nondeterministic (we begin with the assumption of determinism)
• Learn a policy π : S → A optimizing some function of reward over time for the MDP
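To make the notation concrete, here is a minimal Python sketch of a deterministic MDP in which δ and r are stored as lookup tables. The three-state chain, the action names, and the reward values are invented for illustration; they are not from the slide.

    # Hypothetical 3-state deterministic MDP (illustration only).
    S = ["s0", "s1", "s2"]          # finite set of states
    A = ["stay", "move"]            # finite set of actions

    # delta(s, a) -> next state (deterministic transition function)
    delta = {
        ("s0", "stay"): "s0", ("s0", "move"): "s1",
        ("s1", "stay"): "s1", ("s1", "move"): "s2",
        ("s2", "stay"): "s2", ("s2", "move"): "s2",   # s2 is absorbing
    }

    # r(s, a) -> immediate reward
    r = {
        ("s0", "stay"): 0.0, ("s0", "move"): 0.0,
        ("s1", "stay"): 0.0, ("s1", "move"): 1.0,     # reward for reaching s2
        ("s2", "stay"): 0.0, ("s2", "move"): 0.0,
    }

    def step(s, a):
        """One environment step: s_{t+1} = delta(s_t, a_t), r_t = r(s_t, a_t)."""
        return delta[(s, a)], r[(s, a)]

From the agent's point of view, δ and r would normally be hidden inside step (they are part of the environment and not necessarily known); they are written out here only to mirror the slide's notation.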
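The slide leaves "some function of reward over time" unspecified; a common choice, assumed here rather than stated on the slide, is the discounted cumulative reward Σ_t γ^t r_t. The sketch below evaluates one fixed policy π under that criterion, reusing the hypothetical step/δ/r defined above.

    def evaluate_policy(pi, s, gamma=0.9, horizon=50):
        """Discounted cumulative reward of policy pi from start state s
        (finite-horizon approximation of sum_t gamma^t * r_t)."""
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = pi[s]                  # policy: maps state -> action
            s, reward = step(s, a)     # environment step via delta and r
            total += discount * reward
            discount *= gamma
        return total

    pi = {"s0": "move", "s1": "move", "s2": "stay"}   # one candidate policy pi : S -> A
    print(evaluate_policy(pi, "s0"))                  # 0.9 for this chain: the single reward arrives at t = 1

Learning then amounts to searching over such policies for the one with the highest value, which is harder when δ and r are unknown or nondeterministic, as noted above.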