The Problem
Set of sensors à set of environment states
All states fit in memory
Set of actions à transition from state to state,
immediate reward
Set of terminal (absorbing) states
Wish to learn optimal policy
mapping of states to actions which maximizes expected
reward (utility) over time
Often called a "sequential decision problem"