CS 371: Introduction to
Artificial Intelligence
The Problem
Set of sensors → set of
environment states |
All states fit in memory |
Set of actions → transition
from state to state, immediate reward |
Set of terminal (absorbing) states |
Wish to learn optimal policy |
mapping of states to actions which
maximizes expected reward (utility) over time |
Often called a "sequential
decision problem" |
Differences from Function
Delayed reward: |
-30K → -30K → -30K → -30K → +??K → … |
Temporal credit assignment problem: |
e.g. the agent played an excellent game except for
one move and lost. Which move was to blame? |
Exploration: |
agent can generate its own training
examples autonomously if it (1) has a model of the world, or (2) can
continuously explore its world. |
exploration (seeing new states) vs.
exploitation (doing what looks best so far) |
The Learning Task
Markov Decision Process (MDP) |
finite set of states S |
finite set of actions A |
$s_{t+1} = \delta(s_t, a_t)$, reward $r_t = r(s_t, a_t)$ |
$\delta$ and $r$ are part of the environment and not necessarily known |
$\delta$ and $r$ may be nondeterministic (we begin with assumption of
determinism) |
Learn policy $\pi : S \to A$
optimizing some function of reward over time for MDP |
Reward Functions to
Discounted cumulative reward: |
$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ |
Finite horizon reward: |
$r_t + r_{t+1} + r_{t+2} + \dots + r_{t+h} = \sum_{i=0}^{h} r_{t+i}$ |
Average reward: |
$\lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}$ |
Assume discounted cumulative reward
chosen as goal. |
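The three reward criteria above can be sketched in a few lines of Python. The function names are illustrative and not from the slides. The average is taken over the observed finite sequence as a stand-in for the limit.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**i * r_{t+i} over the observed (finite) reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))


def finite_horizon_return(rewards: Sequence[float], h: int) -> float:
    """Sum of r_t .. r_{t+h}, i.e. the first h + 1 rewards."""
    return sum(rewards[: h + 1])


def average_reward(rewards: Sequence[float]) -> float:
    """Finite-sample approximation of the limiting average reward."""
    return sum(rewards) / len(rewards)


# Hypothetical reward sequence: nothing for two steps, then +100.
rewards = [0.0, 0.0, 100.0]
print(discounted_return(rewards, gamma=0.9))
print(finite_horizon_return(rewards, h=1))
print(average_reward(rewards))
```

With a discount factor below 1, a reward that arrives later counts for less, which is why the delayed +100 contributes only $\gamma^2 \cdot 100$ to the discounted sum.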
Simple Grid World Example
Wish to find optimal policy $\pi^* = \arg\max_\pi V^\pi(s)$ for all $s$ |
$2 \times 3$ grid world |
actions move N,S,E,W |
goal state G in upper right corner |
reward +100 for actions entering G, 0
otherwise |
actions cannot exit G (absorbing state) |
let discount factor $\gamma = 0.9$ |
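One possible encoding of this grid world as a deterministic MDP is sketched below. States are `(row, col)` pairs with row 0 on top. The slides do not say what happens when a move would leave the grid, so the sketch assumes the agent stays where it is and receives reward 0. All names here are illustrative.

```python
ROWS, COLS = 2, 3
GOAL = (0, 2)  # upper right corner
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def delta(s, a):
    """Deterministic transition function: next state for action a in state s."""
    if s == GOAL:
        return s  # absorbing state: no action can exit G
    row, col = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    if 0 <= row < ROWS and 0 <= col < COLS:
        return (row, col)
    return s  # ASSUMPTION: bumping into the edge leaves the state unchanged


def reward(s, a):
    """+100 for an action that enters G, 0 otherwise."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


print(delta((0, 1), "E"), reward((0, 1), "E"))
```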
Q Learning
Can't learn optimal policy $\pi^*$ directly:
no $\langle s, a \rangle$ training pairs |
However, agent can seek to learn $V^*$,
the value function of the optimal policy: |
$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$ |
recall $V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ |
can only use $V^*$ to choose actions if you have perfect
knowledge of $r$ and $\delta$ (don't necessarily) |
let $Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a))$ |
Then $\pi^*(s) = \arg\max_a Q(s,a)$ |
Learn $Q$ → learn $\pi^*$ without knowledge of $r$ and $\delta$! |
How to Learn Q
How to estimate training values for Q
given sequence of immediate rewards spread out over time? |
Iterative approximation method |
Note: $V^*(s) = \max_{a'} Q(s,a')$ |
Given: $Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a))$ |
$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a),a')$ |
How to Learn Q
$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a),a')$ |
We don't know $r$ and $\delta$, but this
forms the basis for an iterative update based on an observed action transition and reward |
Having taken action $a$ in state $s$, and
finding oneself in state $s'$ with immediate reward $r$:
$Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$ |
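As a worked instance of this update, suppose (hypothetical numbers, not taken from the slides) the agent moves from $s$ to $s'$ with immediate reward $r = 0$, the current table entries for the actions available in $s'$ are 63, 81 and 100, and $\gamma = 0.9$:

```latex
Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')
       = 0 + 0.9 \cdot \max\{63,\ 81,\ 100\}
       = 90
```

Only the single entry $Q(s,a)$ changes; the entries for $s'$ are read but not modified.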
Q Learning Algorithm
For each $s, a$ initialize $Q(s,a) \leftarrow 0$. |
Observe current state $s$. |
Do forever: |
Select an action $a$ and execute it |
Receive immediate reward $r$ |
Observe new state $s'$ |
Update the table entry for $Q(s,a)$:
$Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$ |
$s \leftarrow s'$ |
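The algorithm above can be sketched as a short tabular learner on a $2 \times 3$ grid world with a goal in the upper right corner and $\gamma = 0.9$. Several details are assumptions not fixed by the slides: actions are chosen uniformly at random, a move off the grid leaves the state unchanged, and "do forever" is replaced by a fixed number of episodes that each restart from a random non-goal state once the absorbing goal is reached.

```python
import random
from collections import defaultdict

ROWS, COLS, GOAL, GAMMA = 2, 3, (0, 2), 0.9
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def step(s, a):
    """Environment (unknown to the learner): returns (reward, next_state)."""
    if s == GOAL:
        return 0.0, s
    row, col = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    nxt = (row, col) if 0 <= row < ROWS and 0 <= col < COLS else s
    return (100.0 if nxt == GOAL else 0.0), nxt


def q_learn(episodes=500, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # every Q(s,a) starts at 0
    starts = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != GOAL]
    for _ in range(episodes):
        s = rng.choice(starts)
        while s != GOAL:
            a = rng.choice(list(MOVES))  # select an action and execute it
            r, s2 = step(s, a)           # receive reward, observe new state
            q[(s, a)] = r + GAMMA * max(q[(s2, b)] for b in MOVES)
            s = s2
    return q


def greedy_action(q, s):
    """Policy read off the table: the action maximizing Q(s,a)."""
    return max(MOVES, key=lambda a: q[(s, a)])


q = q_learn()
for state in [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]:
    print(state, greedy_action(q, state))
```

The learner never calls `step` to look ahead; it only sees the reward and next state that result from the action it actually took.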
Grid World Q Learning
For each $s, a$ initialize $Q(s,a) \leftarrow 0$. |
Observe current state $s$. |
Do forever: |
Select an action $a$ and execute it |
Receive immediate reward $r$ |
Observe new state $s'$ |
Update the table entry for $Q(s,a)$:
$Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$ |
$s \leftarrow s'$ |
Conditions for Convergence
Will the learned estimate of Q converge to the true value of Q
(for the optimal policy)? |
Yes, under certain conditions: |
deterministic MDP |
immediate reward values are bounded |
for some positive constant $c$,
$\lvert r(s,a) \rvert < c$ for all $s, a$ |
the agent chooses actions such that it visits
every state-action pair infinitely often |
One extreme: Always choose action that
looks best so far. |
What can potentially happen? |
Early bias towards positive reward
experience |
Bias against exploration for even
better reward |
Another extreme: Always choose actions
randomly with equal probability |
Ignores what it has learned → behavior
remains random |
Want behavior between greedy and random
extremes |
Simulated annealing ideas applicable
here: Start with random behavior to gather information, gradually become
greedy to improve performance. |
Strategies (cont.)
Another possibility: probabilistic
approach |
Choose actions probabilistically such that
there's always a positive probability of choosing each action. |
One example: $P(a_i \mid s) = k^{Q(s,a_i)} / \sum_j k^{Q(s,a_j)}$, with $k > 0$ |
Greater $k$ → greater greedy
exploitation |
Lesser $k$ → greater random
exploration |
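The probabilistic selection rule above can be sketched as follows. The function names and the example Q values are hypothetical. The sketch assumes $k \ge 1$, where $k = 1$ gives uniformly random behavior, and subtracts the largest Q value before exponentiating, which leaves the ratios unchanged but avoids very large intermediate numbers.

```python
import random


def action_probabilities(q_values, k):
    """P(a_i | s) = k**Q(s,a_i) / sum_j k**Q(s,a_j) for a dict {action: Q}."""
    top = max(q_values.values())
    weights = {a: k ** (qv - top) for a, qv in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_action(q_values, k, rng=random):
    """Draw one action according to the probabilities above."""
    probs = action_probabilities(q_values, k)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]


q_for_state = {"N": 0.0, "E": 2.0}  # hypothetical table entries for one state
print(action_probabilities(q_for_state, k=1))   # no preference
print(action_probabilities(q_for_state, k=2))   # mild preference for "E"
print(action_probabilities(q_for_state, k=50))  # nearly greedy
print(sample_action(q_for_state, k=2))
```

Raising `k` over the course of training is one way to move gradually from exploration toward exploitation.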
Speeding Convergence
Updating sequence: the agent starts at a random
state and acts until it reaches the absorbing goal state |
In our grid world example, how many table entries
get updated by the first updating sequence? |
What could we do if we kept the whole
sequence in memory? |
Speeding Convergence
Training using updating sequence in
reverse order speeds convergence. |
Tradeoff: Requires more memory |
Suppose exploration and learning cost
great time/expense. |
Can retrain on same data repeatedly. |
Ratio of old/new update sequences a
matter of relative costs for problem domain. |
Tradeoff: Requires more memory, less
diversity of state/action pairs |
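A small sketch of the effect of replaying a stored updating sequence in reverse order. The episode below is a hypothetical three-step path to the goal in a $2 \times 3$ grid world with reward +100 on entering the goal, $\gamma = 0.9$, and a Q table initialized to 0. The names are illustrative.

```python
from collections import defaultdict

ACTIONS = ("N", "S", "E", "W")
GAMMA = 0.9


def replay(q, episode, reverse=False):
    """Apply the Q update to each stored (s, a, r, s') step, optionally backwards."""
    steps = reversed(episode) if reverse else episode
    for s, a, r, s2 in steps:
        q[(s, a)] = r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
    return q


def count_nonzero(q):
    return sum(1 for v in q.values() if v != 0.0)


# Hypothetical first episode: lower left -> lower middle -> lower right -> goal.
episode = [
    ((1, 0), "E", 0.0, (1, 1)),
    ((1, 1), "E", 0.0, (1, 2)),
    ((1, 2), "N", 100.0, (0, 2)),
]

forward = replay(defaultdict(float), episode)
backward = replay(defaultdict(float), episode, reverse=True)
print(count_nonzero(forward), count_nonzero(backward))
```

In forward order only the final step sees a nonzero reward, so a single entry changes. In reverse order the goal reward is propagated back along the whole path in one pass, at the cost of storing the sequence.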
Nondeterministic Rewards
and Actions
What if $r$ and $\delta$ are nondeterministic
(e.g. roll of a die in a game)? |
$V^\pi(s_t) = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots] = E[\sum_{i=0}^{\infty} \gamma^i r_{t+i}]$ |
$Q(s,a) = E[r(s,a) + \gamma V^*(\delta(s,a))]$
$= E[r(s,a)] + \gamma E[V^*(\delta(s,a))]$
$= E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) V^*(s')$ |
$Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$ |
Note: This is not an update rule. |
Update Rule for
Nondeterministic Case
$Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$ |
Our previous update rule fails to
converge. |
Suppose we start with the correct Q
function. |
Nondeterminism will change Q forever. |
Need to slow change to Q over time: |
Let $\alpha_n = 1/(1 + \mathrm{visits}_n(s,a))$ (including current visit) |
$Q_n(s,a) \leftarrow (1-\alpha_n)(\text{old estimate}) + \alpha_n(\text{new estimate})$ |
$Q_n(s,a) \leftarrow (1-\alpha_n) Q_{n-1}(s,a) + \alpha_n (r + \gamma \max_{a'} Q_{n-1}(s',a'))$ |
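The averaged update for the nondeterministic case can be sketched as a small table class. The class and method names are illustrative. The visit count is incremented before computing the learning rate, matching "including current visit", so the first update to a pair uses a rate of 1/2.

```python
from collections import defaultdict


class TabularQ:
    """Q table with a learning rate that decays per state-action pair."""

    def __init__(self, actions, gamma):
        self.actions = list(actions)
        self.gamma = gamma
        self.q = defaultdict(float)     # Q(s,a), initialized to 0
        self.visits = defaultdict(int)  # visits(s,a)

    def update(self, s, a, r, s2):
        self.visits[(s, a)] += 1  # count the current visit first
        alpha = 1.0 / (1 + self.visits[(s, a)])
        new_estimate = r + self.gamma * max(self.q[(s2, b)] for b in self.actions)
        self.q[(s, a)] = (1 - alpha) * self.q[(s, a)] + alpha * new_estimate
        return self.q[(s, a)]


# Hypothetical noisy experience: the same action from the same state
# pays 10 once and 0 once, ending in a state whose Q values are all 0.
table = TabularQ(actions=["roll", "hold"], gamma=1.0)
print(table.update("s", "roll", 10.0, "end"))
print(table.update("s", "roll", 0.0, "end"))
```

Because each new sample is blended with the old estimate using a shrinking weight, a single unlucky or lucky outcome moves the entry less and less as experience accumulates.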
Conditions for Convergence
If: |
rewards are bounded (as before) |
the training rule is:
$Q_n(s,a) \leftarrow (1-\alpha_n) Q_{n-1}(s,a) + \alpha_n (r + \gamma \max_{a'} Q_{n-1}(s',a'))$ |
$0 \le \gamma < 1$ |
$\sum_{i=1}^{\infty} \alpha_{n(i,s,a)} = \infty$, where $n(i,s,a)$
is the iteration corresponding to the $i$th time $a$ is applied to $s$ |
$\sum_{i=1}^{\infty} (\alpha_{n(i,s,a)})^2 < \infty$ |
Then $Q$ will converge
correctly as $n \to \infty$ with probability 1. |
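As a check, the visit-count learning rate $\alpha_n = 1/(1 + \mathrm{visits}_n(s,a))$ satisfies both summation conditions. Assuming the count includes the current visit, the $i$th time $a$ is applied to $s$ gives a rate of $1/(1+i)$:

```latex
\alpha_{n(i,s,a)} = \frac{1}{1+i}
\quad\Rightarrow\quad
\sum_{i=1}^{\infty} \frac{1}{1+i} = \infty \ \text{(harmonic series)},
\qquad
\sum_{i=1}^{\infty} \frac{1}{(1+i)^2} = \frac{\pi^2}{6} - 1 < \infty
```

A constant learning rate would satisfy the first condition but not the second.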
Example: Pig Dice Game
In turn, players roll a single die as
many times as desired. |
If a player stops before rolling a 1,
the player adds the total of the numbers rolled in sequence to their
cumulative score. |
If a player rolls a 1, the turn ends and the player
receives no score for that turn. |
The goal is to be the first player to
reach a score of 100. |
$Q_n(s,a) \leftarrow (1-\alpha_n) Q_{n-1}(s,a) + \alpha_n (r + \gamma \max_{a'} Q_{n-1}(s',a'))$ |
$\alpha_n = 1/(1 + \mathrm{visits}_n(s,a))$ |
What are the ramifications of one's
choice for $\gamma$? |
How can one best speed convergence
observing real game experience? |
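To generate game experience for Pig, one turn under the rules above can be simulated as in the sketch below. The "hold at N" policy and all names are hypothetical choices for illustration, not something the slides prescribe. If the supplied rolls run out, the sketch treats that as the player stopping.

```python
import random
from typing import Iterable, Iterator


def die_rolls(rng: random.Random) -> Iterator[int]:
    """Endless stream of fair six-sided die rolls."""
    while True:
        yield rng.randint(1, 6)


def play_turn(score: int, hold_at: int, rolls: Iterable[int], goal: int = 100) -> int:
    """Return the player's cumulative score after one turn.

    The player keeps rolling until the turn total reaches hold_at
    or the goal score would be reached, unless a 1 ends the turn first.
    """
    turn_total = 0
    for die in rolls:
        if die == 1:
            return score  # rolled a 1: the turn scores nothing
        turn_total += die
        if turn_total >= hold_at or score + turn_total >= goal:
            break  # player stops and banks the turn total
    return score + turn_total


rng = random.Random(0)
print(play_turn(score=0, hold_at=20, rolls=die_rolls(rng)))
```

A learner would replace the fixed `hold_at` rule with a choice between rolling and holding made from its Q table at each step of the turn.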
Example: Simplified Blackjack
Read the homework description of
simplified Blackjack. |
What would you try for $\gamma$ and why? |
What would be the states and actions of
your nondeterministic MDP? |
Do you expect your optimal policy to
match the strategy described? |
Further Reading: Temporal
Difference Learning
A generalization of Q-learning with
nondeterminism |
Basic idea: If you use a sequence of actions and rewards, you can write a
more general learning rule blending estimates from lookahead of different
depths. |
Tesauro (1995) trained TD-GAMMON on 1.5
million self-generated games to become nearly equal to the top-ranked players
of international backgammon tournaments. |