CS 371: Introduction to Artificial Intelligence

The Problem
- Set of sensors → set of environment states
- All states fit in memory
- Set of actions → transition from state to state, immediate reward
- Set of terminal (absorbing) states
- Wish to learn optimal policy
  - mapping of states to actions which maximizes expected reward (utility) over time
- Often called a "sequential decision problem"

Differences from Function Approximation
- Delayed reward:
  - -30K → -30K → -30K → -30K → +??K → …
- Temporal credit assignment problem:
  - e.g. played an excellent game except for one move and lost. Which move?
- Exploration:
  - the agent can generate its own training examples autonomously if it (1) has a model of the world, or (2) can continuously explore its world
  - exploration (seeing new states) vs. exploitation (doing what looks best so far)

The Learning Task
- Markov Decision Process (MDP)
  - finite set of states S
  - finite set of actions A
  - s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)
  - δ and r are part of the environment and not necessarily known
  - δ and r may be nondeterministic (we begin with the assumption of determinism)
- Learn policy π : S → A optimizing some function of reward over time for the MDP

Reward Functions to Optimize
- Discounted cumulative reward:
  - V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i}
- Finite horizon reward:
  - V^π(s_t) = r_t + r_{t+1} + r_{t+2} + … + r_{t+h} = Σ_{i=0}^{h} r_{t+i}
- Average reward:
  - V^π(s_t) = lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
- Assume discounted cumulative reward is chosen as the goal.

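As a quick illustration of these definitions, here is a minimal Python sketch (not from the slides) that computes each return for a finite observed reward sequence; the reward list and γ = 0.9 are illustrative values.

```python
# Minimal sketch: the three return definitions, for a finite reward sequence
# (the infinite discounted sum is truncated at the observed horizon).
def discounted_return(rewards, gamma=0.9):
    """Sum_i gamma^i * r_{t+i} over the observed rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def finite_horizon_return(rewards, h):
    """Undiscounted sum of rewards r_t .. r_{t+h}."""
    return sum(rewards[:h + 1])

def average_return(rewards):
    """Mean reward over the observed horizon."""
    return sum(rewards) / len(rewards)

rewards = [0, 0, 0, 100]             # three zero-reward steps, then +100
print(discounted_return(rewards))    # 0.9**3 * 100 = 72.9
```
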
Simple Grid World Example
- Wish to find the optimal policy π* = π maximizing V^π(s) ∀s
- 2×3 grid world
- actions move N, S, E, W
- goal state G in the upper right corner
- reward +100 for actions entering G, 0 otherwise
- actions cannot exit G (absorbing state)
- let discount factor γ = 0.9

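A minimal Python sketch of this grid world, for reference in the later examples. The (row, col) state labeling, and the choice that bumping into a wall leaves the state unchanged, are assumptions made for illustration; the slide does not specify them.

```python
# 2x3 deterministic grid world sketch; G is the absorbing goal state.
GAMMA = 0.9
ROWS, COLS = 2, 3
GOAL = (0, 2)                        # G in the upper right corner
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def delta(state, action):
    """Deterministic transition function delta(s, a)."""
    if state == GOAL:                # actions cannot exit G
        return state
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return (nr, nc)
    return state                     # assumed: hitting a wall leaves s unchanged

def reward(state, action):
    """r(s, a): +100 for actions entering G, 0 otherwise."""
    return 100 if state != GOAL and delta(state, action) == GOAL else 0
```
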
Q Learning
- Can't learn the optimal policy π* directly – no <s,a> training pairs
- However, the agent can seek to learn V*, the value function of the optimal policy:
  - π*(s) = the a maximizing [r(s,a) + γ V*(δ(s,a))]
  - recall V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + …
  - but V* alone lets us choose actions this way only with perfect knowledge of r and δ (which we don't necessarily have)
- Let Q(s,a) = r(s,a) + γ V*(δ(s,a))
- Then π*(s) = the a maximizing Q(s,a)
- Learn Q → learn π* without knowledge of r and δ!

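A minimal sketch of that last point: given a learned Q table (here a dictionary keyed by (state, action) pairs, an assumed representation), the optimal policy is just the greedy argmax over actions, with no reference to r or δ.

```python
# Extract pi*(s) = argmax_a Q(s, a) from a Q table stored as {(s, a): value}.
def greedy_policy(Q, actions):
    states = {s for (s, _) in Q}
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```
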
How to Learn Q
- How to estimate training values for Q given a sequence of immediate rewards spread out over time?
- Iterative approximation method
  - Note: V*(s) = max_{a'} Q(s,a')
  - Given: Q(s,a) = r(s,a) + γ V*(δ(s,a))
  - Therefore: Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')

How to Learn Q
- Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')
- We don't know r and δ, but this forms the basis for an iterative update based on an observed action transition and reward
- Having taken action a in state s, and finding oneself in state s' with immediate reward r:
  - Q(s,a) ← r + γ max_{a'} Q(s',a')

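As a concrete trace of this update in the 2×3 grid world (γ = 0.9, table initialized to 0): the first time the agent takes an action that enters G it observes r = 100, and since every entry at G is still 0 the update gives Q(s,a) ← 100 + 0.9·0 = 100. On a later pass, the action that moves into that state from a neighbor gets Q ← 0 + 0.9·100 = 90, and one more step back gets 0 + 0.9·90 = 81. Estimates propagate backward from the goal, one update at a time.
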
Q Learning Algorithm
- For each s, a initialize Q(s,a) ← 0.
- Observe current state s.
- Do forever:
  - Select an action a and execute it
  - Receive immediate reward r
  - Observe new state s'
  - Update the table entry for Q(s,a):
    Q(s,a) ← r + γ max_{a'} Q(s',a')
  - s ← s'

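A minimal Python sketch of this loop. The environment's delta and reward functions are passed in as callables that the agent only queries by acting (it does not inspect them); the "do forever" loop is replaced by a fixed number of episodes, and purely random action selection stands in for whichever exploration strategy is used. The commented usage line assumes the grid-world sketch from earlier.

```python
import random

def q_learning(states, actions, delta, reward, gamma, goal, episodes=1000):
    """Tabular Q learning against a simulated environment (delta, reward)."""
    Q = {(s, a): 0.0 for s in states for a in actions}       # for each s,a: Q <- 0
    for _ in range(episodes):
        s = random.choice([x for x in states if x != goal])  # observe current state
        while s != goal:
            a = random.choice(actions)        # select an action a and execute it
            r = reward(s, a)                  # receive immediate reward r
            s2 = delta(s, a)                  # observe new state s'
            Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            s = s2                            # s <- s'
    return Q

# e.g. on the 2x3 grid world sketched earlier:
# states = [(r, c) for r in range(ROWS) for c in range(COLS)]
# Q = q_learning(states, list(ACTIONS), delta, reward, GAMMA, GOAL)
```
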
Grid World Q Learning Example
- Apply the Q learning algorithm above to the 2×3 grid world with γ = 0.9, tracing the table updates as the agent acts.

Conditions for Convergence
- Will the learned estimate of Q converge to the true Q (for the optimal policy)?
- Yes, under certain conditions:
  - deterministic MDP
  - immediate reward values are bounded
    - for some positive constant c, |r(s,a)| < c for all s, a
  - the agent chooses actions such that it visits every state-action pair infinitely often

Experimentation Strategies
- One extreme: always choose the action that looks best so far.
  - What can potentially happen?
  - Early bias towards positive reward experience
  - Bias against exploration for even better reward
- Another extreme: always choose actions randomly with equal probability
  - Ignores what it has learned → behavior remains random
- Want behavior between the greedy and random extremes
- Simulated annealing ideas applicable here: start with random behavior to gather information, gradually become greedy to improve performance.

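One way to realize this annealing idea is ε-greedy selection with a decaying ε; a minimal sketch follows (the schedule constants are illustrative, not from the slides).

```python
import random

def select_action(Q, s, actions, episode,
                  eps_start=1.0, eps_min=0.05, decay=0.995):
    eps = max(eps_min, eps_start * decay ** episode)  # anneal toward greedy
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit best-so-far action
```
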
Experimentation Strategies (cont.)
- Another possibility: a probabilistic approach
- Choose actions probabilistically such that there is always a positive probability of choosing each action.
- One example: P(a_i | s) = k^{Q(s,a_i)} / Σ_j k^{Q(s,a_j)}
  - Greater k → greater greedy exploitation
  - Smaller k → greater random exploration

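A minimal sketch of this selection rule, assuming the Q table is a dictionary keyed by (state, action) pairs and k > 1; larger k pushes probability mass toward the currently best-looking action.

```python
import random

def probabilistic_action(Q, s, actions, k=2.0):
    """Choose a_i with probability k**Q(s,a_i) / sum_j k**Q(s,a_j)."""
    weights = [k ** Q[(s, a)] for a in actions]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(actions, weights=probs, k=1)[0]
```
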
Speeding Convergence
- Updating sequence: start at a random state and act until reaching the absorbing goal state
- For the first updating sequence in our grid world example, how many Q-table entries get updated?
- What could we do if we kept the whole sequence in memory?

Speeding Convergence (cont.)
- Training using the updating sequence in reverse order speeds convergence.
  - Tradeoff: requires more memory
- Suppose exploration and learning cost great time/expense.
  - Can retrain on the same data repeatedly.
  - Ratio of old/new update sequences a matter of relative costs for the problem domain.
  - Tradeoff: requires more memory, less diversity of state/action pairs

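A minimal sketch of both ideas, assuming each episode has been stored as a list of (s, a, r, s') steps and the Q table is a dictionary keyed by (state, action); the function names are illustrative.

```python
def update(Q, s, a, r, s2, actions, gamma=0.9):
    Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)

def replay_reversed(Q, episode, actions, gamma=0.9):
    """Apply updates in reverse order, so values near the goal propagate first."""
    for s, a, r, s2 in reversed(episode):
        update(Q, s, a, r, s2, actions, gamma)

def retrain(Q, stored_episodes, actions, passes=5, gamma=0.9):
    """Replay remembered episodes repeatedly when new experience is expensive."""
    for _ in range(passes):
        for episode in stored_episodes:
            replay_reversed(Q, episode, actions, gamma)
```
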
Nondeterministic Rewards and Actions
- What if r and δ are nondeterministic (e.g. the roll of a die in a game)?
- V^π(s_t) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + …] = E[Σ_{i=0}^{∞} γ^i r_{t+i}]
- Q(s,a) = E[r(s,a) + γ V*(δ(s,a))]
         = E[r(s,a)] + γ E[V*(δ(s,a))]
         = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) V*(s')
- Therefore: Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
- Note: this is not an update rule.

Update Rule for Nondeterministic Case
- Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
- Our previous update rule fails to converge.
  - Suppose we start with the correct Q function.
  - Nondeterminism will change Q forever.
- Need to slow the change to Q over time:
  - Let α_n = 1/(1 + visits_n(s,a)) (including the current visit)
  - Q_n(s,a) ← (1 - α_n)(old estimate) + α_n(new estimate)
  - Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))

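A minimal sketch of this update, keeping the visit counts alongside the Q table (both stored as dictionaries here, an assumed representation).

```python
from collections import defaultdict

Q = defaultdict(float)       # current Q estimates, defaulting to 0
visits = defaultdict(int)    # visits_n(s, a)

def q_update_nondeterministic(s, a, r, s2, actions, gamma=0.9):
    visits[(s, a)] += 1                              # count includes current visit
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```
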
Conditions for Convergence
- If:
  - rewards are bounded (as before)
  - the training rule is:
    Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
  - 0 ≤ γ < 1
  - Σ_{i=1}^{∞} α_{n(i,s,a)} = ∞, where n(i,s,a) is the iteration corresponding to the i-th time action a is applied to state s
  - Σ_{i=1}^{∞} (α_{n(i,s,a)})^2 < ∞
- Then Q will converge correctly as n → ∞ with probability 1.

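Note that the earlier choice α_n = 1/(1 + visits_n(s,a)) meets the last two conditions: over successive visits to a fixed (s,a) pair it is 1/(1+i), and Σ 1/(1+i) diverges while Σ 1/(1+i)^2 converges.
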
Example: Pig Dice Game
- In turn, players roll a single die as many times as they desire.
- If a player stops before rolling a 1, the player adds the total of the numbers rolled in that sequence to their cumulative score.
- If a player rolls a 1, the player receives no score for that turn.
- The goal is to be the first player to reach a score of 100.
- Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
- α_n = 1/(1 + visits_n(s,a))
- What are the ramifications of one's choice for γ?
- How can one best speed convergence when observing real game experience?

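A minimal sketch of one way to cast Pig as a nondeterministic MDP for Q learning. The state encoding (my banked score, opponent's score, current turn total), the two actions, and the reward of +1 only when the agent banks a winning score are assumptions made for illustration; the slide leaves these choices open, and the opponent's turns are omitted here.

```python
import random

ACTIONS = ("roll", "hold")

def pig_step(state, action):
    """One nondeterministic transition; returns (next_state, reward, turn_over)."""
    my_score, opp_score, turn_total = state
    if action == "hold":
        banked = my_score + turn_total
        r = 1 if banked >= 100 else 0                 # assumed reward: +1 for reaching 100
        return (banked, opp_score, 0), r, True
    die = random.randint(1, 6)                        # nondeterministic outcome
    if die == 1:
        return (my_score, opp_score, 0), 0, True      # lose the turn total, no score
    return (my_score, opp_score, turn_total + die), 0, False
```

With a sparse terminal reward like this, the choice of γ controls how visible the distant win is to early roll/hold decisions, which is one way to approach the γ question above.
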
Example: Simplified Blackjack
- Read the homework description of simplified Blackjack.
- What would you try for γ, and why?
- What would be the states and actions of your nondeterministic MDP?
- Do you expect your optimal policy to match the strategy described?

Further Reading: Temporal Difference Learning
- A generalization of Q-learning (with nondeterminism)
- Basic idea: if you use a sequence of actions and rewards, you can write a more general learning rule blending estimates from lookaheads of different depths.
- Tesauro (1995) trained TD-GAMMON on 1.5 million self-generated games to become nearly equal to the top-ranked players in international backgammon tournaments.