Set of sensors → set of environment states
All states fit in memory
Set of actions → transition from state to state, immediate reward
Set of terminal (absorbing) states
Wish to learn an optimal policy: a mapping of states to actions which maximizes expected reward (utility) over time
Often called a "sequential decision problem"

Delayed reward:
-30K → -30K → -30K → -30K → +??K → …
Temporal credit assignment problem:
e.g. played an excellent game except for one move and lost. Which move?
Exploration:
agent can generate its own training examples autonomously if it (1) has a model of the world, or (2) can continuously explore its world
exploration (seeing new states) vs. exploitation (doing what looks best so far)

Markov Decision Process (MDP)
finite set of states S
finite set of actions A
s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)
δ and r are part of the environment and not necessarily known
δ and r may be nondeterministic (we begin with the assumption of determinism)
Learn policy π : S → A optimizing some function of reward over time for the MDP

Discounted cumulative reward:
V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i}
Finite horizon reward:
V^π(s_t) = r_t + r_{t+1} + r_{t+2} + … + r_{t+h} = Σ_{i=0}^{h} r_{t+i}
Average reward:
V^π(s_t) = lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
Assume discounted cumulative reward is chosen as the goal.

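As a small illustration of the discounted criterion, here is a minimal Python sketch (the function name and example reward sequence are my own) that truncates the infinite sum at the end of an observed episode:

```python
# Minimal sketch: discounted cumulative reward of a finite reward sequence.
def discounted_return(rewards, gamma=0.9):
    """Return sum_i gamma^i * rewards[i]."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: three zero-reward steps, then +100 for the action entering the goal.
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # 100 * 0.9**3 = 72.9
```
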
Wish to find the optimal policy π* = the π maximizing V^π(s) ∀s
2×3 grid world
actions move N, S, E, W
goal state G in upper right corner
reward +100 for actions entering G, 0 otherwise
actions cannot exit G (absorbing state)
let discount factor γ = 0.9

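To make the example concrete, here is a minimal Python sketch of δ and r for this grid world; the coordinate scheme (row 1 is the top row) and the convention that off-grid moves leave the state unchanged are my assumptions, not part of the original example:

```python
# Sketch of the 2x3 grid world as a deterministic MDP.
GOAL = (1, 2)                                   # upper-right corner (row 1 = top row)
STATES = [(row, col) for row in range(2) for col in range(3)]
ACTIONS = {"N": (1, 0), "S": (-1, 0), "E": (0, 1), "W": (0, -1)}

def delta(s, a):
    """Deterministic transition; off-grid moves stay put, G is absorbing."""
    if s == GOAL:
        return s
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def reward(s, a):
    """+100 for an action that enters G, 0 otherwise."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0
```
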
Can't learn the optimal policy π* directly – no <s,a> training pairs
However, the agent can seek to learn V*, the value function of the optimal policy:
π*(s) = the a maximizing [r(s,a) + γ V*(δ(s,a))]
recall V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …
but choosing actions via V* this way requires perfect knowledge of r and δ (which the agent doesn't necessarily have)
let Q(s,a) = r(s,a) + γ V*(δ(s,a))
Then π*(s) = the a maximizing Q(s,a)
Learn Q → learn π* without knowledge of r and δ!

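A one-line sketch of that last point: given a learned Q table (here a plain Python dict keyed by (state, action), e.g. from the grid-world sketch above), the greedy policy is read off without ever consulting r or δ:

```python
# Sketch: extract the greedy policy from a Q table keyed by (state, action).
def greedy_policy(Q, states, actions):
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```
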
How to estimate training values for Q given a sequence of immediate rewards spread out over time?
Iterative approximation method
Note: V*(s) = max_{a'} Q(s,a')
Given: Q(s,a) = r(s,a) + γ V*(δ(s,a))
Therefore: Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')

Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')
We don't know r and δ, but this forms the basis for an iterative update based on an observed action transition and reward.
Having taken action a in state s, and finding oneself in state s' with immediate reward r:
Q(s,a) ← r + γ max_{a'} Q(s', a')

For each s,a initialize Q(s,a) ← 0.
Observe current state s.
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe new state s'
    Update the table entry for Q(s,a): Q(s,a) ← r + γ max_{a'} Q(s', a')
    s ← s'

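Putting the algorithm together with the grid-world sketch from earlier, here is a minimal deterministic Q-learning loop (the purely random action selection and the restart of each episode at a random non-goal state are my choices for illustration):

```python
import random

# Assumes STATES, ACTIONS, GOAL, delta, reward from the grid-world sketch above.
def q_learning(episodes=1000, gamma=0.9):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = random.choice([s for s in STATES if s != GOAL])
        while s != GOAL:                      # episode ends at the absorbing goal
            a = random.choice(list(ACTIONS))  # pure exploration, for simplicity
            r, s_next = reward(s, a), delta(s, a)
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next
    return Q

Q = q_learning()
print(max(Q[((0, 0), a)] for a in ACTIONS))   # converges to 100 * 0.9**2 = 81
```
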
Will Q converge to the true Q function (that of the optimal policy)?
Yes, under certain conditions:
deterministic MDP
immediate reward values are bounded: for some positive constant c, |r(s,a)| < c for all s,a
actions are chosen so that every state-action pair is visited infinitely often

One extreme: always choose the action that looks best so far.
What can potentially happen?
Early bias towards positive reward experience
Bias against exploring for even better reward
Another extreme: always choose actions randomly with equal probability
Ignores what it has learned → behavior remains random
Want behavior between the greedy and random extremes
Simulated annealing ideas are applicable here: start with random behavior to gather information, then gradually become greedy to improve performance (one such schedule is sketched below).

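One common way to realize this kind of schedule (the ε-greedy form and the particular decay constants are my assumptions, not from the slides) is to act greedily except with a probability ε that decays over time:

```python
import random

# Sketch: epsilon-greedy selection with an exploration rate that decays per episode.
def select_action(Q, s, actions, episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    eps = max(eps_min, eps_start * (decay ** episode))   # anneal random -> greedy
    if random.random() < eps:
        return random.choice(list(actions))               # explore
    return max(actions, key=lambda a: Q[(s, a)])          # exploit
```
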
Another possibility: a probabilistic approach
Choose actions probabilistically such that there is always a positive probability of choosing each action.
One example (sketched in code below): P(a_i|s) = k^Q(s,a_i) / Σ_j k^Q(s,a_j)
Greater k → greater greedy exploitation
Smaller k → greater random exploration

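A direct sketch of that selection rule (names are mine; k > 1 is assumed so that larger Q values get proportionally higher probability):

```python
import random

# Sketch: choose an action with probability proportional to k ** Q(s, a).
def probabilistic_choice(Q, s, actions, k=2.0):
    acts = list(actions)
    weights = [k ** Q[(s, a)] for a in acts]   # P(a_i|s) proportional to k^Q(s, a_i)
    return random.choices(acts, weights=weights)[0]
```

With k close to 1 the choice is nearly uniform; increasing k concentrates probability on the currently best-looking action, matching the two bullets above.
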
Updating sequence: start at a random state and act until reaching the absorbing goal state.
For our grid world example, how many Q table entries get updated by that first updating sequence?
What could we do if we kept the whole sequence in memory?

Training on an updating sequence in reverse order speeds convergence (sketched below).
Tradeoff: requires more memory
Suppose exploration and learning cost great time/expense.
Can retrain on the same data repeatedly.
The ratio of old to new update sequences is a matter of the relative costs in the problem domain.
Tradeoff: requires more memory, less diversity of state/action pairs

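A sketch of the reverse-order idea, assuming each episode is stored as a list of (s, a, r, s') transitions; replaying it backwards lets the goal reward propagate along the whole path in a single pass, and stored episodes can be replayed again later in place of costly new experience:

```python
# Sketch: replay a stored episode's transitions in reverse order.
# episode is a list of (s, a, r, s_next) tuples in the order they occurred.
def replay_backwards(Q, episode, actions, gamma=0.9):
    for s, a, r, s_next in reversed(episode):
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
```
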
What if r and δ are nondeterministic (e.g. the roll of a die in a game)?
V^π(s_t) = E[r_t + γ r_{t+1} + γ² r_{t+2} + …] = E[Σ_{i=0}^{∞} γ^i r_{t+i}]
Q(s,a) = E[r(s,a) + γ V*(δ(s,a))]
       = E[r(s,a)] + γ E[V*(δ(s,a))]
       = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) V*(s')
Therefore: Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
Note: This is not an update rule.

Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
Our previous update rule fails to converge.
Suppose we start with the correct Q function.
Nondeterminism will change Q forever.
Need to slow the change to Q over time:
Let α_n = 1/(1 + visits_n(s,a))   (including the current visit)
Q_n(s,a) ← (1 - α_n)(old estimate) + α_n(new estimate)
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))

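A sketch of this rule as code, keeping a per-(s,a) visit counter to compute α_n (the data structures and names are my own):

```python
from collections import defaultdict

# Sketch: Q-learning update for a nondeterministic MDP with a decaying learning rate.
Q = defaultdict(float)       # Q[(s, a)], defaults to 0
visits = defaultdict(int)    # visit counts per (s, a)

def q_update(s, a, r, s_next, actions, gamma=0.9):
    visits[(s, a)] += 1                        # visits_n includes the current visit
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```
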
If:
rewards are bounded (as before)
the training rule is: Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
0 ≤ γ < 1
Σ_{i=1}^{∞} α_{n(i,s,a)} = ∞, where n(i,s,a) is the iteration corresponding to the ith time action a is applied to state s
Σ_{i=1}^{∞} (α_{n(i,s,a)})² < ∞
Then Q_n will converge to the correct Q as n → ∞ with probability 1.

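As a quick check that the earlier choice α_n = 1/(1 + visits_n(s,a)) meets the last two conditions: on the ith visit to a given pair (s,a) it gives α = 1/(1 + i), and Σ_{i=1}^{∞} 1/(1+i) diverges (harmonic series) while Σ_{i=1}^{∞} 1/(1+i)² converges, so both requirements are satisfied.
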
In turn, players roll a single die as many times as they desire.
If a player stops before rolling a 1, the player adds the total of the numbers rolled in that sequence to their cumulative score.
If a player rolls a 1, the player receives no score for that sequence.
The goal is to be the first player to reach a score of 100.
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
α_n = 1/(1 + visits_n(s,a))
What are the ramifications of one's choice of γ?
How can one best speed convergence when observing real game experience?

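One plausible way to cast this game as a nondeterministic MDP (this particular encoding is an assumption for illustration, not given above): state = (my banked score, opponent's banked score, current turn total), actions = roll or hold.

```python
import random

# Sketch: one possible encoding of the dice game as a nondeterministic MDP.
# State: (my_score, opp_score, turn_total); actions: "roll" or "hold".
def step(state, action):
    """One decision for the learning player; returns (next_state, reward).
    The opponent's turn and terminal bookkeeping are omitted for brevity."""
    my_score, opp_score, turn_total = state
    if action == "hold":
        new_score = my_score + turn_total
        return (new_score, opp_score, 0), (1 if new_score >= 100 else 0)
    roll = random.randint(1, 6)
    if roll == 1:                               # rolling a 1 forfeits the turn total
        return (my_score, opp_score, 0), 0
    return (my_score, opp_score, turn_total + roll), 0
```
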
Read the homework description of simplified Blackjack.
What would you try for γ, and why?
What would be the states and actions of your nondeterministic MDP?
Do you expect your optimal policy to match the strategy described?

A generalization of Q-learning with nondeterminism
Basic idea: if you use a sequence of actions and rewards, you can write a more general learning rule blending estimates from lookahead of different depths.
Tesauro (1995) trained TD-GAMMON on 1.5 million self-generated games to become nearly equal to the top-ranked players in international backgammon tournaments.

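Concretely, the blend is usually written in the TD(λ) style (this formulation is the standard one, not quoted from the slide; λ ∈ [0,1] is the mixing parameter, and λ = 0 recovers the one-step Q-learning target):

Q^(1)(s_t, a_t) = r_t + γ max_{a'} Q(s_{t+1}, a')
Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n max_{a'} Q(s_{t+n}, a')
Q^λ(s_t, a_t) = (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + … ]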