CS 371: Introduction to Artificial Intelligence
Reinforcement Learning

The Problem
Set of sensors → set of environment states
All states fit in memory
Set of actions → transitions from state to state, each with an immediate reward
Set of terminal (absorbing) states
Wish to learn optimal policy
a mapping from states to actions that maximizes expected reward (utility) over time
Often called a "sequential decision problem"

Differences from Function Approximation
Delayed reward:
-30K → -30K → -30K → -30K → +??K → …
Temporal credit assignment problem:
e.g., you played an excellent game except for one move and lost.  Which move was at fault?
Exploration:
agent can generate its own training examples autonomously if it (1) has a model of the world, or (2) can continuously explore its world.
exploration (seeing new states) vs. exploitation (doing what looks best so far)

The Learning Task
Markov Decision Process (MDP)
finite set of states S
finite set of actions A
s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)
δ and r are part of the environment and not necessarily known to the agent
δ and r may be nondeterministic (we begin with the assumption of determinism)
Learn policy π : S → A optimizing some function of reward over time for the MDP
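One way to read this definition concretely is as a container of the pieces above. The following Python sketch is illustrative only (the names MDP, delta, and reward are my own choices, not part of the notes):

    from typing import Callable, List, NamedTuple

    State = str
    Action = str

    class MDP(NamedTuple):
        # A deterministic MDP as defined above: finite S and A, transition delta, reward r.
        states: List[State]                       # finite set S
        actions: List[Action]                     # finite set A
        delta: Callable[[State, Action], State]   # s_{t+1} = delta(s_t, a_t)
        reward: Callable[[State, Action], float]  # r_t = r(s_t, a_t)
        terminal: List[State]                     # absorbing (terminal) states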

Reward Functions to Optimize
Discounted cumulative reward:
V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i}
Finite horizon reward:
V^π(s_t) = r_t + r_{t+1} + r_{t+2} + … + r_{t+h} = Σ_{i=0}^{h} r_{t+i}
Average reward:
V^π(s_t) = lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
Assume discounted cumulative reward is chosen as the goal.
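To make these criteria concrete, here is a small illustrative Python sketch (function names are my own) that evaluates each criterion on a finite list of observed rewards; the infinite discounted sum is truncated at the end of the list, which is a reasonable approximation when γ < 1:

    def discounted_return(rewards, gamma):
        # Discounted cumulative reward: sum over i of gamma^i * r_{t+i},
        # truncated at the end of the observed reward list.
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    def finite_horizon_return(rewards, h):
        # Finite horizon reward: r_t + r_{t+1} + ... + r_{t+h}.
        return sum(rewards[: h + 1])

    def average_reward(rewards):
        # Average reward over the observed horizon (the limit as h grows).
        return sum(rewards) / len(rewards)

    # Example: the delayed-reward sequence from earlier, with gamma = 0.9.
    print(discounted_return([-30, -30, -30, -30, 100], gamma=0.9))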

Simple Grid World Example
Wish to find optimal policy π* = the π maximizing V^π(s) for all s
2×3 grid world
actions move N, S, E, W
goal state G in upper right corner
reward +100 for actions entering G, 0 otherwise
actions cannot exit G (absorbing state)
let discount factor γ = 0.9
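A minimal encoding of this grid world as a deterministic MDP might look like the sketch below (an illustration, not a prescribed representation; in particular, it assumes that an action which would leave the grid leaves the agent in place):

    # Cells are named (row, col) with row 0 on top; the goal G is the upper-right cell.
    GOAL = (0, 2)
    STATES = [(row, col) for row in range(2) for col in range(3)]
    ACTIONS = ["N", "S", "E", "W"]
    MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

    def delta(state, action):
        # Deterministic transition; G is absorbing, and (by assumption here)
        # an action that would leave the grid leaves the agent where it is.
        if state == GOAL:
            return state
        row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
        return (row, col) if (row, col) in STATES else state

    def reward(state, action):
        # +100 for any action that enters G, 0 otherwise.
        return 100.0 if state != GOAL and delta(state, action) == GOAL else 0.0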

Q Learning
Can't learn the optimal policy π* directly: no <s, a> training pairs are available
However, the agent can seek to learn V*, the value function of the optimal policy:
π*(s) = the a maximizing [r(s,a) + γ V*(δ(s,a))]
recall V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + …
but using V* to choose actions requires perfect knowledge of r and δ (which the agent doesn't necessarily have)
let Q(s,a) = r(s,a) + γ V*(δ(s,a))
Then π*(s) = the a maximizing Q(s,a)
Learn Q → learn π* without knowledge of r and δ!
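Extracting π* from a learned Q table is just an argmax over actions for each state, as in this small sketch (names are illustrative; Q is assumed to be a table keyed by (s, a) pairs):

    def greedy_policy(Q, states, actions):
        # pi*(s) = the action a maximizing Q(s, a), for a table Q keyed by (s, a) pairs.
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}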

How to Learn Q
How do we estimate training values for Q given only a sequence of immediate rewards spread out over time?
Iterative approximation method
Note: V*(s) = max_{a'} Q(s,a')
Given: Q(s,a) = r(s,a) + γ V*(δ(s,a))
Therefore:
Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')

How to Learn Q
Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')
We don't know r and δ, but this forms the basis for an iterative update based on an observed action transition and reward
Having taken action a in state s, and finding oneself in state s' with immediate reward r:
Q(s,a) ← r + γ max_{a'} Q(s',a')

Q Learning Algorithm
For each s, a initialize Q(s,a) ← 0.
Observe current state s.
Do forever:
Select an action a and execute it
Receive immediate reward r
Observe new state s'
Update the table entry for Q(s,a):
Q(s,a) ← r + γ max_{a'} Q(s',a')
s ← s'
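A direct transcription of this algorithm into Python might look like the following sketch (illustrative, not a prescribed implementation): delta and reward are environment functions such as those in the grid-world sketch above, a fixed episode budget stands in for "do forever", and uniformly random action selection stands in for whichever exploration strategy is chosen.

    import random
    from collections import defaultdict

    def q_learning(states, actions, delta, reward, terminal,
                   gamma=0.9, episodes=1000, max_steps=100):
        # Tabular Q learning for a deterministic MDP.
        Q = defaultdict(float)                      # for each s, a: Q(s, a) <- 0
        for _ in range(episodes):                   # "do forever", truncated to a fixed budget
            s = random.choice(states)               # observe current state s
            for _ in range(max_steps):
                if s in terminal:
                    break
                a = random.choice(actions)          # select an action a and execute it
                r = reward(s, a)                    # receive immediate reward r
                s_next = delta(s, a)                # observe new state s'
                # Q(s, a) <- r + gamma * max over a' of Q(s', a')
                Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                s = s_next                          # s <- s'
        return Q

Run against the grid-world sketch above (with terminal=[GOAL]), the learned values should approach 100 for actions entering G, then 90, 81, … for actions progressively farther away.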

Grid World Q Learning Example
Apply the Q-learning algorithm above to the 2×3 grid world: γ = 0.9, reward +100 for actions entering G, 0 otherwise.
Starting from Q(s,a) = 0 everywhere, the first nonzero update occurs when an action enters G, setting that entry to 100.
Later episodes propagate the value backward by a factor of γ per step: 0.9·100 = 90 for an action reaching a state adjacent to G, then 0.9·90 = 81, and so on.

Conditions for Convergence
Will our learned table Q converge to the true Q function (the one for the optimal policy)?
Yes, under certain conditions:
deterministic MDP
immediate reward values are bounded
for some positive constant c, |r(s,a)| < c for all s, a
the agent chooses actions such that every state-action pair is visited infinitely often

Experimentation Strategies
One extreme: Always choose action that looks best so far.
What can potentially happen?
Early bias towards positive reward experience
Bias against exploration for even better reward
Another extreme: Always choose actions randomly with equal probability
Ignores what it has learned → behavior remains random
Want behavior between greedy and random extremes
Simulated annealing ideas applicable here: Start with random behavior to gather information, gradually become greedy to improve performance.

Experimentation Strategies (cont.)
Another possibility: probabilistic approach
Choose actions probabilistically so that there is always a positive probability of choosing each action.
One example: P(a_i | s) = k^{Q(s,a_i)} / Σ_j k^{Q(s,a_j)}
Larger k → more greedy exploitation
Smaller k → more random exploration
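A sketch of this selection rule (illustrative; it assumes k > 0, with k > 1 giving the "larger k → more exploitation" behavior described above, and Q a table keyed by (s, a) pairs, e.g. a defaultdict(float)):

    import random

    def select_action(Q, s, actions, k=2.0):
        # Choose action a_i with probability proportional to k ** Q(s, a_i).
        # Since k ** Q is always positive, every action keeps a nonzero probability.
        weights = [k ** Q[(s, a)] for a in actions]
        # (the k=1 below is random.choices' sample count, not the k parameter above)
        return random.choices(actions, weights=weights, k=1)[0]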

Speeding Convergence
Updating sequence (episode): start at a random state and act until reaching the absorbing goal state
For our grid world example, how many Q-table entries change value during the first such sequence?
What could we do if we kept the whole sequence in memory?

Speeding Convergence (cont.)
Training on an updating sequence in reverse order speeds convergence.
Tradeoff: requires more memory (the whole sequence must be stored).
Suppose exploration and learning cost great time/expense.
Can retrain on the same stored sequences repeatedly.
The ratio of old to new update sequences is a matter of the relative costs in the problem domain.
Tradeoff: requires more memory, less diversity of state/action pairs.
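A sketch of replaying a stored updating sequence in reverse order (illustrative names; episode holds the observed (s, a, r, s') steps, and Q is a table keyed by (s, a) pairs, e.g. a defaultdict(float)):

    def replay_backwards(Q, episode, actions, gamma=0.9):
        # episode is the stored updating sequence: a list of observed (s, a, r, s') tuples.
        # Applying the update from the last step to the first lets the goal reward
        # propagate through the whole sequence in a single pass.
        for s, a, r, s_next in reversed(episode):
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        return Q

The same stored episodes can be replayed repeatedly, at the memory and diversity costs noted above.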

Nondeterministic Rewards and Actions
What if r and δ are nondeterministic (e.g. the roll of a die in a game)?
V^π(s_t) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + …] = E[Σ_{i=0}^{∞} γ^i r_{t+i}]
Q(s,a) = E[r(s,a) + γ V*(δ(s,a))]
= E[r(s,a)] + γ E[V*(δ(s,a))]
= E[r(s,a)] + γ Σ_{s'} P(s'|s,a) V*(s')
Therefore:
Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
Note: This is not an update rule.

Update Rule for Nondeterministic Case
Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
Our previous update rule fails to converge.
Suppose we start with the correct Q function.
Nondeterminism will change Q forever.
Need to slow the change to Q over time:
Let α_n = 1 / (1 + visits_n(s,a)) (count includes the current visit)
Q_n(s,a) ← (1 - α_n)(old estimate) + α_n(new estimate)
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
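A sketch of this decayed update (illustrative; Q and visits are tables keyed by (s, a) pairs):

    from collections import defaultdict

    Q = defaultdict(float)       # Q_0(s, a) = 0 for all s, a
    visits = defaultdict(int)    # visits_n(s, a)

    def nondeterministic_update(s, a, r, s_next, actions, gamma=0.9):
        # Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n (r + gamma max_a' Q_{n-1}(s',a'))
        visits[(s, a)] += 1                              # count includes the current visit
        alpha = 1.0 / (1 + visits[(s, a)])
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target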

Conditions for Convergence
If:
rewards are bounded (as before)
the training rule is:
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
0 ≤ γ < 1
Σ_{i=1}^{∞} α_{n(i,s,a)} = ∞, where n(i,s,a) is the iteration corresponding to the ith time action a is applied in state s
Σ_{i=1}^{∞} (α_{n(i,s,a)})^2 < ∞
Then Q will converge to the correct values as n → ∞ with probability 1.

Example: Pig Dice Game
On each turn, a player rolls a single die as many times as desired.
If the player stops before rolling a 1, the total of the numbers rolled that turn is added to the player's cumulative score.
If the player rolls a 1, the turn ends and the player receives no score for that turn.
The goal is to be the first player to reach a score of 100.
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
α_n = 1 / (1 + visits_n(s,a))
What are the ramifications of one's choice of γ?
How can one best speed convergence when learning from real game experience?
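One possible, purely illustrative encoding of Pig for Q learning: let the state be (my score, opponent's score, current turn total) and the actions be "roll" and "hold". The sketch below implements only the environment step for one player and uses a win-based reward of +1 for reaching 100, which is a design choice of this sketch rather than something specified in the notes:

    import random

    ACTIONS = ["roll", "hold"]

    def step(state, action):
        # One player's view of a Pig turn.  state = (my_score, opponent_score, turn_total).
        # Returns (new_state, reward); reward is +1 only for reaching 100 (an illustrative
        # choice), and the opponent's turns are not modeled here.
        my_score, opp_score, turn_total = state
        if action == "hold":
            my_score += turn_total
            return (my_score, opp_score, 0), (1.0 if my_score >= 100 else 0.0)
        roll = random.randint(1, 6)          # the nondeterministic part of the MDP
        if roll == 1:                        # turn ends with no score for this turn
            return (my_score, opp_score, 0), 0.0
        return (my_score, opp_score, turn_total + roll), 0.0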

Example: Simplified Blackjack
Read the homework description of simplified Blackjack.
What would you try for g and why?
What would be the states and actions of your nondeterministic MDP?
Do you expect your optimal policy to match the strategy described?

Further Reading: Temporal Difference Learning
A generalization of Q-learning with nondeterminism
Basic idea:  If you use a sequence of actions and rewards, you can write a more general learning rule blending estimates from lookahead of different depths.
Tesauro (1995) trained TD-GAMMON on 1.5 million self-generated games to become nearly equal to the top-ranked players of international backgammon tournaments.
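One common way to write such a blended estimate (a sketch following the TD(λ) idea; λ, with 0 ≤ λ ≤ 1, is a blending parameter not defined in these notes):
Q^λ(s_t, a_t) = (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + … ]
where Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + … + γ^(n-1) r_{t+n-1} + γ^n max_{a'} Q(s_{t+n}, a') is the estimate obtained from a lookahead of depth n.
Setting λ = 0 recovers the one-step Q-learning estimate used above.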