CS 371: Introduction to Artificial Intelligence
Reinforcement Learning

The Problem
Set of sensors → set of environment states
All states fit in memory
Set of actions → transitions from state to state, each with an immediate reward
Set of terminal (absorbing) states
Wish to learn optimal policy
a mapping from states to actions that maximizes expected reward (utility) over time
Often called a "sequential decision problem"

Differences from Function Approximation
Delayed reward:
-30K → -30K → -30K → -30K → +??K → …
Temporal credit assignment problem:
e.g., you played an excellent game except for one move and lost.  Which move was at fault?
Exploration:
agent can generate its own training examples autonomously if it (1) has a model of the world, or (2) can continuously explore its world.
exploration (seeing new states) vs. exploitation (doing what looks best so far)

The Learning Task
Markov Decision Process (MDP)
finite set of states S
finite set of actions A
s_{t+1} = δ(s_t, a_t), reward r_t = r(s_t, a_t)
δ and r are part of the environment and not necessarily known to the agent
δ and r may be nondeterministic (we begin with the assumption of determinism)
Learn policy π : S → A optimizing some function of reward over time for the MDP
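One way to read this definition concretely is as a container of the pieces above. The following Python sketch is illustrative only (the names MDP, delta, and reward are my own choices, not part of the notes):

    from typing import Callable, List, NamedTuple

    State = str
    Action = str

    class MDP(NamedTuple):
        # A deterministic MDP as defined above: finite S and A, transition delta, reward r.
        states: List[State]                       # finite set S
        actions: List[Action]                     # finite set A
        delta: Callable[[State, Action], State]   # s_{t+1} = delta(s_t, a_t)
        reward: Callable[[State, Action], float]  # r_t = r(s_t, a_t)
        terminal: List[State]                     # absorbing (terminal) states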

Reward Functions to Optimize
Discounted cumulative reward:
V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i}
Finite horizon reward:
V^π(s_t) = r_t + r_{t+1} + r_{t+2} + … + r_{t+h} = Σ_{i=0}^{h} r_{t+i}
Average reward:
V^π(s_t) = lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
Assume discounted cumulative reward is chosen as the goal.
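To make these criteria concrete, here is a small illustrative Python sketch (function names are my own) that evaluates each criterion on a finite list of observed rewards; the infinite discounted sum is truncated at the end of the list, which is a reasonable approximation when γ < 1:

    def discounted_return(rewards, gamma):
        # Discounted cumulative reward: sum over i of gamma^i * r_{t+i},
        # truncated at the end of the observed reward list.
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    def finite_horizon_return(rewards, h):
        # Finite horizon reward: r_t + r_{t+1} + ... + r_{t+h}.
        return sum(rewards[: h + 1])

    def average_reward(rewards):
        # Average reward over the observed horizon (the limit as h grows).
        return sum(rewards) / len(rewards)

    # Example: the delayed-reward sequence from earlier, with gamma = 0.9.
    print(discounted_return([-30, -30, -30, -30, 100], gamma=0.9))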

Simple Grid World Example
Wish to find optimal policy π* = the π maximizing V^π(s) for all s
2×3 grid world
actions move N, S, E, W
goal state G in upper right corner
reward +100 for actions entering G, 0 otherwise
actions cannot exit G (absorbing state)
let discount factor γ = 0.9
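A minimal encoding of this grid world as a deterministic MDP might look like the sketch below (an illustration, not a prescribed representation; in particular, it assumes that an action which would leave the grid leaves the agent in place):

    # Cells are named (row, col) with row 0 on top; the goal G is the upper-right cell.
    GOAL = (0, 2)
    STATES = [(row, col) for row in range(2) for col in range(3)]
    ACTIONS = ["N", "S", "E", "W"]
    MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

    def delta(state, action):
        # Deterministic transition; G is absorbing, and (by assumption here)
        # an action that would leave the grid leaves the agent where it is.
        if state == GOAL:
            return state
        row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
        return (row, col) if (row, col) in STATES else state

    def reward(state, action):
        # +100 for any action that enters G, 0 otherwise.
        return 100.0 if state != GOAL and delta(state, action) == GOAL else 0.0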

Q Learning
Can't learn the optimal policy π* directly: no <s, a> training pairs are available
However, the agent can seek to learn V*, the value function of the optimal policy:
π*(s) = the a maximizing [r(s,a) + γ V*(δ(s,a))]
recall V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + …
but using V* to choose actions requires perfect knowledge of r and δ (which the agent doesn't necessarily have)
let Q(s,a) = r(s,a) + γ V*(δ(s,a))
Then π*(s) = the a maximizing Q(s,a)
Learn Q → learn π* without knowledge of r and δ!
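Extracting π* from a learned Q table is just an argmax over actions for each state, as in this small sketch (names are illustrative; Q is assumed to be a table keyed by (s, a) pairs):

    def greedy_policy(Q, states, actions):
        # pi*(s) = the action a maximizing Q(s, a), for a table Q keyed by (s, a) pairs.
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}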

How to Learn Q
How do we estimate training values for Q given only a sequence of immediate rewards spread out over time?
Iterative approximation method
Note: V*(s) = max_{a'} Q(s,a')
Given: Q(s,a) = r(s,a) + γ V*(δ(s,a))
Therefore:
Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')

How to Learn Q
Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')
We don't know r and δ, but this forms the basis for an iterative update based on an observed action transition and reward
Having taken action a in state s, and finding oneself in state s' with immediate reward r:
Q(s,a) ← r + γ max_{a'} Q(s',a')

Q Learning Algorithm
For each s, a initialize Q(s,a) ← 0.
Observe current state s.
Do forever:
Select an action a and execute it
Receive immediate reward r
Observe new state s'
Update the table entry for Q(s,a):
Q(s,a) ← r + γ max_{a'} Q(s',a')
s ← s'
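A direct transcription of this algorithm into Python might look like the following sketch (illustrative, not a prescribed implementation): delta and reward are environment functions such as those in the grid-world sketch above, a fixed episode budget stands in for "do forever", and uniformly random action selection stands in for whichever exploration strategy is chosen.

    import random
    from collections import defaultdict

    def q_learning(states, actions, delta, reward, terminal,
                   gamma=0.9, episodes=1000, max_steps=100):
        # Tabular Q learning for a deterministic MDP.
        Q = defaultdict(float)                      # for each s, a: Q(s, a) <- 0
        for _ in range(episodes):                   # "do forever", truncated to a fixed budget
            s = random.choice(states)               # observe current state s
            for _ in range(max_steps):
                if s in terminal:
                    break
                a = random.choice(actions)          # select an action a and execute it
                r = reward(s, a)                    # receive immediate reward r
                s_next = delta(s, a)                # observe new state s'
                # Q(s, a) <- r + gamma * max over a' of Q(s', a')
                Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                s = s_next                          # s <- s'
        return Q

Run against the grid-world sketch above (with terminal=[GOAL]), the learned values should approach 100 for actions entering G, then 90, 81, … for actions progressively farther away.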

Grid World Q Learning Example
Apply the Q-learning algorithm above to the 2×3 grid world: γ = 0.9, reward +100 for actions entering G, 0 otherwise.
Starting from Q(s,a) = 0 everywhere, the first nonzero update occurs when an action enters G, setting that entry to 100.
Later episodes propagate the value backward by a factor of γ per step: 0.9·100 = 90 for an action reaching a state adjacent to G, then 0.9·90 = 81, and so on.

Conditions for Convergence
Will our learned table Q converge to the true Q function (the one for the optimal policy)?
Yes, under certain conditions:
deterministic MDP
immediate reward values are bounded
for some positive constant c, |r(s,a)| < c for all s, a
the agent chooses actions such that every state-action pair is visited infinitely often

Experimentation Strategies
One extreme: Always choose action that looks best so far.
What can potentially happen?
Early bias towards positive reward experience
Bias against exploration for even better reward
Another extreme: Always choose actions randomly with equal probability
Ignores what it has learned → behavior remains random
Want behavior between greedy and random extremes
Simulated annealing ideas applicable here: Start with random behavior to gather information, gradually become greedy to improve performance.

Experimentation Strategies (cont.)
Another possibility: probabilistic approach
Choose actions probabilistically so that there is always a positive probability of choosing each action.
One example: P(a_i | s) = k^{Q(s,a_i)} / Σ_j k^{Q(s,a_j)}
Larger k → more greedy exploitation
Smaller k → more random exploration
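A sketch of this selection rule (illustrative; it assumes k > 0, with k > 1 giving the "larger k → more exploitation" behavior described above, and Q a table keyed by (s, a) pairs, e.g. a defaultdict(float)):

    import random

    def select_action(Q, s, actions, k=2.0):
        # Choose action a_i with probability proportional to k ** Q(s, a_i).
        # Since k ** Q is always positive, every action keeps a nonzero probability.
        weights = [k ** Q[(s, a)] for a in actions]
        # (the k=1 below is random.choices' sample count, not the k parameter above)
        return random.choices(actions, weights=weights, k=1)[0]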

Speeding Convergence
Updating sequence (episode): start at a random state and act until reaching the absorbing goal state
For our grid world example, how many Q-table entries change value during the first such sequence?
What could we do if we kept the whole sequence in memory?

Speeding Convergence (cont.)
Training on an updating sequence in reverse order speeds convergence.
Tradeoff: requires more memory (the whole sequence must be stored).
Suppose exploration and learning cost great time/expense.
Can retrain on the same stored sequences repeatedly.
The ratio of old to new update sequences is a matter of the relative costs in the problem domain.
Tradeoff: requires more memory, less diversity of state/action pairs.
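A sketch of replaying a stored updating sequence in reverse order (illustrative names; episode holds the observed (s, a, r, s') steps, and Q is a table keyed by (s, a) pairs, e.g. a defaultdict(float)):

    def replay_backwards(Q, episode, actions, gamma=0.9):
        # episode is the stored updating sequence: a list of observed (s, a, r, s') tuples.
        # Applying the update from the last step to the first lets the goal reward
        # propagate through the whole sequence in a single pass.
        for s, a, r, s_next in reversed(episode):
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        return Q

The same stored episodes can be replayed repeatedly, at the memory and diversity costs noted above.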

Nondeterministic Rewards and Actions
What if r and δ are nondeterministic (e.g. the roll of a die in a game)?
V^π(s_t) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + …] = E[Σ_{i=0}^{∞} γ^i r_{t+i}]
Q(s,a) = E[r(s,a) + γ V*(δ(s,a))]
= E[r(s,a)] + γ E[V*(δ(s,a))]
= E[r(s,a)] + γ Σ_{s'} P(s'|s,a) V*(s')
Therefore:
Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
Note: This is not an update rule.

Update Rule for Nondeterministic Case
Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
Our previous update rule fails to converge.
Suppose we start with the correct Q function.
Nondeterminism will change Q forever.
Need to slow the change to Q over time:
Let α_n = 1 / (1 + visits_n(s,a)) (count includes the current visit)
Q_n(s,a) ← (1 - α_n)(old estimate) + α_n(new estimate)
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
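A sketch of this decayed update (illustrative; Q and visits are tables keyed by (s, a) pairs):

    from collections import defaultdict

    Q = defaultdict(float)       # Q_0(s, a) = 0 for all s, a
    visits = defaultdict(int)    # visits_n(s, a)

    def nondeterministic_update(s, a, r, s_next, actions, gamma=0.9):
        # Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n (r + gamma max_a' Q_{n-1}(s',a'))
        visits[(s, a)] += 1                              # count includes the current visit
        alpha = 1.0 / (1 + visits[(s, a)])
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target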

Conditions for Convergence
If:
rewards are bounded (as before)
the training rule is:
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
0 ≤ γ < 1
Σ_{i=1}^{∞} α_{n(i,s,a)} = ∞, where n(i,s,a) is the iteration corresponding to the ith time action a is applied in state s
Σ_{i=1}^{∞} (α_{n(i,s,a)})^2 < ∞
Then Q will converge to the correct values as n → ∞ with probability 1.

Example: Pig Dice Game
On each turn, a player rolls a single die as many times as desired.
If the player stops before rolling a 1, the total of the numbers rolled that turn is added to the player's cumulative score.
If the player rolls a 1, the turn ends and the player receives no score for that turn.
The goal is to be the first player to reach a score of 100.
Q_n(s,a) ← (1 - α_n) Q_{n-1}(s,a) + α_n (r + γ max_{a'} Q_{n-1}(s',a'))
α_n = 1 / (1 + visits_n(s,a))
What are the ramifications of one's choice of γ?
How can one best speed convergence when learning from real game experience?
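One possible, purely illustrative encoding of Pig for Q learning: let the state be (my score, opponent's score, current turn total) and the actions be "roll" and "hold". The sketch below implements only the environment step for one player and uses a win-based reward of +1 for reaching 100, which is a design choice of this sketch rather than something specified in the notes:

    import random

    ACTIONS = ["roll", "hold"]

    def step(state, action):
        # One player's view of a Pig turn.  state = (my_score, opponent_score, turn_total).
        # Returns (new_state, reward); reward is +1 only for reaching 100 (an illustrative
        # choice), and the opponent's turns are not modeled here.
        my_score, opp_score, turn_total = state
        if action == "hold":
            my_score += turn_total
            return (my_score, opp_score, 0), (1.0 if my_score >= 100 else 0.0)
        roll = random.randint(1, 6)          # the nondeterministic part of the MDP
        if roll == 1:                        # turn ends with no score for this turn
            return (my_score, opp_score, 0), 0.0
        return (my_score, opp_score, turn_total + roll), 0.0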

Example: Simplified Blackjack
Read the homework description of simplified Blackjack.
What would you try for g and why?
What would be the states and actions of your nondeterministic MDP?
Do you expect your optimal policy to match the strategy described?

Further Reading: Temporal Difference Learning
A generalization of Q-learning with nondeterminism
Basic idea:  If you use a sequence of actions and rewards, you can write a more general learning rule blending estimates from lookahead of different depths.
Tesauro (1995) trained TD-GAMMON on 1.5 million self-generated games to become nearly equal to the top-ranked players of international backgammon tournaments.
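One common way to write such a blended estimate (a sketch following the TD(λ) idea; λ, with 0 ≤ λ ≤ 1, is a blending parameter not defined in these notes):
Q^λ(s_t, a_t) = (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + … ]
where Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + … + γ^(n-1) r_{t+n-1} + γ^n max_{a'} Q(s_{t+n}, a') is the estimate obtained from a lookahead of depth n.
Setting λ = 0 recovers the one-step Q-learning estimate used above.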