CS 391 - Special Topic: Machine Learning Chapter 6

• 3/3: Sections 6.1-6.2
• 3/5: Sections 6.3-6.4
• 3/7: Sections 6.5, 6.8-6.9

Topics
• Sarsa, on-policy TD control
• Q-learning, off-policy TD control
• Afterstates

Discussion Questions

How are TD methods similar to DP methods?  How are they different?
How are TD methods similar to Monte Carlo methods?  How are they different?
Write the TD update rule.  Describe Tabular TD(0) policy evaluation.  What is the corresponding backup diagram?
Define sample backup and full backup. Which methods perform each?
Do TD methods bootstrap?  Why or why not?
What is batch updating?
What is the difference between a maximum-likelihood estimate and a certainty-equivalence estimate?  How do MC methods estimate?  How do TD methods estimate?
In what sense is TD(0) optimal?
Write the Sarsa update rule.  How does it differ from the TD(0) update rule?
Describe the Sarsa algorithm.  Why is this called an on-policy control algorithm?
Describe the Q-learning algorithm.  How does the Q-learning algorithm differ from the Sarsa algorithm?  Why is this called an off-policy control algorithm?
How do Sarsa and Q-learning behave differently on the cliff-walking task, and why?
What are afterstates?
How would you formulate Jack's Car Rental problem using afterstates?

Programming Assignment

HW6

Due Friday 3/7 at the beginning of class.  If your work is completed and checked by me before I leave on Thursday afternoon, attendance at Friday's class is optional.  You must work in pairs on this homework.  An improvised, informal presentation of your work may be requested in a later class.

Implement and test the Sarsa algorithm (Fig. 6.9) for the Windy Gridworld with King's Moves (Exercise 6.6).  For output, print the ε-soft policy you're using (e.g. ε-greedy) and its parameters (e.g. ε = 0.01).  At intervals (e.g. after every n trials), print the current iteration, the greedy policy, and the corresponding greedy evaluations of states.

Represent the policy in a tabular layout in the same orientation as the Gridworld locations, with -- = stay, N = North, S = South, E = East, W = West, NE = Northeast, etc.  Represent the state values in a tabular layout, showing only as many significant digits as keep the layout under 80 columns wide.

Do all of this both with and without the "stay" move.  What is the length of the optimal path in each case?  (The optimal policy with only the four directions has a cost of 15.)
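The tabular output described above could be produced along the following lines.  This is only a sketch under stated assumptions: the names format_policy and format_values are hypothetical, states are assumed to be (x, y) pairs with y = 0 as the southmost row, the "greedy evaluation" of a state is assumed to be the value of its greedy action, and the 10 by 7 grid size is taken from the textbook's Windy Gridworld rather than from this page.

```python
def format_policy(greedy_move, width, height):
    """Lay out move labels (e.g. "N", "SE", "--") with the northmost row on top.

    greedy_move maps (x, y) -> label; each cell is right-aligned to 2 characters.
    """
    rows = []
    for y in range(height - 1, -1, -1):  # print north first, south last
        rows.append(" ".join(f"{greedy_move[(x, y)]:>2}" for x in range(width)))
    return "\n".join(rows)


def format_values(state_value, width, height, cell=7, digits=2):
    """Lay out state values in fixed-width cells.

    With 10 columns and cell=7 each row is 70 characters, i.e. under 80.
    """
    rows = []
    for y in range(height - 1, -1, -1):
        rows.append(
            "".join(f"{state_value[(x, y)]:{cell}.{digits}f}" for x in range(width))
        )
    return "\n".join(rows)


if __name__ == "__main__":
    width, height = 10, 7  # assumed grid size (textbook example)
    # Placeholder tables, not learned results.
    policy = {(x, y): "E" for x in range(width) for y in range(height)}
    values = {(x, y): -15.0 for x in range(width) for y in range(height)}
    print(format_policy(policy, width, height))
    print(format_values(values, width, height))
```

If values grow beyond three digits before the decimal point, reducing digits (rather than widening cell) keeps the rows under 80 columns.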

Every action (including staying or moving to the goal state) has a reward of -1.  The world does not wrap around.  You can make differing assumptions about the way the world evolves at the boundaries, but you must be explicit about your assumptions.  For example, here are two different assumption sets:

(1) All moves are legal in all states.  To compute the resulting state, change the (x,y) position according to the move and the wind.  The move and/or the wind may take the agent temporarily off the grid, but the final resulting position is the closest grid point.  Example: You're in the top (northmost) row in a column with wind +2 (northward).  You move northeast.  This takes you one space east and three spaces north (one from the move, two from the wind), but the resulting position is directly to the east, because that's the closest point on the grid to the initial result of your movement.
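One way assumption set (1) might be coded is sketched below.  The names MOVES, WIND, and step_clamped are hypothetical, y is assumed to increase northward with y = 0 as the southmost row, the wind applied is assumed to be that of the column the agent starts in, and the wind strengths and 10 by 7 grid size are assumed from the textbook's Windy Gridworld.  Clamping each coordinate independently gives the closest grid point on a rectangular grid.

```python
# King's moves plus "stay", as (dx, dy) with +x = east and +y = north (assumed).
MOVES = {
    "N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
    "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1),
    "--": (0, 0),
}

# Northward wind strength per column (assumed from the textbook example).
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def step_clamped(x, y, move, wind, width, height):
    """Apply move and wind together, then snap to the closest grid point."""
    dx, dy = MOVES[move]
    new_x = clamp(x + dx, 0, width - 1)
    new_y = clamp(y + dy + wind[x], 0, height - 1)  # wind of the starting column
    return new_x, new_y


if __name__ == "__main__":
    # The example above: top row (y = 6), column with wind +2 (x = 6), move NE.
    print(step_clamped(6, 6, "NE", WIND, 10, 7))  # (7, 6): directly to the east
```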

(2) It is illegal to perform a move that would take the agent off the grid if there were no wind.  The move is resolved before the wind blows.  Example: You're in the bottom (southmost) row in a column with wind +1 (northward).  You cannot move southeast and end up directly to the east, because the move itself would take you off the grid in the first phase of movement, before the wind is applied.
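Assumption set (2) might be sketched as a legality check followed by a separate wind phase.  The names MOVES, WIND, is_legal, and step_move_then_wind are hypothetical, and the coordinate convention, wind strengths, and grid size are assumed as in the textbook's Windy Gridworld.  Two details are not specified above and are assumptions of this sketch: the wind applied is that of the column the agent starts in, and a wind that would push the agent past the north edge leaves it in the northmost row.

```python
# King's moves plus "stay", as (dx, dy) with +x = east and +y = north (assumed).
MOVES = {
    "N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
    "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1),
    "--": (0, 0),
}

# Northward wind strength per column (assumed from the textbook example).
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]


def is_legal(x, y, move, width, height):
    """A move is legal only if, ignoring wind, it stays on the grid."""
    dx, dy = MOVES[move]
    return 0 <= x + dx < width and 0 <= y + dy < height


def step_move_then_wind(x, y, move, wind, width, height):
    """Resolve the move first, then let the wind blow."""
    if not is_legal(x, y, move, width, height):
        raise ValueError(f"illegal move {move!r} from ({x}, {y})")
    dx, dy = MOVES[move]
    mid_x, mid_y = x + dx, y + dy  # phase 1: the move alone
    # Phase 2: the wind; stopping at the north edge is an assumption.
    new_y = min(mid_y + wind[x], height - 1)
    return mid_x, new_y


if __name__ == "__main__":
    # The example above: bottom row (y = 0), column with wind +1 (x = 3).
    print(is_legal(3, 0, "SE", 10, 7))                  # False
    print(step_move_then_wind(3, 0, "E", WIND, 10, 7))  # (4, 1)
```

Under this assumption set the agent should choose only among the legal moves in each state, so the ε-greedy selection would filter the action list with is_legal first.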