CS 391 - Special Topic: Machine Learning
Chapter 6

- 3/3: Sections 6.1-6.2
- 3/5: Sections 6.3-6.4
- 3/7: Sections 6.5, 6.8-9

- TD-prediction and its advantages
- Sarsa, on-policy TD control
- Q-learning, off-policy TD control
- Afterstates

__Discussion Questions__

How are TD methods similar to DP methods? How are they different?

How are TD methods similar to Monte Carlo methods? How are they different?

Write the TD update rule. Describe Tabular TD(0) policy evaluation.
What is the corresponding backup diagram?

Define *sample backup* and *full backup*. Which methods perform each?

Do TD methods bootstrap? Why or why not?

What is *batch updating*?

What is the difference between a *maximum-likelihood estimate* and a *certainty-equivalence
estimate*? How do MC methods estimate? How do TD methods
estimate?

In what sense is TD(0) optimal?

Write the Sarsa update rule. How does it differ from the TD(0) update
rule?

Describe the Sarsa algorithm. Why is this called an *on-policy*
control algorithm?

Describe the Q-learning algorithm. How does the Q-learning algorithm
differ from the Sarsa algorithm? Why is this called an *off-policy*
control algorithm?
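
As a starting point for comparing the two control algorithms, here is a minimal sketch of their one-step updates in Python. The function names, the dictionary-based `Q` table, and the default values of `alpha` and `gamma` are illustrative assumptions, not anything prescribed by the text. The only difference between the two is the target: Sarsa uses the value of the action actually selected in the next state, while Q-learning uses the maximum action value in the next state.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """Sarsa: the target uses the action a_next actually chosen in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """Q-learning: the target uses the largest action value in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical two-action example; unseen (state, action) pairs default to 0.
Q = defaultdict(float)
Q[("s1", "left")], Q[("s1", "right")] = -2.0, -1.0
ACTIONS = ["left", "right"]

sarsa_update(Q, "s0", "left", -1.0, "s1", "left")         # follows "left" next
q_learning_update(Q, "s0", "right", -1.0, "s1", ACTIONS)  # assumes best next move
print(Q[("s0", "left")], Q[("s0", "right")])  # -1.5 -1.0
```

With identical experience, the Sarsa estimate is lower here because it backs up the value of the (worse) action that was actually taken.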

How do these algorithms behave differently for the cliff-walking task and why?

What are afterstates?

How would you formulate the Jack's rental car problem using afterstates?

__Programming Assignment__

__HW6__

Due Friday 3/7 at the beginning of class. If your work is completed and checked by me before I leave on Thursday afternoon, attendance at Friday's class is optional. You must work in pairs on this homework. An improvised, informal presentation of your work may be requested in a later class.

Implement and test the Sarsa algorithm (Fig. 6.9) for the Windy Gridworld
with King's Moves (Exercise 6.6). For output, you should print the
ε-soft policy you're using (e.g. ε-greedy) and its parameters (e.g. ε =
0.01). At intervals (e.g. after every *n* trials) print the current
iteration, *greedy* policy, and corresponding *greedy* evaluations of
states. Represent the policy in a tabular layout in the same orientation
as Gridworld locations with -- = stay, N = North, S = South, E = East, W = West,
NE = Northeast, etc. Represent the state values in a tabular layout
showing as many significant digits as will fit while keeping the layout
less than 80 columns wide. Do the same with and without the "stay" move.
What is the length of each of the optimal paths? (The optimal policy with
the four directions has a cost of 15.)
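
One possible way to produce the requested tabular output is sketched below. The function names and the list-of-rows input format (northmost row first) are assumptions made for illustration; any layout that matches the Gridworld orientation and stays under 80 columns would do.

```python
def format_policy(policy):
    """policy: list of rows (northmost first) of move labels like "N", "SE", "--"."""
    return "\n".join(" ".join(f"{move:>2}" for move in row) for row in policy)


def format_values(values, max_width=79):
    """values: list of rows (northmost first) of floats.

    Tries 4 decimals first and drops precision until every line fits
    within max_width columns (79, i.e. less than 80).
    """
    for decimals in range(4, -1, -1):
        cell = max(len(f"{v:.{decimals}f}") for row in values for v in row)
        lines = [" ".join(f"{v:{cell}.{decimals}f}" for v in row) for row in values]
        if all(len(line) <= max_width for line in lines):
            return "\n".join(lines)
    raise ValueError("grid too wide to print within max_width columns")


# Hypothetical 2x2 fragment, just to show the layout.
print(format_policy([["N", "SE"], ["--", "W"]]))
print(format_values([[-1.0, -12.5], [-3.25, -2.0]]))
```

For a 10-column grid with values such as -12.5, four decimals would need 89 columns, so this sketch falls back to three decimals (79 columns).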

Every action (including staying or moving to the goal state) has a reward of
-1. The world does not wrap around. You can make differing
assumptions about the way the world evolves at the boundaries, but *you must
be explicit about your assumptions.* For example, here are two *different*
assumption sets:

(1) All moves are legal in all states. To compute the resulting state,
change the (x,y) position according to the move and the wind. The move
and/or the wind may take the agent temporarily off the grid, but the final
resulting position is the *closest grid point*. Example: You're
in the top (northmost) row in a column with wind +2 (northward). You move
northeast. This takes you 1 space to the east and three spaces northward,
but the resulting position is directly to the east, because that's the closest
point on the grid to the initial result of your movement.

(2) It is illegal to perform a move that would take the agent off the grid if there were no wind. The move is resolved before the wind blows. Example: You're in the bottom (southmost) row in a column with wind +1 (northward). You cannot move southeast and end up directly to the east, because this would take you off the board in the first phase of movement.
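
The two assumption sets above could be expressed as transition functions along the following lines. This is only a sketch under stated assumptions: the 10x7 grid size and wind strengths are taken from the standard Windy Gridworld, `y` increases northward, the wind applied is that of the column the agent starts the move in, and under assumption set (2) a wind that would push the agent off the grid is clamped to the edge (the text above does not specify this case). All names are hypothetical.

```python
WIDTH, HEIGHT = 10, 7
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # northward push per column (assumed)

# (dx, dy) per move; y increases northward. "--" is the optional stay move.
MOVES = {
    "N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0),
    "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1),
    "--": (0, 0),
}


def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def step_closest(x, y, move):
    """Assumption set (1): apply move and wind together, then snap the
    result to the closest grid point."""
    dx, dy = MOVES[move]
    return (clamp(x + dx, 0, WIDTH - 1),
            clamp(y + dy + WIND[x], 0, HEIGHT - 1))


def step_strict(x, y, move):
    """Assumption set (2): resolve the move first; return None if the move
    alone leaves the grid. Then apply the wind (clamped: an assumption)."""
    dx, dy = MOVES[move]
    mx, my = x + dx, y + dy
    if not (0 <= mx < WIDTH and 0 <= my < HEIGHT):
        return None  # illegal move
    return (mx, clamp(my + WIND[x], 0, HEIGHT - 1))


# The two examples from the text:
print(step_closest(6, 6, "NE"))  # (7, 6): directly to the east
print(step_strict(3, 0, "SE"))   # None: southeast is illegal from the bottom row
```

Under assumption set (1) the same southeast move from the bottom row is legal and ends directly to the east, which is exactly the difference the second example is pointing out.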