Chapter 4

CS 391 - Special Topic: Machine Learning

Readings

2/5: Sections 4.1-4.2
2/7: Sections 4.3-4.4
2/10: Sections 4.5-4.8

Topics

Dynamic Programming (DP) assumptions
Policy evaluation (a.k.a. prediction problem)
Policy improvement
Policy iteration
Value iteration
Asynchronous DP
Generalized policy iteration (GPI)
Computational complexity

Discussion Questions

What assumption made by DP methods limits their applicability?
What is policy evaluation (a.k.a. the prediction problem)?
What is the update rule for iterative policy evaluation?
What is a "full backup"?
If updates are done "in place", will the algorithm converge to correct values for V^pi? If you didn't update "in place", what additional data structures would you need?
Does the ordering of states for "in place" updates have any effect on convergence? If yes, how? If no, why not?
What is the termination condition for iterative policy evaluation?
3-Room Navigation: A robot's policy is to randomly navigate back and forth between three rooms {left, middle, right}. For each movement between rooms, there is a reward of -1 (i.e. a cost). Upon entering room left, the robot powers down (i.e. the episode ends; there are no further costs).
-Write the Bellman equations for V^pi.
-Solve them algebraically.
-Perform 4 iterations of iterative policy evaluation (not) in place.
What is policy improvement?
How would policy improvement apply to the 3-Room Navigation problem above?
What is policy iteration, and how does it relate to policy evaluation and policy improvement?
How would policy iteration apply to the 3-Room Navigation problem above?
Describe value iteration. How does it differ from policy iteration?
How does the 3-Room Navigation problem motivate use of value iteration?
What are asynchronous DP algorithms? What advantages do they have?
What backup constraint is necessary to guarantee convergence?
Describe generalized policy iteration. How is it generalized from both value iteration and policy iteration?
The methods of this chapter can be viewed as a continual relaxation of constraints. What constraints are removed at each stage?
Describe the general interaction between evaluation and improvement processes.
What is the time complexity of DP methods?
What is the curse of dimensionality and how does it apply to DP methods?
What do the authors note about the practical applicability of DP methods?
What is bootstrapping in the context of reinforcement learning?
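
For the 3-Room Navigation question above, a minimal Python sketch of iterative policy evaluation with synchronous (not in place) updates might look like the following. It assumes the rooms lie in a row, so that the right room connects only to the middle room, that the random policy picks each available neighbor with equal probability, and that there is no discounting. The names TRANSITIONS, STEP_REWARD, sweep, and evaluate are illustrative and do not come from the course materials.

```python
# Hypothetical model of the 3-Room Navigation problem.
# Assumption: the rooms lie in a row, so "right" connects only to "middle".
# "left" is terminal, so it has no outgoing transitions and its value stays 0.
TRANSITIONS = {
    "middle": [(0.5, "left"), (0.5, "right")],  # random policy: either neighbor
    "right": [(1.0, "middle")],                 # only one door to walk through
}
STEP_REWARD = -1.0  # cost of each movement between rooms


def sweep(values):
    """One synchronous full backup: read old values, write to a fresh copy."""
    new_values = dict(values)
    for state, outcomes in TRANSITIONS.items():
        new_values[state] = sum(
            prob * (STEP_REWARD + values[nxt]) for prob, nxt in outcomes
        )
    return new_values


def evaluate(theta=1e-10):
    """Sweep until the largest change in any state's value drops below theta."""
    values = {"left": 0.0, "middle": 0.0, "right": 0.0}
    while True:
        new_values = sweep(values)
        delta = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        if delta < theta:
            return values


if __name__ == "__main__":
    values = {"left": 0.0, "middle": 0.0, "right": 0.0}
    for k in range(1, 5):
        values = sweep(values)
        print(k, values)
    print("converged:", evaluate())
```

Under these assumptions the values settle near V(middle) = -3 and V(right) = -4, which agrees with solving the two Bellman equations algebraically. The copy made in sweep is the extra data structure that a version without in-place updates needs.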

Programming Assignment

HW4

Due Friday 2/21 at the beginning of class. An improvised, informal presentation of your work may be requested in class.

Choose one of the following problems:

The Gambler's Problem (Example 4.3)
Piglet with goal = 10 (described here and in handout)
a pre-approved problem of your choice

and apply either policy iteration or value iteration to compute an optimal policy. Present the optimal policy in a clear form, and present data on the performance of the optimal policy.
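
As a possible starting point, here is a minimal sketch of value iteration in Python. It is not a solution to any of the problems listed above: the model format (a dictionary mapping each state to its actions, and each action to a list of (probability, next state, reward) outcomes) is an assumption made for illustration, and the demonstration MDP is a hypothetical controllable variant of the 3-Room Navigation problem in which the robot chooses its direction. The names value_iteration and ROOMS are likewise illustrative.

```python
def value_iteration(model, gamma=1.0, theta=1e-9):
    """Return (values, greedy policy) for a small tabular MDP.

    model[state][action] is a list of (prob, next_state, reward) outcomes.
    A state with no actions is treated as terminal and keeps value 0.
    """
    values = {state: 0.0 for state in model}

    def q(state, action):
        # Expected one-step return of taking `action` in `state`.
        return sum(
            prob * (reward + gamma * values[nxt])
            for prob, nxt, reward in model[state][action]
        )

    while True:
        delta = 0.0
        for state, actions in model.items():
            if not actions:  # terminal state
                continue
            best = max(q(state, action) for action in actions)
            delta = max(delta, abs(best - values[state]))
            values[state] = best  # in-place update
        if delta < theta:
            break

    # Read off a policy that is greedy with respect to the final values.
    policy = {
        state: max(actions, key=lambda action: q(state, action))
        for state, actions in model.items()
        if actions
    }
    return values, policy


# Hypothetical variant of the 3-room problem where the robot picks a direction.
ROOMS = {
    "left": {},  # terminal: the robot powers down
    "middle": {
        "go_left": [(1.0, "left", -1.0)],
        "go_right": [(1.0, "right", -1.0)],
    },
    "right": {"go_left": [(1.0, "middle", -1.0)]},
}

if __name__ == "__main__":
    values, policy = value_iteration(ROOMS)
    print(values)
    print(policy)
```

For the chosen problem, the main work is expressing its states, actions, and outcomes in a model like ROOMS. The in-place update is a design choice. A synchronous version would keep a second copy of the values.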