CS 391 - Special Topic: Machine Learning
Chapter 5

- 2/24: Sections 5.1-5.2
- 2/26: Sections 5.3-5.4
- 2/28: Sections 5.5-5.8

- Monte Carlo Policy State- and Action-Value Estimation
- On-/Off-Policy Monte Carlo Control
- Incremental Implementation

How do Monte Carlo (MC) methods differ from Dynamic Programming (DP) methods?

What assumption(s) are made by DP methods that are not made by MC methods?

What assumption(s) are made by MC methods that are not made by DP methods?

Describe the first-visit MC method for estimating V^π. How does the
every-visit MC method differ?
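As a study aid, here is a minimal sketch of first-visit MC prediction. The episode representation (a list of `(state, reward)` pairs, with each reward following the departure from its state) and the function name are assumptions for illustration, not the book's notation:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V^pi by averaging the return observed after the
    FIRST visit to each state in each episode.

    Assumed format: each episode is a list of (state, reward) pairs,
    where `reward` is received after leaving `state`.
    """
    returns = defaultdict(list)  # state -> list of first-visit returns
    for episode in episodes:
        # Index of the first visit to each state in this episode.
        firsts = {}
        for i, (s, _) in enumerate(episode):
            if s not in firsts:
                firsts[s] = i
        # Walk backward so G accumulates the discounted return.
        G = 0.0
        rets = [0.0] * len(episode)
        for i in range(len(episode) - 1, -1, -1):
            _, r = episode[i]
            G = r + gamma * G
            rets[i] = G
        # Record only the return from each state's first visit.
        for s, i in firsts.items():
            returns[s].append(rets[i])
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}
```

Every-visit MC would append `rets[i]` for every occurrence of a state, not just its first.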

Draw the backup diagram for MC methods. How does it differ from backup
diagrams for DP methods?

Do MC methods bootstrap? Why or why not?

What are three advantages of MC methods over DP methods?

What is the problem of *maintaining exploration* and how does it relate to
MC estimation of action-values? What is the assumption of *exploring
starts* (ES)?

Describe the MC ES algorithm. How does it differ from every-visit MC?

What does it mean to be an *on-policy* versus *off-policy* method?

How does the ε-soft on-policy MC control algorithm of Figure 5.6 differ from the
MC ES algorithm of Figure 5.4?
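The key ingredient of the Figure 5.6 algorithm is acting from an ε-soft policy rather than relying on exploring starts. A minimal sketch of ε-greedy action selection (the names `Q` and `actions` are assumptions for illustration):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Select an action under an epsilon-soft policy: with probability
    epsilon pick uniformly at random (so every action keeps nonzero
    probability), otherwise act greedily with respect to Q.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Because every action is chosen with probability at least ε/|A(s)|, exploration is maintained without assuming episodes start from arbitrary state-action pairs.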

How does one learn an action-value estimation for one policy from episodes
generated by another policy?

Describe the off-policy MC control algorithm of Figure 5.7. What is *w*?
What are N(s,a) and D(s,a)?
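In the Figure 5.7 style of off-policy MC, *w* is the importance-sampling weight (the ratio of target-policy to behavior-policy action probabilities along the tail of the episode), N(s,a) accumulates the weighted returns, and D(s,a) accumulates the weights. A sketch of one such update, with the function name and dictionary representation assumed for illustration:

```python
def off_policy_update(N, D, s, a, w, G):
    """One weighted importance-sampling update:
    N(s,a) accumulates w*G, D(s,a) accumulates w,
    and the estimate is Q(s,a) = N(s,a) / D(s,a).
    """
    N[(s, a)] = N.get((s, a), 0.0) + w * G
    D[(s, a)] = D.get((s, a), 0.0) + w
    return N[(s, a)] / D[(s, a)]
```

The estimate is a weighted average of returns, so highly weighted episodes dominate it.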

What is the potential problem of this approach?

How is *incremental implementation* relevant to MC methods?
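The point of incremental implementation is that a mean of returns can be maintained with a single stored value and a count, rather than a growing list. A one-line sketch (names assumed for illustration):

```python
def incremental_mean(Q, n, G):
    """Update a running mean of returns without storing them all:
    Q_{n+1} = Q_n + (G - Q_n) / (n + 1),
    where n is the number of returns already averaged into Q.
    """
    return Q + (G - Q) / (n + 1)
```

Replacing 1/(n+1) with a constant step size instead gives a recency-weighted average, useful in nonstationary problems.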

__Programming Assignment__

__HW5__

Due Friday 2/28 at the beginning of class. You are encouraged to work in pairs. An improvised, informal presentation of your work may be requested in class.

Implement and test first-visit Monte Carlo policy iteration with an ε-soft policy. Choose one of the following problems:

- an episodic, associative problem you have already implemented,
- a gridworld problem (Example 3.8 or 4.1), or
- a pre-approved problem of your choice.

You must collect and present experimental data (e.g. learned policies, performance, etc.).