CS 391 - Special Topic: Machine Learning
How do Monte Carlo (MC) methods differ from Dynamic Programming (DP) methods?
What assumption(s) are made by DP methods that are not made by MC methods?
What assumption(s) are made by MC methods that are not made by DP methods?
Describe the first-visit MC method for estimating V^pi. How does the every-visit MC method differ?
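For intuition, first-visit MC prediction can be sketched as below. This is a hedged sketch, not a required implementation; `generate_episode` is a hypothetical helper assumed to return one episode under policy pi as a list of (state, reward) pairs, where the reward is the one received on leaving that state.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V^pi by averaging the returns that follow the *first*
    visit to each state within an episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode()
        # Backward pass: G_t = r_{t+1} + gamma * G_{t+1}.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = gamma * G + r
            returns_at[t] = G
        # Record a return only at the first occurrence of each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns_at[t]
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

The every-visit variant drops the `seen` check and averages returns from all occurrences of a state.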
Draw the backup diagram for MC methods. How does it differ from backup diagrams for DP methods?
Do MC methods bootstrap? Why or why not?
What are three advantages of MC methods over DP methods?
What is the problem of maintaining exploration and how does it relate to MC estimation of action-values? What is the assumption of exploring starts (ES)?
Describe the MC ES algorithm. How does it differ from every-visit MC?
What does it mean to be an on-policy versus off-policy method?
How does the ε-soft on-policy MC control algorithm of Figure 5.6 differ from the MC ES algorithm of Figure 5.4?
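As a concrete reference point, an ε-soft action selection rule can be sketched as follows (an illustrative sketch, assuming Q is a dict keyed by (state, action) pairs; the names are hypothetical):

```python
import random

def epsilon_soft_action(Q, state, actions, epsilon=0.1):
    """Epsilon-soft selection: with probability epsilon pick uniformly
    among all actions, otherwise pick a greedy action. Every action thus
    has probability at least epsilon / |A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Because every action keeps nonzero probability, this removes the need for the exploring-starts assumption in on-policy control.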
How does one learn action-value estimates for one policy from episodes generated by another policy?
Describe the off-policy MC control algorithm of Figure 5.7. What is w? What are N(s,a) and D(s,a)?
What is the potential problem of this approach?
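For intuition about the numerator/denominator accumulators in the questions above, a weighted importance-sampling estimate for a single (s, a) pair can be sketched like this (a hedged sketch in the spirit of N(s,a) and D(s,a), not the book's figure verbatim):

```python
def weighted_is_estimate(returns_and_weights):
    """Weighted importance-sampling estimate Q(s,a) = N / D, where
    N accumulates w * G and D accumulates w over (return, weight)
    pairs from episodes generated by the behavior policy."""
    N, D = 0.0, 0.0
    for G, w in returns_and_weights:
        N += w * G
        D += w
    return N / D if D > 0 else 0.0
```

Note that w is a product of probability ratios over the episode's tail, so it can vary wildly; returns with weight zero contribute nothing, which is one way to see the slow-learning problem asked about above.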
How is incremental implementation relevant to MC methods?
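The key identity behind incremental implementation is that a running average can be updated without storing past returns: V_{n+1} = V_n + (1/n)(G_n - V_n). A minimal sketch:

```python
def incremental_mean(returns):
    """Maintain a sample average with O(1) memory:
    after each new return G, apply V += (G - V) / n."""
    V, n = 0.0, 0
    for G in returns:
        n += 1
        V += (G - V) / n
    return V
```

The same form, with a step-size parameter in place of 1/n, generalizes to nonstationary and weighted updates.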
Due Friday 2/28 at the beginning of class. You are encouraged to work in pairs. An improvised, informal presentation of your work may be requested in class.
Implement and test first-visit Monte Carlo policy iteration with an ε-soft policy. Choose one of the following problems:
You must collect and present experimental data (e.g., learned policies, performance curves).