CS 391 - Special Topic: Machine Learning
Chapter 2
What is the difference between evaluative feedback and instructive
feedback?
Describe the n-armed bandit problem. What would be the model and the
reward function for an n-armed bandit RL system?
Write equation 2.1 and explain what is computed to estimate the action-value
Q*(a).
What is a greedy action selection rule?
What is an e-greedy action selection rule? Explain how the greedy
action selection rule is a special case of the e-greedy action selection rule.
Under what circumstances are greedy or e-greedy action selection rules more
appropriate?
What is the trade-off as you increase/decrease epsilon?
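As a starting point for the questions on greedy and e-greedy selection, the rule can be sketched as below. This is a minimal illustration, not the textbook's code: the function name `epsilon_greedy`, the list `q_values` of current action-value estimates, and the uniform tie-breaking are all assumptions made for this example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: explore with probability epsilon, else exploit.

    q_values and this function name are hypothetical, chosen for illustration.
    """
    if rng.random() < epsilon:
        # Explore: pick uniformly among all actions, including the greedy one.
        return rng.randrange(len(q_values))
    # Exploit: pick an action with the highest estimate, breaking ties uniformly.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

Calling it with epsilon = 0 never takes the exploration branch, which is one way to see greedy selection as a special case.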
What is the drawback to e-greedy's exploration/exploitation approach that
softmax action selection addresses?
Consider Equation 2.2 and a 3-armed bandit problem where the expected value of
action k is k. That is, Q(a_k) = k.
(1) For a low temperature tau, show that softmax is nearly greedy.
(2) For a high temperature tau, show that actions are nearly equiprobable.
(3) For an intermediate temperature tau, show that the probabilities of actions
are ranked (i.e. ordered) according to their expected value.
(Note: Please do the math ahead of time and simply present your tau value and
the results of softmax computation.)
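One way to check hand calculations for the three cases is a short script like the one below. It assumes Equation 2.2 is the Gibbs (Boltzmann) distribution, with each action's probability proportional to exp(Q(a)/tau). The function name `softmax_probs` and the three temperatures are arbitrary choices for illustration; pick your own tau values for the presentation.

```python
import math

def softmax_probs(q_values, tau):
    """Gibbs/Boltzmann action probabilities at temperature tau > 0.

    Hypothetical helper for illustration; assumes P(a) is proportional
    to exp(Q(a) / tau).
    """
    m = max(q_values)  # subtract the max before exponentiating to avoid overflow
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

q = [1.0, 2.0, 3.0]  # Q(a_k) = k for the 3-armed example
for tau in (0.1, 1.0, 100.0):  # illustrative temperatures, chosen arbitrarily
    print(tau, [round(p, 3) for p in softmax_probs(q, tau)])
```

Subtracting the maximum estimate does not change the probabilities, because the common factor cancels in the ratio; it only keeps the exponentials in a safe numeric range at low temperatures.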
Present the derivation of the incremental update rule.
Why is incremental updating important?
What is the general form of the incremental update rule?
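A quick numerical check can accompany the derivation. The sketch below assumes the sample-average form of the incremental rule, where the k-th reward moves the estimate by (r - Q)/k, and confirms that it reproduces the ordinary mean without storing past rewards. The function name `incremental_mean` and the sample rewards are made up for this example.

```python
def incremental_mean(rewards):
    """Sample-average estimate computed one reward at a time.

    Hypothetical helper: applies Q <- Q + (1/k) * (r - Q) for the k-th reward,
    so only the current estimate and the count are kept in memory.
    """
    q = 0.0  # initial estimate; it is overwritten by the first reward since 1/k = 1
    for k, r in enumerate(rewards, start=1):
        q += (r - q) / k
    return q

rewards = [1.0, 2.0, 3.0, 4.0]  # made-up rewards for illustration
print(incremental_mean(rewards), sum(rewards) / len(rewards))
```

The update has the shape new estimate = old estimate + step size * (target - old estimate), with step size 1/k.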
What does it mean to have a nonstationary problem?
Why does a constant step-size parameter lead to an exponential or recency-weighted
average?
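The recency weighting can be verified numerically. The sketch below assumes a constant step size alpha in (0, 1] and compares the iterated update with the unrolled weighted sum, in which the initial estimate gets weight (1 - alpha)^k and the i-th of k rewards gets weight alpha * (1 - alpha)^(k - i). Both function names and the sample values are hypothetical.

```python
def constant_step_update(q0, rewards, alpha):
    """Apply Q <- Q + alpha * (r - Q) once per reward, starting from q0."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

def recency_weighted(q0, rewards, alpha):
    """Unrolled form: explicit weights on the initial estimate and each reward."""
    k = len(rewards)
    total = (1 - alpha) ** k * q0  # weight on the initial estimate
    for i, r in enumerate(rewards, start=1):
        # Older rewards (small i) are discounted by more factors of (1 - alpha).
        total += alpha * (1 - alpha) ** (k - i) * r
    return total

rewards = [1.0, 0.0, 2.0, 5.0]  # made-up rewards for illustration
print(constant_step_update(0.0, rewards, 0.1), recency_weighted(0.0, rewards, 0.1))
```

The two values agree up to floating-point rounding, and the weight on each reward shrinks geometrically with its age.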
Is convergence desirable or undesirable for a nonstationary problem? Why?
What does it mean to have optimistic initial values?
What is the benefit of using optimistic initial values? Does this
apply to nonstationary problems? Why or why not?
Describe the difference between associative and nonassociative tasks.
Is the n-armed bandit associative or nonassociative? Why?
Now imagine you are in Vegas pathetically guarding and working a row of 1-armed
bandits. Assume one of the machines hits the jackpot and dumps all of its
accumulated coins. Is this an associative or nonassociative task?
Why?
Summarize the chapter's conclusions on the state of the art.
Programming Assignment
HW1
Due Monday 1/27 at the beginning of class. An improvised, informal presentation of your work may be requested in class.
Choose one:
HW2
Due Monday 2/3 at the beginning of class. An improvised, informal presentation of your work may be requested in class.
Choose and do one of the above chapter 2 exercises you did not do last week.
Notes: