CS 391 - Special Topic: Machine Learning
Chapter 2

- 1/22: Sections 2.1-2.3 [2.4]
- 1/24: Sections 2.5-2.7 [2.8-2.9] 2.10-2.11

- The n-armed bandit problem
- Action-value methods: sample-average, greedy, epsilon-greedy (e-greedy)
- Softmax action selection
- Incremental action-value updating
- Nonstationary problems and action-value update step-size
- Optimistic initial action-value estimates
- Generalization to associative search

What is the difference between *evaluative* feedback and *instructive*
feedback?

Describe the n-armed bandit problem. What would be the model and the
reward function for an n-armed bandit RL system?

Write Equation 2.1 and explain how the sample average of observed rewards is
used to estimate the true action value Q*(a).

What is a *greedy* action selection rule?

What is an *e-greedy* action selection rule? Explain how the greedy
action selection rule is a special case of the e-greedy action selection rule.
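The relationship between the two rules can be sketched in code. This is an illustrative sketch, not part of the Java Reinforcement Learning Interface; the class and method names are made up for this example. With probability epsilon the agent explores uniformly at random; otherwise it exploits. Setting epsilon = 0 recovers pure greedy selection.

```java
import java.util.Random;

public class EpsilonGreedy {
    // Index of the largest estimate (ties go to the first maximum).
    static int greedy(double[] q) {
        int best = 0;
        for (int a = 1; a < q.length; a++) {
            if (q[a] > q[best]) best = a;
        }
        return best;
    }

    // Epsilon-greedy: explore with probability epsilon, else act greedily.
    static int select(double[] q, double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(q.length); // explore: uniform random action
        }
        return greedy(q);                 // exploit: current best estimate
    }

    public static void main(String[] args) {
        double[] q = {0.2, 1.5, 0.7};
        Random rng = new Random(42);
        // epsilon = 0 is exactly the greedy rule, so this is always index 1.
        System.out.println("greedy action:   " + select(q, 0.0, rng));
        System.out.println("e-greedy action: " + select(q, 0.1, rng));
    }
}
```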

Under what circumstances are greedy or e-greedy action selection rules more
appropriate?

What is the trade-off as you increase/decrease epsilon?

What is the drawback to e-greedy's exploration/exploitation approach that
softmax action selection addresses?

Consider Equation 2.2 and a 3-armed bandit problem where the expected value of
action k is k. That is, Q(a_k) = k.

(1) For a low temperature tau, show that softmax is nearly greedy.

(2) For a high temperature tau, show that actions are nearly equiprobable.

(3) For an intermediate temperature tau, show that the probabilities of actions
are ranked (i.e. ordered) according to their expected value.

(Note: Please do the math ahead of time and simply present your tau value and
the results of softmax computation.)

Present the derivation of the incremental update rule.

Why is incremental updating important?

What is the general form of the incremental update rule?
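The general form, NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), can be checked numerically: with step size 1/k the incremental rule reproduces the full sample average without storing past rewards. This sketch (names are illustrative) compares the two:

```java
public class IncrementalAverage {
    // Incremental sample-average: Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k+1).
    static double incrementalAverage(double[] rewards) {
        double q = 0.0;
        for (int k = 0; k < rewards.length; k++) {
            q = q + (rewards[k] - q) / (k + 1);
        }
        return q;
    }

    public static void main(String[] args) {
        double[] rewards = {1.0, 3.0, 2.0, 6.0};
        double sum = 0.0;
        double q = 0.0;
        for (int k = 0; k < rewards.length; k++) {
            sum += rewards[k];
            q = q + (rewards[k] - q) / (k + 1);
            // Both columns agree at every step.
            System.out.printf("k=%d incremental=%.4f full average=%.4f%n",
                              k + 1, q, sum / (k + 1));
        }
    }
}
```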

What does it mean to have a *nonstationary* problem?

Why does a constant step-size parameter lead to an *exponential,
recency-weighted average*?
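One way to see the recency weighting is to expand the constant-alpha update into its closed form, Q_k = (1-alpha)^k Q_0 + sum_i alpha (1-alpha)^{k-i} r_i, where the weight on each past reward decays exponentially with its age. This sketch (illustrative names, not the Java RL Interface) verifies that the two forms agree:

```java
public class RecencyWeighted {
    // Constant-step-size update applied to a reward sequence.
    static double update(double q0, double[] r, double alpha) {
        double q = q0;
        for (double reward : r) {
            q = q + alpha * (reward - q);
        }
        return q;
    }

    // Same estimate as an explicit exponential recency-weighted average:
    // (1-alpha)^k * Q_0 + sum over i of alpha * (1-alpha)^(k-1-i) * r[i],
    // using 0-based indices, so recent rewards carry the largest weights.
    static double weightedSum(double q0, double[] r, double alpha) {
        int k = r.length;
        double q = Math.pow(1 - alpha, k) * q0;
        for (int i = 0; i < k; i++) {
            q += alpha * Math.pow(1 - alpha, k - 1 - i) * r[i];
        }
        return q;
    }

    public static void main(String[] args) {
        double[] r = {1.0, 2.0, 0.0, 4.0};
        System.out.println("incremental:  " + update(0.5, r, 0.1));
        System.out.println("weighted sum: " + weightedSum(0.5, r, 0.1));
    }
}
```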

Is convergence desirable or undesirable for a nonstationary problem? Why?

What does it mean to have *optimistic initial values*?

What is the benefit of using *optimistic initial values*? Does this
apply to nonstationary problems? Why or why not?

Describe the difference between associative and nonassociative tasks.

Is the n-armed bandit associative or nonassociative? Why?

Now imagine you are in Vegas pathetically guarding and working a row of 1-armed
bandits. Assume one of the machines hits the jackpot and dumps all of its
accumulated coins. Is this an associative or nonassociative task?
Why?

Summarize the chapter's conclusions on the state of the art.

__Programming Assignment__

__HW1__

Due Monday 1/27 at the beginning of class. An improvised, informal presentation of your work may be requested in class.

Choose one:

- Exercise 2.2
- Exercise 2.7
- Read section 2.8 on reinforcement comparison. Implement and compare this method to e-greedy action selection, producing a graph similar to Figure 2.5.
- Read section 2.9 on pursuit methods. Implement and compare this method to e-greedy action selection, producing a graph similar to Figure 2.6.
- Choose your own adventure: Propose an alternate exercise ASAP. This exercise should be at least as challenging and relevant to the chapter material as the previous choices. For example, one might choose a different RL application.

__HW2__

Due Monday 2/3 at the beginning of class. An improvised, informal presentation of your work may be requested in class.

Choose and do one of the above chapter 2 exercises you did not do last week.

Notes:

- Your implementations should make use of (i.e. extend) the Java Reinforcement Learning Interface. This provides a common framework for RL programming and facilitates reuse of code in future assignments.
- If the exercise involves collecting a significant amount of data, present it in graphical form (e.g. using Excel or Matlab) if possible. In general, strive to communicate the significance of your experimental data well.
- Document your code and how it can be used. Write at least a paragraph about your code in your README file. How much you should write about your experimental data depends on how much there is (see previous point).
- A single run of an experiment does not make a point strongly. The more trials you run, the less uncertainty there is about the significance of your results.