CS 391 - Special Topic: Machine Learning
Chapter 2

- 1/22: Sections 2.1-2.3 [2.4]
- 1/24: Sections 2.5-2.7 [2.8-2.9] 2.10-2.11

- The n-armed bandit problem
- Action-value methods: sample-average, greedy, epsilon-greedy (e-greedy)
- Softmax action selection
- Incremental action-value updating
- Nonstationary problems and action-value update step-size
- Optimistic initial action-value estimates
- Generalization to associative search

What is the difference between *evaluative* feedback and *instructive*
feedback?

Describe the n-armed bandit problem. What would be the model and the
reward function for an n-armed bandit RL system?

Write Equation 2.1 and explain how the sample average of observed rewards is
used to estimate the true action value Q*(a).

What is a *greedy* action selection rule?

What is an *e-greedy* action selection rule? Explain how the greedy
action selection rule is a special case of the e-greedy action selection rule.
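The relationship between the two rules can be sketched in code. This is an illustrative sketch, not part of the Java Reinforcement Learning Interface; the class and method names are made up for this example. With probability epsilon the agent explores uniformly at random; otherwise it exploits. Setting epsilon = 0 recovers pure greedy selection.

```java
import java.util.Random;

public class EpsilonGreedy {
    // Index of the largest estimate (ties go to the first maximum).
    static int greedy(double[] q) {
        int best = 0;
        for (int a = 1; a < q.length; a++) {
            if (q[a] > q[best]) best = a;
        }
        return best;
    }

    // Epsilon-greedy: explore with probability epsilon, else act greedily.
    static int select(double[] q, double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(q.length); // explore: uniform random action
        }
        return greedy(q);                 // exploit: current best estimate
    }

    public static void main(String[] args) {
        double[] q = {0.2, 1.5, 0.7};
        Random rng = new Random(42);
        // epsilon = 0 is exactly the greedy rule, so this is always index 1.
        System.out.println("greedy action:   " + select(q, 0.0, rng));
        System.out.println("e-greedy action: " + select(q, 0.1, rng));
    }
}
```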

Under what circumstances are greedy or e-greedy action selection rules more
appropriate?

What is the trade-off as you increase/decrease epsilon?

What is the drawback to e-greedy's exploration/exploitation approach that
softmax action selection addresses?

Consider Equation 2.2 and a 3-armed bandit problem where the expected value of
action k is k. That is, Q(a_k) = k.

(1) For a low temperature tau, show that softmax is nearly greedy.

(2) For a high temperature tau, show that actions are nearly equiprobable.

(3) For an intermediate temperature tau, show that the probabilities of actions
are ranked (i.e. ordered) according to their expected value.

(Note: Please do the math ahead of time and simply present your tau value and
the results of softmax computation.)

Present the derivation of the incremental update rule.

Why is incremental updating important?

What is the general form of the incremental update rule?
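The general form, NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), can be checked numerically: with step size 1/k the incremental rule reproduces the full sample average without storing past rewards. This sketch (names are illustrative) compares the two:

```java
public class IncrementalAverage {
    // Incremental sample-average: Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k+1).
    static double incrementalAverage(double[] rewards) {
        double q = 0.0;
        for (int k = 0; k < rewards.length; k++) {
            q = q + (rewards[k] - q) / (k + 1);
        }
        return q;
    }

    public static void main(String[] args) {
        double[] rewards = {1.0, 3.0, 2.0, 6.0};
        double sum = 0.0;
        double q = 0.0;
        for (int k = 0; k < rewards.length; k++) {
            sum += rewards[k];
            q = q + (rewards[k] - q) / (k + 1);
            // Both columns agree at every step.
            System.out.printf("k=%d incremental=%.4f full average=%.4f%n",
                              k + 1, q, sum / (k + 1));
        }
    }
}
```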

What does it mean to have a *nonstationary* problem?

Why does a constant step-size parameter lead to an *exponential,
recency-weighted average*?
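One way to see the recency weighting is to expand the constant-alpha update into its closed form, Q_k = (1-alpha)^k Q_0 + sum_i alpha (1-alpha)^{k-i} r_i, where the weight on each past reward decays exponentially with its age. This sketch (illustrative names, not the Java RL Interface) verifies that the two forms agree:

```java
public class RecencyWeighted {
    // Constant-step-size update applied to a reward sequence.
    static double update(double q0, double[] r, double alpha) {
        double q = q0;
        for (double reward : r) {
            q = q + alpha * (reward - q);
        }
        return q;
    }

    // Same estimate as an explicit exponential recency-weighted average:
    // (1-alpha)^k * Q_0 + sum over i of alpha * (1-alpha)^(k-1-i) * r[i],
    // using 0-based indices, so recent rewards carry the largest weights.
    static double weightedSum(double q0, double[] r, double alpha) {
        int k = r.length;
        double q = Math.pow(1 - alpha, k) * q0;
        for (int i = 0; i < k; i++) {
            q += alpha * Math.pow(1 - alpha, k - 1 - i) * r[i];
        }
        return q;
    }

    public static void main(String[] args) {
        double[] r = {1.0, 2.0, 0.0, 4.0};
        System.out.println("incremental:  " + update(0.5, r, 0.1));
        System.out.println("weighted sum: " + weightedSum(0.5, r, 0.1));
    }
}
```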

Is convergence desirable or undesirable for a nonstationary problem? Why?

What does it mean to have *optimistic initial values*?

What is the benefit of using *optimistic initial values*? Does this
apply to nonstationary problems? Why or why not?

Describe the difference between associative and nonassociative tasks.

Is the n-armed bandit associative or nonassociative? Why?

Now imagine you are in Vegas pathetically guarding and working a row of 1-armed
bandits. Assume one of the machines hits the jackpot and dumps all of its
accumulated coins. Is this an associative or nonassociative task?
Why?

Summarize the chapter's conclusions on the state of the art.

__Programming Assignment__

__HW1__

Due Monday 1/27 at the beginning of class. An improvised, informal presentation of your work may be requested in class.

Choose one:

- Exercise 2.2
- Exercise 2.7
- Read section 2.8 on reinforcement comparison. Implement and compare this method to e-greedy action selection, producing a graph similar to Figure 2.5.
- Read section 2.9 on pursuit methods. Implement and compare this method to e-greedy action selection, producing a graph similar to Figure 2.6.
- Choose your own adventure: Propose an alternate exercise ASAP. This exercise should be at least as challenging and relevant to the chapter material as the previous choices. For example, one might choose a different RL application.

__HW2__

Due Monday 2/3 at the beginning of class. An improvised, informal presentation of your work may be requested in class.

Choose and do one of the above chapter 2 exercises you did not do last week.

Notes:

- Your implementations should make use of (i.e. extend) the Java Reinforcement Learning Interface. This provides a common framework for RL programming and facilitates reuse of code in future assignments.
- If the exercise involves collecting a significant amount of data, present it in graphical form (e.g. using Excel or Matlab) if possible. In general, strive to communicate the significance of your experimental data well.
- Document your code and how it can be used. Write at least a paragraph about your code in your README file. How much you should write about your experimental data depends on how much there is (see previous point).
- A single run of an experiment does not make a point strongly. The more trials you run, the less uncertainty there is about the significance of your results.