CS 391 - Special Topic: Machine Learning
Chapter 3
Reinforcement learning problem architecture
Goals, rewards, and returns
Discounting
Episodic and continuing tasks
The Markov property
Markov decision processes
State- and action-value functions
Optimal value functions
Approximating optimality
Discussion Questions
List our RL base Java classes. How do they relate to one another?
How can one decide where to place the boundary between agent and environment?
For a problem involving a person or animal, where would you place this boundary and why?
Describe an example of an RL problem and sketch how the base classes would represent it.
Describe the relationship between an agent's goal, the RL problem's rewards, and returns. Is the goal implicitly or explicitly defined? Why?
What is an episodic task? What is a continuing task?
Why is the sum-of-rewards computation for return problematic for continuing tasks?
Write the equation for the discounted return (3.2). How does discounting address this problem?
Exercise 3.4
Exercise 3.5
How can we formulate an episodic task as a continuing task? What do we need to introduce?
In this unified formalism, the return (3.3) can include “the possibility that T = infinity or gamma = 1 (but not both)”. Why does each preclude the other?
What is the Markov property? Why is this sometimes referred to as an “independence of path” property?
Can we use non-Markovian state signals in reinforcement learning?
What does it mean for a state to be an “approximation to a Markov state”?
What is a Markov decision process (MDP)?
What is finite in a finite MDP?
Write and explain the formal definition of the state-value function for policy pi (V^pi).
Write and explain the formal definition of the action-value function for policy pi (Q^pi).
Write and explain the formal definition of the Bellman equation for V^pi.
What are backup diagrams? What do open and solid circles represent?
Draw the backup diagrams for V^pi and Q^pi.
Exercise 3.8
Exercise 3.10
Exercise 3.12
Exercise 3.13
Define an optimal policy pi* in terms of state-value function V^pi* compared with V^pi for any policy pi. Do the same using action-value function Q^pi*. Is it possible to have more than one optimal policy?
Once one has the optimal state-value function V*, how does one determine the optimal policy?
What are the computational difficulties (in time and space complexity) one faces in computing optimal policies for MDPs? In practical situations, what does one generally do to cope with these difficulties?
Programming Assignment
HW3
Due Monday 2/10 at the beginning of class. An improvised, informal presentation of your work may be requested in class.
Apply one of the techniques from Chapter 2 to Jack's Car Rental problem (Example 4.2, pp. 98-99). For this, you will need to create your own environment to that specification. Also note that this is an associative problem, so you will need a simple state class; a sketch is given below. When you start a trial, pick a starting state uniformly at random from the possible states. Since this is a continuing problem, you will never reach a terminal (a.k.a. absorbing) state; terminate each trial after a maximum number of steps of your choosing. DO NOT return your agent to a naive state at the start of each trial.
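For reference, here is a minimal sketch of such a state class. The class and field names are illustrative only, not part of the provided base classes; Example 4.2 allows at most 20 cars at each location.

// Minimal sketch of a state for Jack's Car Rental: the number of cars
// at each location at the end of the day (0..20 per Example 4.2).
// Names are illustrative, not part of the provided base classes.
public class CarRentalState {
    public final int carsAtLocation1;
    public final int carsAtLocation2;

    public CarRentalState(int carsAtLocation1, int carsAtLocation2) {
        this.carsAtLocation1 = carsAtLocation1;
        this.carsAtLocation2 = carsAtLocation2;
    }

    // equals/hashCode so states can be used as keys in a Q-value table.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CarRentalState)) return false;
        CarRentalState s = (CarRentalState) o;
        return carsAtLocation1 == s.carsAtLocation1
            && carsAtLocation2 == s.carsAtLocation2;
    }

    @Override
    public int hashCode() {
        return 21 * carsAtLocation1 + carsAtLocation2;
    }
}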
Since this is an associative problem, it is recommended that you keep a table of action-value Q estimates for each possible state-action pair. One update rule you can use, following the incremental implementation pattern of Section 2.5, is:
Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + alpha * (r_{t+1} + gamma * max_a Q_t(s_{t+1},a) - Q_t(s_t,a_t))
where alpha is the step size and gamma is given in Example 4.2 as 0.9. (Underscores are used here to indicate subscripts.) This is an example of what is known as Q-learning (Chapter 6). You will likely need many trials to gain enough experience to approximate optimal behavior. At the end of all of your trials, print out the approximate optimal policy your agent has learned. Compare this policy with that shown in Figure 4.4. A table of numbers is adequate, although a contour plot like that of Figure 4.4 or a 3D plot (x = cars at loc. #1, y = cars at loc. #2, z = cars moved) is preferred.
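A minimal sketch of such a table and update is shown below, assuming a HashMap keyed by state-action strings and the CarRentalState class sketched above. The class and method names are illustrative, not part of the provided base classes; adapt the idea to those classes.

import java.util.HashMap;
import java.util.Map;

// Sketch of a tabular Q-learning update for the rule above.
// Keys are "cars1,cars2|action" strings for simplicity; a dedicated
// state-action pair class would also work.
public class QTable {
    private final Map<String, Double> q = new HashMap<>();
    private final double alpha;  // step size
    private final double gamma;  // discount rate (0.9 in Example 4.2)

    public QTable(double alpha, double gamma) {
        this.alpha = alpha;
        this.gamma = gamma;
    }

    private String key(CarRentalState s, int action) {
        return s.carsAtLocation1 + "," + s.carsAtLocation2 + "|" + action;
    }

    public double get(CarRentalState s, int action) {
        return q.getOrDefault(key(s, action), 0.0);
    }

    // max_a Q(s,a) over candidate actions (net cars moved overnight).
    // Example 4.2 allows at most five cars to be moved; legality may also
    // depend on how many cars are actually at each location.
    public double maxQ(CarRentalState s) {
        double best = Double.NEGATIVE_INFINITY;
        for (int a = -5; a <= 5; a++) {
            best = Math.max(best, get(s, a));
        }
        return best;
    }

    // Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t)
    //   + alpha * (r_{t+1} + gamma * max_a Q_t(s_{t+1},a) - Q_t(s_t,a_t))
    public void update(CarRentalState s, int action, double reward, CarRentalState next) {
        double old = get(s, action);
        double target = reward + gamma * maxQ(next);
        q.put(key(s, action), old + alpha * (target - old));
    }
}

After each environment step you would call something like update(state, action, reward, nextState), with an action-selection method from Chapter 2 (e.g. epsilon-greedy or softmax) choosing the action taken.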
Reminder: Use/extend/implement the reinforcement base classes provided.