Chapter 1

CS 391 - Special Topic: Machine Learning

Readings

1/17: Sections 1.1-1.3
1/20: Sections 1.4-1.5 [1.6-1.7], Kerr & Neller's Java Resources for Teaching Reinforcement Learning (skip TD material)

Topics

Reinforcement learning (RL) problem
Exploration versus exploitation
RL system: policy, reward function, value function, model
Tic-Tac-Toe temporal difference (TD) learning example
[History]

Discussion Questions

What is reinforcement learning (RL)?
What is supervised learning and how is it different from RL?
What is the exploration-exploitation tradeoff and why is it important?
What are examples of reinforcement learning applications? (See also the case studies in Ch. 11.)
Imagine and describe a potential reinforcement learning application not mentioned in Chapter 1 or 11.
What are the four main parts of an RL system?
What is a policy? What kind of policy does an RL agent seek to learn?
What is a reward function? What would be the biological analogue?
What is a value function? How does it differ from the reward function? How are they related?
What is a model? How do planning techniques use models, and how does this differ from learning by trial and error?
Describe the Tic-Tac-Toe example in terms of these parts.
What is a greedy policy? Why can an RL agent learn suboptimal behavior with a greedy policy?
Write and explain the Tic-Tac-Toe example value function update. Rewrite the update with the right-hand side having only a single V(s) term. What is the parameter alpha, what values can alpha have, and how does it affect learning?
Why is this a temporal-difference learning method?
RL is not limited to Tic-Tac-Toe. In what respects is it more generally applicable?
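
As a starting point for the value function update question above, here is a minimal Python sketch of a tabular temporal-difference update of the form V(s) <- V(s) + alpha * [V(s') - V(s)]. The function name td_update, the dictionary-based value table, the state labels, and the default initial value of 0.5 are assumptions made for illustration, not code from the readings.

```python
def td_update(values, state, next_state, alpha=0.1, default=0.5):
    """Move V(state) a fraction alpha of the way toward V(next_state).

    values: dict mapping states to value estimates (hypothetical table).
    default: assumed initial estimate for states not yet in the table.
    """
    v_s = values.get(state, default)
    v_next = values.get(next_state, default)
    # Two-term form: V(s) <- V(s) + alpha * [V(s') - V(s)]
    values[state] = v_s + alpha * (v_next - v_s)
    return values[state]


def td_update_single_term(v_s, v_next, alpha):
    """Algebraically equivalent form with a single V(s) term on the right."""
    return (1 - alpha) * v_s + alpha * v_next


# Hypothetical example: a state "s" followed by a winning state valued 1.0.
values = {"win": 1.0}
new_v = td_update(values, "s", "win", alpha=0.1)
same_v = td_update_single_term(0.5, 1.0, 0.1)
print(new_v, same_v)  # both approximately 0.55
```

With alpha = 0 the estimate never changes, and with alpha = 1 it is replaced outright by the successor's value; values in between blend the two.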

Optional:
Give an overview of the history of reinforcement learning, describing its three main threads and the most important and influential works.
Discuss possible RL project ideas for later this semester.

Programming Assignment

None. (See the readings above.)