CS 391 Selected Topics: Game AI Homework #1
Due at the beginning of class on Thursday 2/2.
Note: This work is to be done in groups of 2. Each group will submit one assignment. Although
you may divide the work, both team members should be able to present/describe
their partner's work upon request.
0. HW1 Preparation: Download and study the HW1 starter code
here.
1. N-Armed Bandit Player Implementation: The challenge
problem is described as follows: 10 n-armed bandits are numbered with a 0-based
index. Each offers a reward normally distributed about its mean with a standard
deviation of 1. The bandit reward means themselves are drawn from a normal
distribution with mean 0 and standard deviation 1. Each game will consist
of 1000 pulls. A player will be tested for 10000 games, proceeding in
sequence from an initial random seed.
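The reward model described above can be sketched as follows. This is only an illustration, not the starter code's NArmedBandit: the class name SketchBandit, its fields, and its pull method are hypothetical, and the sketch assumes the description means a single bandit with 10 arms.

```java
import java.util.Random;

/** Hypothetical stand-in for illustration only; NOT the starter code's NArmedBandit. */
class SketchBandit {
    final double[] means;      // true mean reward of each arm
    private final Random rng;

    SketchBandit(int numArms, long seed) {
        rng = new Random(seed);
        means = new double[numArms];
        for (int i = 0; i < numArms; i++) {
            means[i] = rng.nextGaussian();   // arm mean drawn from N(0, 1)
        }
    }

    /** Pull the arm with the given 0-based index; reward is N(means[arm], 1). */
    double pull(int arm) {
        return means[arm] + rng.nextGaussian();
    }

    public static void main(String[] args) {
        SketchBandit bandit = new SketchBandit(10, 0L);
        double total = 0.0;
        for (int t = 0; t < 1000; t++) {
            total += bandit.pull(t % 10);    // cycle through the arms, no learning
        }
        System.out.println("Total reward over 1000 pulls: " + total);
    }
}
```

Because the generator is seeded, two bandits built with the same seed produce the same means and the same reward sequence, which is what makes a run from a fixed initial seed repeatable.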
Testing is initiated by executing
NArmedBanditTester, which makes use of the
NArmedBandit class (which should not be
modified), and a class you specify within the code that implements the
NArmedBanditPlayer interface.
The testing code is initially set up to use the given
RandomPlayer as a usage illustration.
Performance is judged according to average cumulative expected regret.
The expected regret of an arm pull is the difference between the mean reward of
the best arm and the arm pulled. (There is zero expected regret from
pulling the best arm.) The expected regrets are accumulated over each
game. The average cumulative expected regret is the average of these
cumulative expected regrets over all games.
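The regret definitions above can be written out as a small helper. This is a sketch for checking your understanding; the class and method names are hypothetical and do not come from the starter code, which computes the score for you.

```java
/** Hypothetical helper for illustration; names are NOT from the starter code. */
class RegretSketch {
    /** Expected regret of one pull: best arm's mean minus the pulled arm's mean. */
    static double expectedRegret(double[] means, int arm) {
        double best = Double.NEGATIVE_INFINITY;
        for (double m : means) {
            best = Math.max(best, m);
        }
        return best - means[arm];
    }

    /** Cumulative expected regret: sum over the arms pulled in one game. */
    static double cumulativeRegret(double[] means, int[] pulls) {
        double sum = 0.0;
        for (int arm : pulls) {
            sum += expectedRegret(means, arm);
        }
        return sum;
    }

    /** Average cumulative expected regret over all games. */
    static double average(double[] perGameRegret) {
        double sum = 0.0;
        for (double r : perGameRegret) {
            sum += r;
        }
        return sum / perGameRegret.length;
    }
}
```

For example, with arm means 0.5, 1.5, and -1.0, pulling arms 1, 0, 2, 1 gives expected regrets 0, 1.0, 2.5, and 0, for a cumulative expected regret of 3.5. Note that regret depends on the arm means, not on the noisy rewards actually received.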
According to the
NArmedBanditPlayer interface, implement an epsilon-Greedy player, a UCB1 player,
and a third player of your choice. The third player may either be
specified in
the readings or of your own design. In your NArmedBanditTester, have three
lines for initializing your players with your best problem parameters. Comment
out two of these lines, leaving the initialization of the player that works best
uncommented.
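As a starting point, the two required selection rules might look like the sketch below. It is a sketch under assumptions, not a reference solution: the class SelectionSketch and its methods are hypothetical and do not implement the NArmedBanditPlayer interface, whose actual methods you must take from the starter code. The UCB1 index uses the standard bonus sqrt(2 ln t / n_i), which was derived for rewards in [0, 1]; with normally distributed rewards the constant may need tuning.

```java
import java.util.Random;

/** Hypothetical sketch; it does NOT implement the starter code's NArmedBanditPlayer. */
class SelectionSketch {
    final int[] counts;        // number of pulls of each arm
    final double[] estimates;  // sample-mean reward of each arm
    int totalPulls = 0;

    SelectionSketch(int numArms) {
        counts = new int[numArms];
        estimates = new double[numArms];
    }

    /** Epsilon-greedy: explore uniformly with probability epsilon, else exploit. */
    int chooseEpsilonGreedy(double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(counts.length);
        }
        return argMax(estimates);
    }

    /** UCB1: try every arm once, then maximize estimate + sqrt(2 ln t / n_i). */
    int chooseUcb1() {
        double[] index = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] == 0) {
                return i;  // an untried arm is pulled first
            }
            index[i] = estimates[i] + Math.sqrt(2.0 * Math.log(totalPulls) / counts[i]);
        }
        return argMax(index);
    }

    /** Incremental sample-mean update after observing a reward. */
    void update(int arm, double reward) {
        counts[arm]++;
        totalPulls++;
        estimates[arm] += (reward - estimates[arm]) / counts[arm];
    }

    /** Index of the largest value; ties go to the lowest index. */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

The only problem parameter here is epsilon for the epsilon-greedy rule; a fixed value and a value that decays over the 1000 pulls are both worth comparing when you tune.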
(A correct implementation of epsilon-Greedy, even crudely
tuned, should yield a player that consistently scores an average cumulative
regret of less than 135. For seed
zero, a player of my design can score 62.35.)