Homework #1

CS 391 Selected Topics: Game AI

Due at the beginning of class on Thursday 2/2.

Note: This work is to be done in groups of 2. Each group will submit one assignment. Although you may divide the work, both team members should be able to present/describe their partner's work upon request.

0. HW1 Preparation: Download and study the HW1 starter code here.

1. N-Armed Bandit Player Implementation: The challenge problem is as follows. The n-armed bandit has n = 10 arms, numbered with a 0-based index. Each arm offers a reward that is normally distributed about that arm's mean with a standard deviation of 1. The arm reward means themselves are drawn from a normal distribution with mean 0 and standard deviation 1. Each game consists of 1000 pulls. A player will be tested for 10000 games, proceeding in sequence from an initial random seed.
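To make the reward model concrete, here is a minimal sketch of a bandit that follows the description above. The class BanditSketch and its methods are hypothetical names used only for illustration; this is not the NArmedBandit class from the starter code, and its random number usage will not reproduce the tester's results for a given seed.

```java
import java.util.Random;

/** Hypothetical stand-in for illustration; NOT the starter code's NArmedBandit. */
final class BanditSketch {
    private final double[] means; // true mean reward of each arm
    private final Random random;

    BanditSketch(int numArms, long seed) {
        random = new Random(seed);
        means = new double[numArms];
        for (int i = 0; i < numArms; i++) {
            means[i] = random.nextGaussian(); // arm means drawn from N(0, 1)
        }
    }

    /** Pull the arm with the given 0-based index; the reward is N(means[arm], 1). */
    double pull(int arm) {
        return means[arm] + random.nextGaussian();
    }

    int numArms() {
        return means.length;
    }

    /** True mean of an arm; a player never sees this, only the scoring code does. */
    double mean(int arm) {
        return means[arm];
    }
}
```

A game under these assumptions would construct one BanditSketch with 10 arms and call pull 1000 times.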

Testing is initiated by executing NArmedBanditTester, which makes use of the NArmedBandit class (which should not be modified), and a class you specify within the code that implements the NArmedBanditPlayer interface. The testing code is initially set up to use the given RandomPlayer as a usage illustration.

Performance is judged according to average cumulative expected regret. The expected regret of an arm pull is the difference between the mean reward of the best arm and the mean reward of the arm pulled. (There is zero expected regret from pulling the best arm.) The expected regrets are accumulated over each game. The average cumulative expected regret is the average of these cumulative expected regrets over all games.
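The scoring measure can be sketched in a few lines. The class RegretSketch and its method names are hypothetical, and the tester in the starter code may compute the same quantity differently; the sketch only mirrors the definition above.

```java
/** Hypothetical helper for illustration; mirrors the regret definition in the text. */
final class RegretSketch {
    /** Expected regret of one pull: best arm's mean minus the pulled arm's mean. */
    static double expectedRegret(double[] means, int arm) {
        double best = means[0];
        for (double m : means) {
            best = Math.max(best, m);
        }
        return best - means[arm];
    }

    /** Cumulative expected regret of one game, given the arms pulled in order. */
    static double cumulativeRegret(double[] means, int[] pulls) {
        double total = 0.0;
        for (int arm : pulls) {
            total += expectedRegret(means, arm);
        }
        return total;
    }

    /** Average of the per-game cumulative expected regrets. */
    static double average(double[] perGame) {
        double total = 0.0;
        for (double r : perGame) {
            total += r;
        }
        return total / perGame.length;
    }
}
```

For example, with arm means {0.5, 1.25, -0.25} and pulls of arms 0, 1, 2, 1, the per-pull expected regrets are 0.75, 0, 1.5, and 0, so the cumulative expected regret of that short game is 2.25. Note that regret is computed from the true means, not from the noisy rewards the player observes.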

Implement three players that conform to the NArmedBanditPlayer interface: an epsilon-Greedy player, a UCB1 player, and a third player of your choice. The third player may either be taken from the readings or be of your own design. In your NArmedBanditTester, include three lines that initialize your players with the best parameters you found for this problem. Comment out two of these lines, leaving only the initialization of the best-performing player uncommented.
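As a starting point, the two named selection rules can be sketched as static helpers over per-arm pull counts and sample-mean estimates. The class SelectionSketch and every method name in it are hypothetical; the sketch deliberately does not implement NArmedBanditPlayer, whose method signatures are defined in the starter code, so you will need to adapt the logic to that interface. The UCB1 bonus shown is the textbook form sqrt(2 ln t / n_i), which was derived for rewards in [0, 1]; with the unbounded normal rewards here, treating the exploration constant as a tunable parameter is a reasonable assumption.

```java
import java.util.Random;

/** Hypothetical helpers for illustration; adapt to the NArmedBanditPlayer interface. */
final class SelectionSketch {
    /** Index of the largest value (first index wins ties). */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Epsilon-greedy: explore uniformly with probability epsilon, else exploit. */
    static int epsilonGreedy(double[] meanEstimates, double epsilon, Random random) {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(meanEstimates.length);
        }
        return argMax(meanEstimates);
    }

    /** UCB1: try every arm once, then maximize estimate + sqrt(2 ln t / n_i). */
    static int ucb1(double[] meanEstimates, int[] counts, int totalPulls) {
        double[] index = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] == 0) {
                return i; // untried arm: pull it before comparing bounds
            }
            index[i] = meanEstimates[i]
                    + Math.sqrt(2.0 * Math.log(totalPulls) / counts[i]);
        }
        return argMax(index);
    }

    /** Incremental sample-mean update after observing a reward from an arm. */
    static void update(double[] meanEstimates, int[] counts, int arm, double reward) {
        counts[arm]++;
        meanEstimates[arm] += (reward - meanEstimates[arm]) / counts[arm];
    }
}
```

In a player built on these helpers, each turn would choose an arm with one of the selection methods, pull it, and pass the observed reward to update. The value of epsilon (and whether it decays over the 1000 pulls) is the main parameter to tune for the epsilon-Greedy player.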

(A correct implementation of epsilon-Greedy, even crudely tuned, should yield a player that consistently scores an average cumulative expected regret of less than 135. For seed zero, a player of my design can score 62.35.)