Homework #3

CS 391 Selected Topics: Game AI

Due at the beginning of class on Thursday 2/16.

Note: This work is to be done in groups of 2. Each group will submit one assignment. Although you may divide the work, each team member should be able to present/describe their partner's work upon request.

0. HW3 Preparation: Download the HW3 starter code here.

1. TD(0) SARSA Learning of Piglet Solitaire: Piglet Solitaire is a simple jeopardy coin game. The goal is to reach a given goal score in a given number of turns. For this exercise, the goal score is 6 and the number of turns is 10.

Initially, a player's score is 0, the turn total is 0, and the turn number is 0. (The last possible turn number is the number of turns - 1.)
A player's choice is always simply whether to "flip" a coin or to "hold" and end the turn.
If the player chooses to "flip" there are two equiprobable outcomes:
- HEAD - the turn total increases by 1, and the turn continues, or
- TAIL - the turn total resets to 0, and the turn ends with the score unchanged. Note that the turn number increments at the end of a turn.
If the player chooses to "hold", the score is increased by the turn total, the turn total resets to 0, and the turn ends.
The player wins if, within a given number of turns, the player's score reaches a given goal score.
An optimal player chooses to "flip" or "hold" so as to maximize the probability of winning.
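
The rules above can be sketched as a small immutable state class. This is only an illustration in Java, not part of the starter code: the class name PigletState and all of its members are hypothetical, and the sketch assumes a game counts as won once a hold banks a score at or above the goal, even when that hold ends the last turn.

```java
/** Hypothetical helper (not from the starter code): one game state and its transitions. */
final class PigletState {
    static final int GOAL = 6;    // goal score for this exercise
    static final int TURNS = 10;  // number of turns for this exercise

    final int score;      // banked score
    final int turn;       // turn number, 0..TURNS-1 while the game is running
    final int turnTotal;  // points accumulated this turn, not yet banked

    PigletState(int score, int turn, int turnTotal) {
        this.score = score;
        this.turn = turn;
        this.turnTotal = turnTotal;
    }

    /** "hold": bank the turn total, reset it, and end the turn. */
    PigletState hold() {
        return new PigletState(score + turnTotal, turn + 1, 0);
    }

    /** "flip": HEAD adds 1 to the turn total; TAIL resets it and ends the turn. */
    PigletState flip(boolean head) {
        return head
            ? new PigletState(score, turn, turnTotal + 1)
            : new PigletState(score, turn + 1, 0);
    }

    /** Assumption: the win is decided by the banked score only. */
    boolean isWin() {
        return score >= GOAL;
    }

    /** The game is over after a win or once all turns are used. */
    boolean isTerminal() {
        return isWin() || turn >= TURNS;
    }
}
```

A simulated game would start from new PigletState(0, 0, 0) and apply flip(rng.nextBoolean()) or hold() until isTerminal() returns true.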

Implementing the PigletSolitairePlayer interface, create a player that, at time of construction, simulates play and computes an approximately optimal policy using the SARSA algorithm. Your algorithm should complete its computation within one minute on our lab machines. You will test your implementation using the PigletSolitairePlayerEvaluator class.

Self-check:

Output for an optimal player: p[0][0][0] = 0.5418925201520324

Note that your policy will likely not be optimal. Rather, it should approximate optimal play. You are free to choose a policy underlying SARSA that is not epsilon-greedy. This policy may dynamically adjust behavior as training progresses. You may also experiment with varying the learning rate. However, the algorithm should essentially be SARSA. How close to optimal can you get?
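
To make the shape of the computation concrete, here is a minimal sketch of tabular SARSA for this game in Java. It is one possible arrangement, not a reference solution and not the starter code's API. The class name SarsaPigletSketch, its methods, the table layout, and the linear epsilon and learning-rate schedules are all assumptions. The sketch also assumes the player holds automatically once score plus turn total reaches the goal, which keeps the table small. The reward is 1 for a win and 0 otherwise, with no discounting, so the learned values estimate win probabilities.

```java
import java.util.Random;

/** Hypothetical sketch: tabular SARSA with a decaying epsilon-greedy policy. */
class SarsaPigletSketch {
    static final int GOAL = 6, TURNS = 10, FLIP = 0, HOLD = 1;

    // q[score][turn][turnTotal][action]: estimated win probability of the action.
    final double[][][][] q = new double[GOAL][TURNS][GOAL][2];
    private final Random rng;

    SarsaPigletSketch(long seed) {
        rng = new Random(seed);
    }

    /** Epsilon-greedy choice; ties are broken in favor of FLIP. */
    int chooseAction(int score, int turn, int total, double epsilon) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(2);
        return q[score][turn][total][FLIP] >= q[score][turn][total][HOLD] ? FLIP : HOLD;
    }

    /** Simulates one game, applying the SARSA update after every action. */
    void runEpisode(double alpha, double epsilon) {
        int score = 0, turn = 0, total = 0;
        int action = chooseAction(score, turn, total, epsilon);
        while (true) {
            int nScore = score, nTurn = turn, nTotal = total;
            if (action == HOLD) { nScore += total; nTurn++; nTotal = 0; }
            else if (rng.nextBoolean()) { nTotal++; }   // HEAD
            else { nTurn++; nTotal = 0; }               // TAIL

            // Assumption: reaching the goal with score + turn total is an automatic hold and win.
            boolean won = nScore + nTotal >= GOAL;
            boolean terminal = won || nTurn >= TURNS;
            int nAction = terminal ? -1 : chooseAction(nScore, nTurn, nTotal, epsilon);
            // SARSA target: terminal reward, or the value of the action actually chosen next.
            double target = terminal ? (won ? 1.0 : 0.0) : q[nScore][nTurn][nTotal][nAction];
            q[score][turn][total][action] += alpha * (target - q[score][turn][total][action]);
            if (terminal) return;
            score = nScore; turn = nTurn; total = nTotal; action = nAction;
        }
    }

    /** Trains with exploration and learning rate that shrink linearly over time. */
    void train(int episodes) {
        for (int i = 0; i < episodes; i++) {
            double remaining = 1.0 - (double) i / episodes;
            runEpisode(0.01 + 0.09 * remaining, 0.2 * remaining);
        }
    }

    /** Greedy policy read off the learned table. */
    boolean shouldFlip(int score, int turn, int total) {
        return q[score][turn][total][FLIP] >= q[score][turn][total][HOLD];
    }

    /** Estimated win probability from the initial state under the greedy policy. */
    double startValue() {
        return Math.max(q[0][0][0][FLIP], q[0][0][0][HOLD]);
    }
}
```

After something like new SarsaPigletSketch(seed).train(300000), startValue() gives a rough estimate that can be compared against the self-check value. The estimate is noisy and depends on the seed and schedules, so the player's measured win rate is the more reliable check.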

The submission should have the PigletSolitairePlayerEvaluator modified to evaluate your implementation rather than the included hold-at-2 player (which is not a bad policy).
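
As a baseline for comparison, the win rate of a hold-at-2 policy can be estimated by simulation. The sketch below is a hypothetical stand-alone Java class, not the included evaluator or player. It assumes "hold-at-2" means flipping until the turn total reaches 2 and then holding. Under that reading each turn banks 2 points with probability 1/4, so a win needs at least 3 successful turns out of 10, which works out to a win probability of about 0.474.

```java
import java.util.Random;

/** Hypothetical sketch: Monte Carlo win-rate estimate for a hold-at-2 policy. */
class HoldAtTwoEstimate {
    static final int GOAL = 6, TURNS = 10, HOLD_AT = 2;

    /** Plays one game; returns true if the goal score is reached within TURNS turns. */
    static boolean playGame(Random rng) {
        int score = 0;
        for (int turn = 0; turn < TURNS; turn++) {
            int turnTotal = 0;
            boolean tail = false;
            // Flip until the turn total reaches HOLD_AT, the goal is in reach, or a TAIL occurs.
            while (!tail && turnTotal < HOLD_AT && score + turnTotal < GOAL) {
                if (rng.nextBoolean()) turnTotal++;   // HEAD
                else tail = true;                     // TAIL: turn total is lost
            }
            if (!tail) score += turnTotal;            // hold and bank the turn total
            if (score >= GOAL) return true;
        }
        return false;
    }

    /** Fraction of simulated games won. */
    static double estimateWinRate(int games, long seed) {
        Random rng = new Random(seed);
        int wins = 0;
        for (int i = 0; i < games; i++) {
            if (playGame(rng)) wins++;
        }
        return (double) wins / games;
    }
}
```

Calling HoldAtTwoEstimate.estimateWinRate(200000, 1L) should land close to 0.474, a few points below the optimal value given in the self-check.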