CS 371 - Introduction to Artificial Intelligence
4th Hour Project: OpenAI Gym Q-Learning
Note: You may work on this either in pairs or independently for separate grades. No groups of 3 or more are permitted.
Implement basic Q-learning through the Deeplizard Frozen Lake tutorial:
Tune the learning parameters (other than num_episodes or max_steps_per_episode) so as to get a consistent success rate above 0.7. It is possible to do so just by tuning min_exploration_rate and exploration_decay_rate.
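The success rate depends heavily on how exploration decays across episodes. As a point of reference, here is a minimal sketch of an exponential decay schedule in the style of the tutorial; the specific values below are illustrative assumptions, not recommended settings:

import numpy as np

# Illustrative values only -- tune these to reach a consistent success rate above 0.7.
max_exploration_rate = 1.0
min_exploration_rate = 0.01     # floor: the agent never stops exploring entirely
exploration_decay_rate = 0.001  # larger values shift from exploring to exploiting sooner

def exploration_rate(episode):
    # Exponential decay from max_exploration_rate toward min_exploration_rate.
    return min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

# With these values, exploration is about 0.61 at episode 500 and 0.14 at episode 2000.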
Install it by running "pip install -e ." from the root directory of this unzipped new environment. See the linked documentation in the previous bullet for more details.
# Approach n OpenAI Gym Environment

The dice game "Approach n" is played with 2 players and a single standard 6-sided die (d6). The goal is to approach a total of n without exceeding it. The first player rolls a die until they either (1) "hold" (i.e. end their turn) with a roll sum less than or equal to n, or (2) exceed n and lose. If the first player holds at exactly n, the first player wins immediately. If the first player holds with less than n, the second player must roll until the second player's roll sum (1) exceeds the first player's roll sum without exceeding n and wins, or (2) exceeds n and loses. Note that the first player is the only player with a choice of play policy. For n >= 10, the game is nearly fair, i.e. it can be won approximately half of the time with optimal decisions.

- Non-terminal states: Player 1 turn totals less than n (n is a terminal state)
- Actions: 0 (HOLD), 1 (ROLL)
- Rewards: +1 for transition to terminal state with win, -1 for transition to terminal state with loss, 0 otherwise
- Transitions: Each die roll of 1-6 is equiprobable
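Before working with the Gym environment below, it can help to see the rules as plain code. Here is a minimal sketch of a single game in which Player 1 follows a hypothetical fixed policy of holding at the first total of hold_at or more; the function name play_game, the default hold_at=8, and the fixed policy itself are illustrative assumptions, not part of the provided environment:

import random

def play_game(n=10, hold_at=8):
    """Simulate one game of Approach n; return True if Player 1 wins."""
    # Player 1 rolls until reaching a total of at least hold_at (the hold policy).
    p1 = 0
    while p1 < hold_at:
        p1 += random.randint(1, 6)
    if p1 > n:
        return False    # Player 1 exceeded n and loses
    if p1 == n:
        return True     # holding at exactly n is an immediate win
    # Player 2 must roll until exceeding Player 1's total or exceeding n.
    p2 = 0
    while p2 <= p1:
        p2 += random.randint(1, 6)
    return p2 > n       # Player 1 wins only if Player 2 exceeded n

# Averaging play_game(10, 8) over many games should give a win rate near 0.5,
# consistent with the game being nearly fair for n >= 10.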
import gym
import gym_approach_n
import random
import time
import numpy as np
from IPython.display import clear_output
env = gym.make("approach-n-v1")
env.n = n = 10 # You can try out different values of n here.
action_space_size = 2 # 0=HOLD, 1=ROLL
state_space_size = n + 1 # nonterminal Player 1 totals 0 through
# (n - 1); terminal state is n
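The q_table referenced in the policy printout below is the result of a Q-learning training loop like the one in the tutorial. A minimal sketch follows, assuming the classic Gym API (reset() returns a state, step() returns (new_state, reward, done, info)); all parameter values here are illustrative and should be tuned:

q_table = np.zeros((state_space_size, action_space_size))

# Illustrative parameter values -- tune as in the Frozen Lake exercise.
num_episodes = 10000
learning_rate = 0.1
discount_rate = 0.99
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
exploration_rate = max_exploration_rate

for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection: exploit the Q-table or explore at random.
        if random.uniform(0, 1) > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = random.randrange(action_space_size)
        new_state, reward, done, info = env.step(action)
        # Tabular Q-learning update toward the bootstrapped target.
        q_table[state, action] = (1 - learning_rate) * q_table[state, action] + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        state = new_state
    # Decay exploration after each episode.
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)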
print('Policy: Hold at {}\n'.format(', '.join([str(i) for i in range(n) if q_table[i,0] > q_table[i,1]])))
When demonstrating your agent's play, you'll benefit from longer delays between text-rendering updates.
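For example, a demonstration loop along the following lines works well (a sketch assuming the same classic Gym API as above; the one-second delay and the number of games shown are arbitrary choices):

# Watch the trained agent play a few games, pausing so the text rendering is readable.
for episode in range(3):
    state = env.reset()
    done = False
    while not done:
        clear_output(wait=True)
        env.render()
        time.sleep(1)    # longer delay so each text-rendered update can be read
        action = np.argmax(q_table[state, :])   # greedy play from the learned Q-table
        state, reward, done, info = env.step(action)
    clear_output(wait=True)
    env.render()
    print('Player 1 wins!' if reward == 1 else 'Player 1 loses.')
    time.sleep(2)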
Optional Challenge: Using the default Q-learning parameters of the tutorial, the agent usually learns the optimal "Hold at 8, 9" policy for n=10. (A total of 10 is an automatic win.) However, it will sometimes learn "Hold at 7, 8, 9" or "Hold at 9". What are ways you can improve the learning so as to more reliably learn the optimal policy?
Demonstrate a form of Q-learning for a third problem of your choice that adds something to your understanding of and experience with Q-learning. Suggestions:
This work will be shared in the 2nd-to-last class of the semester.