•One extreme: always choose the action that looks best so far (greedy).
–What can potentially happen?
–Early bias toward actions that happened to yield positive reward
–Bias against exploring for an even better reward
•Another extreme: Always choose actions randomly with equal probability
–Ignores what it has learned → behavior remains random
•Want behavior between greedy and random extremes
•Simulated annealing ideas apply here: start with random behavior to gather information, then gradually become greedy to improve performance.
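The annealing idea above can be sketched with Boltzmann (softmax) action selection, where a temperature parameter moves behavior from nearly uniform random (high temperature) to nearly greedy (low temperature). The slide does not name a specific method, so this is one possible realization. The function names, the geometric cooling schedule, and the constants `t_start`, `t_end`, and `decay` are illustrative assumptions.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over estimated action values.

    High temperature gives nearly equal probabilities (random extreme);
    low temperature concentrates probability on the best-looking action
    (greedy extreme).
    """
    best = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - best) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def annealed_temperature(step, t_start=10.0, t_end=0.01, decay=0.99):
    """Hypothetical geometric cooling schedule; the constants are assumptions."""
    return max(t_end, t_start * decay ** step)
```

Calling `select_action(q_values, annealed_temperature(step))` at each step starts out close to uniform random choice and drifts toward greedy choice as `step` grows. How fast to cool is a tuning decision: cooling too quickly reproduces the early-bias problem of the greedy extreme.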