• One extreme (greedy): Always choose the action that looks best so far.
  – What can potentially happen?
  – Early bias towards positive reward experience
  – Bias against exploring for an even better reward
• Another extreme: Always choose actions randomly with equal probability
  – Ignores what it has learned → behavior remains random
• Want behavior between the greedy and random extremes
• Simulated annealing ideas are applicable here: Start with random behavior to gather information, then gradually become greedy to improve performance.
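
The move from random to greedy behavior can be sketched in code. The slides do not name a specific method, so what follows is only one possible reading: an epsilon-greedy rule whose exploration probability is annealed linearly from 1 (fully random) to 0 (fully greedy). The names `epsilon_at`, `choose_action`, and `run_bandit`, the linear schedule, and the Gaussian multi-armed bandit used as a test bed are all assumptions made for illustration.

```python
import random


def epsilon_at(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal epsilon from `start` to `end` (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def choose_action(estimates, epsilon, rng):
    """With probability epsilon act randomly, otherwise act greedily."""
    if rng.random() < epsilon:
        # Random extreme: every action equally likely.
        return rng.randrange(len(estimates))
    # Greedy extreme: the action that looks best so far.
    return max(range(len(estimates)), key=lambda a: estimates[a])


def run_bandit(true_means, total_steps=2000, seed=0):
    """Hypothetical test bed: each action pays a noisy reward around its mean."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # running average reward per action
    counts = [0] * len(true_means)       # how often each action was tried
    for step in range(total_steps):
        eps = epsilon_at(step, total_steps)
        action = choose_action(estimates, eps, rng)
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental update of the sample mean for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


if __name__ == "__main__":
    estimates, counts = run_bandit([0.2, 0.5, 1.5])
    print("estimated rewards:", [round(e, 2) for e in estimates])
    print("times chosen:", counts)
```

Setting `epsilon` to 0 throughout reproduces the first extreme and setting it to 1 reproduces the second. A temperature-based (Boltzmann) rule would be closer to simulated annealing in the literal sense; the epsilon schedule is used here only because it is shorter to write down.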