CS 104: Introduction to Computer Science

Nondeterministic Rewards and Actions

•What if $r$ and $\delta$ are nondeterministic (e.g., the roll of a die in a game)?

•$V^{\pi}(s_t) = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$
$= E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$

•$Q(s,a) = E[r(s,a) + \gamma V^*(\delta(s,a))]$
$= E[r(s,a)] + \gamma E[V^*(\delta(s,a))]$
$= E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) V^*(s')$

•Therefore:
$Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$

•Note: This equation characterizes Q recursively; it is not an update rule.
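
The expansion of Q(s,a) into an expected reward plus a discounted, probability-weighted sum over successor states can be checked on a small die-roll example. The sketch below is illustrative only and is not from the lecture: the game rules (the reward equals the roll, and a roll of 5 or 6 leads to a state named "win"), the V* values, and the discount 0.9 are all assumptions. It evaluates the expression for one fixed (s,a) pair in two ways, exactly and by sampling, and does not learn or update anything.

```python
import random
from fractions import Fraction

# All numbers below are hypothetical, chosen only for illustration.
GAMMA = Fraction(9, 10)                                # discount factor
V_STAR = {"win": Fraction(10), "lose": Fraction(0)}    # assumed optimal values


def outcome(roll):
    """Reward and next state for one die roll (hypothetical game rules)."""
    reward = roll                                  # reward equals the face shown
    next_state = "win" if roll >= 5 else "lose"    # 5 or 6 leads to "win"
    return reward, next_state


def q_exact():
    """Compute E[r(s,a)] + gamma * sum over s' of P(s'|s,a) * V*(s')."""
    p_roll = Fraction(1, 6)                        # fair six-sided die
    expected_reward = sum(p_roll * outcome(k)[0] for k in range(1, 7))

    # Build P(s'|s,a) by adding up the probability of each roll per successor.
    transition = {}
    for k in range(1, 7):
        next_state = outcome(k)[1]
        transition[next_state] = transition.get(next_state, 0) + p_roll

    expected_next = sum(p * V_STAR[s2] for s2, p in transition.items())
    return expected_reward + GAMMA * expected_next


def q_sampled(n, seed=0):
    """Estimate E[r(s,a) + gamma * V*(delta(s,a))] by averaging n sampled rolls."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        reward, next_state = outcome(rng.randint(1, 6))
        total += reward + float(GAMMA) * float(V_STAR[next_state])
    return total / n


if __name__ == "__main__":
    print("exact  :", q_exact())
    print("sampled:", q_sampled(100_000))
```

Under these assumed numbers the exact value is 7/2 + (9/10)(1/3)(10) = 13/2, and the sampled average should approach it as n grows, which is the linearity-of-expectation step in the derivation of Q(s,a).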