Nondeterministic Rewards and Actions
What if the reward $r$ and the state transition $\delta$ are nondeterministic (e.g., the roll of a die in a game)?
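$$
\begin{aligned}
V^{\pi}(s_t) &= E\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\right] \\
             &= E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]
\end{aligned}
$$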
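As a small worked illustration of the expected discounted return above, the sketch below assumes a hypothetical game in which every step pays the face value of a fair six-sided die, so each step has the same expected reward. By linearity of expectation, the expectation of the discounted sum equals the discounted sum of the per-step expectations. The function name, the horizon of 200 steps, and the discount of 9/10 are illustrative choices, not part of the original material.

```python
from fractions import Fraction


def expected_discounted_return(expected_reward, gamma, horizon):
    """Truncated sum_{i=0}^{horizon-1} gamma^i * E[r].

    Assumes every step has the same expected reward (hypothetical setup).
    """
    return sum((gamma ** i) * expected_reward for i in range(horizon))


# Expected value of one fair die roll: (1 + 2 + ... + 6) / 6 = 7/2.
die_mean = Fraction(sum(range(1, 7)), 6)

# Illustrative discount factor (an assumption for this sketch).
gamma = Fraction(9, 10)

# Truncating the infinite sum after 200 terms leaves only a tiny remainder.
truncated = expected_discounted_return(die_mean, gamma, 200)

# Geometric-series limit of the infinite sum: E[r] / (1 - gamma).
closed_form = die_mean / (1 - gamma)

print(float(truncated), float(closed_form))
```

The truncated sum approaches the geometric-series limit as the horizon grows, which is why a discount factor below 1 keeps the expectation finite.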
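$$
\begin{aligned}
Q(s,a) &= E\left[r(s,a) + \gamma V^*(\delta(s,a))\right] \\
       &= E[r(s,a)] + \gamma E\left[V^*(\delta(s,a))\right] \\
       &= E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s')
\end{aligned}
$$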
Therefore, since $V^*(s') = \max_{a'} Q(s',a')$:
$$
Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')
$$
Note: This is a recursive definition of Q, not an update rule.
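To make the final expression for Q concrete, here is a minimal sketch that evaluates its right-hand side once for a given table of Q values. It is not a learning rule: it only computes E[r(s,a)] + γ Σ P(s'|s,a) max Q(s',a') from quantities that are assumed to be known. The two-state problem, the state and action names, the Q table, and the helper `expected_backup` are all hypothetical and chosen for illustration.

```python
def expected_backup(state, action, Q, expected_reward, transition, gamma):
    """Evaluate E[r(s,a)] + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a')."""
    future = sum(
        prob * max(Q[next_state].values())
        for next_state, prob in transition[(state, action)].items()
    )
    return expected_reward[(state, action)] + gamma * future


# Hypothetical Q table: Q[state][action].
Q = {
    "s0": {"stay": 1.0, "go": 2.0},
    "s1": {"stay": 0.0, "go": 4.0},
}

# Hypothetical expected reward for taking "go" in "s0" (mean of a fair die).
expected_reward = {("s0", "go"): 3.5}

# Hypothetical transition probabilities P(s' | s0, go).
transition = {("s0", "go"): {"s0": 0.25, "s1": 0.75}}

value = expected_backup("s0", "go", Q, expected_reward, transition, gamma=0.5)
print(value)  # 3.5 + 0.5 * (0.25 * 2.0 + 0.75 * 4.0) = 5.25
```

Note that the function takes P(s'|s,a) and E[r(s,a)] as inputs. An agent that only observes sampled rewards and transitions would not have these available directly.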