CS 104: Introduction to Computer Science

Q Learning

•Can't learn optimal policy π* directly – no <s,a> training pairs

•However, agent can seek to learn V* - the value function of the optimal policy:

–π*(s) = argmax_a [r(s,a) + γV*(δ(s,a))]

–recall V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + …

–learning V* only gives you π* if you have perfect knowledge of r and δ (the agent doesn't necessarily)

–let Q(s,a) = r(s,a) + γV*(δ(s,a))

–then π*(s) = argmax_a Q(s,a)

•Learn Q → learn π* without knowledge of r and δ!
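To make the last point concrete, here is a minimal sketch of tabular Q learning on a made-up four-state chain. Everything in it is an assumption for illustration: the world (states 0 to 3, with 3 as the goal), the reward of 100 for entering the goal, the discount 0.9, and the names step, q_learn and greedy_policy. It also uses the standard update for deterministic worlds, Q(s,a) ← r + γ max over a' of Q(s',a'), which this slide does not state. The learner only ever calls step and sees the reward and next state that come back; it never reads r or δ directly.

```python
import random

GAMMA = 0.9                    # discount factor (assumed value)
GOAL = 3                       # absorbing goal state (assumed)
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]


def step(s, a):
    """Hypothetical environment: hides r and delta, returns (reward, next state)."""
    s_next = max(s - 1, 0) if a == "left" else s + 1
    r = 100.0 if s_next == GOAL else 0.0
    return r, s_next


def q_learn(episodes=200, seed=0):
    """Learn a Q table from experience, with random exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice(STATES[:-1])          # start in a non-goal state
        while s != GOAL:
            a = rng.choice(ACTIONS)          # explore uniformly at random
            r, s_next = step(s, a)           # observe outcome only
            # Deterministic-world update: Q(s,a) <- r + gamma * max_a' Q(s',a')
            q[(s, a)] = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
            s = s_next
    return q


def greedy_policy(q):
    """pi*(s) = argmax_a Q(s,a), for every non-goal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)])
            for s in STATES if s != GOAL}


if __name__ == "__main__":
    q = q_learn()
    for (s, a), value in sorted(q.items()):
        print(s, a, round(value, 2))
    print(greedy_policy(q))
```

By the definition Q(s,a) = r(s,a) + γV*(δ(s,a)), the true values in this toy world are Q(2,right) = 100, Q(1,right) = 90 and Q(0,right) = 81, so the learned table should approach those numbers and the greedy policy should move right in every state.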