• Can't learn the optimal policy π* directly – no <s,a> training pairs
• However, the agent can seek to learn V* – the value function of the optimal policy:
  – π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
  – recall Vπ(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …
  – can only learn V* this way if you have perfect knowledge of r and δ (which you don't necessarily have)
  – let Q(s,a) = r(s,a) + γ V*(δ(s,a))
  – then π*(s) = argmax_a Q(s,a)
• Learn Q → learn π* without knowledge of r and δ! (see the sketch below)
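
To make the last bullet concrete, here is a minimal tabular Q-learning sketch in Python. The environment (a 4x4 grid world with a +10 goal reward) and every name in it (GRID_SIZE, step, greedy_action, the choice of γ = 0.9, the purely random exploration policy) are illustrative assumptions, not part of the slides. It demonstrates exactly the point above: the learner never evaluates r or δ as known functions; it only sees sampled transitions (s, a, r, s'), pushes Q(s,a) toward r + γ·max_a' Q(s',a'), and then reads off π*(s) as the action maximizing Q(s,a).

import random

GRID_SIZE = 4                         # illustrative 4x4 grid; states are (row, col) tuples
GOAL = (3, 3)                         # moving into the goal yields reward +10
ACTIONS = ["up", "down", "left", "right"]
GAMMA = 0.9                           # discount factor (the slides' gamma)
EPISODES = 500

def step(state, action):
    """The environment's delta(s,a) and r(s,a). The learner treats this as a
    black box: it only ever sees the (next_state, reward) samples it returns."""
    row, col = state
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, GRID_SIZE - 1)
    elif action == "left":
        col = max(col - 1, 0)
    else:                             # "right"
        col = min(col + 1, GRID_SIZE - 1)
    next_state = (row, col)
    reward = 10.0 if next_state == GOAL else 0.0
    return next_state, reward

# Q-table: Q[(s, a)] estimates r(s,a) + gamma * V*(delta(s,a))
Q = {((r, c), a): 0.0
     for r in range(GRID_SIZE) for c in range(GRID_SIZE) for a in ACTIONS}

def greedy_action(state):
    """pi*(s) = the action a maximizing Q(s,a)."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for _ in range(EPISODES):
    state = (0, 0)
    for _ in range(1000):                          # cap episode length
        if state == GOAL:
            break                                  # goal is absorbing; end the episode
        action = random.choice(ACTIONS)            # explore; Q-learning is off-policy
        next_state, reward = step(state, action)   # sampled experience only
        # Deterministic-world update: Q(s,a) <- r + gamma * max_a' Q(s',a')
        Q[(state, action)] = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        state = next_state

# The greedy policy w.r.t. the learned Q recovers pi* with no model of r or delta.
print("pi*((0,0)) =", greedy_action((0, 0)))                  # "down" (ties with "right")
print("Q((3,2), right) =", round(Q[((3, 2), "right")], 2))    # 10.0

Because Q-learning is off-policy, even the purely random behavior policy used above is enough: once every state-action pair has been visited often enough, the greedy policy extracted from the learned Q is optimal.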