|
|
|
|
|
Learning is such an important part of what we
consider "intelligence" that it appears in one common definition: |
|
intelligence: the ability to learn or understand
or to deal with new or trying situations. (Webster's) |
|
Intelligent agents make mistakes, but one might
argue that they don't make the same mistakes perpetually. |
|
|
|
|
If-then decision structures (decision trees) |
|
Function approximation (neural networks) |
|
Action (control) policy (reinforcement learning) |
|
etc. → lots of things! |
|
Learning agents have at least one adaptive
component of their architecture. |
|
|
|
|
|
Connectionism – intelligence bottom up |
|
Small, simple components |
|
Connected together in a large network |
|
Give rise to complex (intelligent) behaviors. |
|
Can complex behavior be learned from a simple
process? |
|
Our brief foray into artificial neural networks
(ANNs) will limit itself to a few simple goals: |
|
|
|
|
|
Understand |
|
the perceptron, the basic unit of ANN
computation (as the transistor is to circuits), |
|
the perceptron learning rule, |
|
the class of functions a perceptron can
represent, |
|
multilayer feed-forward networks, |
|
the back-propagation learning algorithm, and |
|
the momentum variant of back-propagation. |
|
Experience the strengths/weaknesses of
multilayer feed-forward methods through experimentation. |
|
|
|
|
|
|
Basic unit of brain computation |
|
Dendrites |
|
many local to cell body |
|
sense input |
|
Axon |
|
one reaching ~1cm from cell body |
|
transmits output |
|
Axons connect to dendrites through synapses |
|
|
|
|
|
Complex electrochemical process: |
|
Synapses release chemicals… |
|
Chemicals increase dendrite electrical
potential… |
|
When potential reaches a threshold, … |
|
An electrical pulse (action potential) goes down
axon to synapses, so … |
|
Synapses release chemicals… |
|
|
|
|
Computers have a much faster clock speed. |
|
Brains are much, much more parallel → more
unit updates per sec than computer |
|
Brains are more adaptive → grow
into tasks |
|
Brains exhibit graceful degradation: gradual
rather than sharp drop off in performance as conditions worsen |
|
|
|
|
|
Brain has many desirable characteristics that
most computers lack |
|
plasticity, self-adaptivity |
|
massive parallelism |
|
graceful degradation |
|
What would be a simple computation unit from
which to build a "computer brain"? |
|
|
|
|
weighted sum of inputs: $in_i = \sum_j W_{j,i} \, a_j$ |
|
output from activation function: $a_i = g(in_i)$ |
|
|
|
|
(a) t = step threshold (can replace with extra
input weight $W_{0,i} = t$ and fixed $a_0 = -1$) |
|
(c) $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$ |
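A minimal Python sketch of one unit, assuming the step function outputs 1 when its argument reaches the threshold. The names `step`, `sigmoid`, and `unit_output` are illustrative, not from the slides. It also shows the threshold t being folded into an extra weight on a fixed input of -1.

```python
import math

def step(x, t=0.0):
    """Step activation: 1 when x reaches threshold t, else 0 (assumed convention)."""
    return 1 if x >= t else 0

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g):
    """Compute a_i = g(in_i), where in_i is the weighted sum of the inputs."""
    in_i = sum(w * a for w, a in zip(weights, inputs))
    return g(in_i)

# A unit with an explicit threshold t ...
t = 1.5
with_threshold = unit_output([1.0, 1.0], [1, 1], lambda x: step(x, t))

# ... and the same unit with t folded into weight W0 = t on fixed input a0 = -1.
folded = unit_output([t, 1.0, 1.0], [-1, 1, 1], step)
```

Both forms give the same output because subtracting t from the weighted sum and comparing with 0 is the same test as comparing the sum with t.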
|
|
|
|
|
Suppose you have a 2-input unit with a step
function and a fixed threshold t. |
|
Let x, y be inputs. |
|
What set of points on the x-y plane are at the
unit's threshold? (Simplify equations.) |
|
Answer: The line $W_{x,i}\,x + W_{y,i}\,y = t$ |
|
rewritten: $y = (-W_{x,i}/W_{y,i})\,x + (t/W_{y,i})$ |
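As a quick check of the rewritten form, the sketch below computes the slope and intercept for one set of hypothetical weights and confirms that a point on the line has a weighted sum equal to t. The function name and the numbers are illustrative assumptions, and the rewriting only applies when the y weight is nonzero.

```python
def boundary_line(w_x, w_y, t):
    """Return (slope, intercept) of the line w_x*x + w_y*y = t, assuming w_y != 0."""
    return -w_x / w_y, t / w_y

# Hypothetical example values, chosen only for illustration.
w_x, w_y, t = 2.0, 4.0, 2.0
slope, intercept = boundary_line(w_x, w_y, t)

# Pick any x, compute y on the line, and recompute the unit's weighted sum.
x = 3.0
y = slope * x + intercept
weighted_sum = w_x * x + w_y * y  # sits exactly at the threshold t
```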
|
|
|
|
|
For values 0, 1 corresponding to false, true, and
a unit with a 0/1-step function, can you choose $W_{1,i}$, $W_{2,i}$,
and t so as to compute: |
|
AND? |
|
OR? |
|
IMPLIES? |
|
EQUIVALENT? |
|
|
|
|
|
For values 0, 1 corresponding to false, true, and
a unit with a 0/1-step function, can you choose $W_{1,i}$, $W_{2,i}$,
and t so as to compute: |
|
AND? (1, 1, 1.5) |
|
OR? (1, 1, .5) |
|
IMPLIES? (-1, 1, -.5) |
|
EQUIVALENT? (NOT POSSIBLE – Why?) |
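The sketch below checks candidate (W1, W2, t) triples against truth tables, assuming 1 = true, 0 = false, and a unit that outputs 1 when the weighted sum reaches t. The IMPLIES triple used here, (-1, 1, -0.5), is one choice that satisfies the truth table under those assumptions. The small grid search at the end is only an illustration that no triple on that grid computes EQUIVALENT; it is not a proof.

```python
from itertools import product

def threshold_unit(w1, w2, t):
    """Two-input unit with a 0/1 step at threshold t (assumed: fires when sum >= t)."""
    return lambda x, y: 1 if w1 * x + w2 * y >= t else 0

def truth_table(f):
    """Outputs for inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    return [f(x, y) for x, y in product((0, 1), repeat=2)]

and_table = truth_table(threshold_unit(1, 1, 1.5))
or_table = truth_table(threshold_unit(1, 1, 0.5))
implies_table = truth_table(threshold_unit(-1, 1, -0.5))

# Search a coarse grid of weights and thresholds for EQUIVALENT (table 1,0,0,1).
grid = [k / 2 for k in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
equivalent_found = any(
    truth_table(threshold_unit(w1, w2, t)) == [1, 0, 0, 1]
    for w1, w2, t in product(grid, repeat=3)
)
```

The search fails for a geometric reason: the two inputs where EQUIVALENT is true sit on opposite corners of the unit square, and no single line puts both on one side with the other two corners on the other side.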
|
|
|
|
One 2-input unit activates for all inputs on one
side of a line |
|
3 inputs → plane,
n inputs → hyperplane |
|
|
|
|
|
|
A perceptron has |
|
input units $I_j$ |
|
input weights $W_j$ |
|
step activation function $\mathrm{step}_0$ |
|
output O |
|
$O = \mathrm{step}_0\left(\sum_j W_j \, I_j\right)$ |
|
|
|
|
|
Suppose one randomizes initial weights and has a
set of desired input, output pairs. |
|
Iterate: |
|
Compute O from inputs |
|
Compute error Err = T – O from correct output T |
|
Adjust weights: $W_j \leftarrow W_j + \alpha \, I_j \, \mathit{Err}$, where $\alpha$ is the learning rate. |
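The iteration above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the threshold is handled as a weight on a fixed first input of -1, initial weights are drawn uniformly from [-0.5, 0.5] with a fixed seed, and training stops after an epoch with no errors. The function names, the seed, and the AND training set are illustrative choices, not part of the slides.

```python
import random

def step0(x):
    """Step activation with threshold 0 (assumed: outputs 1 when x >= 0)."""
    return 1 if x >= 0 else 0

def predict(weights, inputs):
    """O = step0(sum_j W_j * I_j)."""
    return step0(sum(w * i for w, i in zip(weights, inputs)))

def train_perceptron(examples, alpha=0.1, max_epochs=1000, seed=0):
    """Apply W_j <- W_j + alpha * I_j * Err until an epoch has no errors."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in examples[0][0]]
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            err = target - predict(weights, inputs)  # Err = T - O
            if err != 0:
                mistakes += 1
                weights = [w + alpha * i * err for w, i in zip(weights, inputs)]
        if mistakes == 0:
            break
    return weights

# AND, with a fixed first input of -1 standing in for the threshold.
and_examples = [((-1, 0, 0), 0), ((-1, 0, 1), 0), ((-1, 1, 0), 0), ((-1, 1, 1), 1)]
learned = train_perceptron(and_examples)
```

Because AND is linearly separable, the loop ends with weights that classify all four examples correctly. The exact weights depend on the seed and learning rate.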
|
|
|
|
Perceptron learning is a gradient descent search
through the space of possible weights. |
|
Each training example provides an "error
surface" for weights. The learning
rule runs weights downhill with learning rate $\alpha$ as step size. |
|
For linearly separable functions, there are no
local minima, and learning is guaranteed to converge if the learning rate $\alpha$ is not too
high (overshoot). |
|
Summary: Very effective for very simple
representable functions. |
|
|
|
|
|
Is there any hope for learning functions that
are not linearly separable? |
|
Yes, but a perceptron network isn't enough. |
|
One needs more than one layer of units between
inputs and outputs to compute other functions. |
|
With enough "hidden" units (units
within), any boolean function is computable, and any continuous function is
approximable. |
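As one illustration of what a hidden layer buys, the sketch below computes EQUIVALENT with two hidden threshold units feeding one output unit. The weights are hand-picked assumptions (hidden units for AND and NOR, output unit for OR), with 1 = true and a unit that fires when its weighted sum reaches the threshold. Other weight choices work as well.

```python
from itertools import product

def threshold_unit(w1, w2, t):
    """Two-input unit with a 0/1 step at threshold t (assumed: fires when sum >= t)."""
    return lambda x, y: 1 if w1 * x + w2 * y >= t else 0

# Hidden layer: one unit detects "both true", the other "both false".
both_true = threshold_unit(1, 1, 1.5)      # AND
both_false = threshold_unit(-1, -1, -0.5)  # NOR

# Output layer: fire if either hidden unit fires.
either = threshold_unit(1, 1, 0.5)         # OR

def equivalent(x, y):
    """Two-layer network: OR(AND(x, y), NOR(x, y))."""
    return either(both_true(x, y), both_false(x, y))

# Outputs for inputs (0,0), (0,1), (1,0), (1,1), in that order.
equivalent_table = [equivalent(x, y) for x, y in product((0, 1), repeat=2)]
```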
|
|
|
|
|
|
|
Basic idea: Supply training inputs, computation
feeds forward, error computed with training output, error propagates
backward for weight updates. |
|
Start with final layer |
|
Update output weights of layer according to
layer output error as with perceptron learning rule |
|
Assign error to units of previous layer
according to weights |
|
Repeat this process backwards through layers |
|
|
|
|
Error computation makes use of the slope of the
activation function, so we need to use continuous activation functions. |
|
The sigmoid function is typical.
$\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$
$\mathrm{sigmoid}'(x) = \mathrm{sigmoid}(x)\,(1 - \mathrm{sigmoid}(x))$ |
|
Error
term $\Delta_i = \mathit{Err}_i \, g'(in_i)$ |
|
|
|
|
Updates to output units:
$W'_{j,i} \leftarrow W_{j,i} + \alpha \, a_j \, \Delta_i$ |
|
Computation of error for previous layer
units:
$\Delta_j \leftarrow g'(in_j) \sum_i W_{j,i} \, \Delta_i$ |
|
Process continues with previous layer:
$W'_{k,j} \leftarrow W_{k,j} + \alpha \, a_k \, \Delta_j$
$\Delta_k \leftarrow g'(in_k) \sum_j W_{k,j} \, \Delta_j$ |
|
Repeat until input layer is reached (e.g. $a_k = I_k$) |
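A minimal sketch of one back-propagation step for a network with a single hidden layer, written with plain Python lists. Several choices here are assumptions rather than anything the slides specify: sigmoid activations throughout, all error terms computed from the weights as they were before the update, a threshold supplied only as a fixed -1 input, and the specific starting weights. The function names are illustrative.

```python
import math

def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    """Slope of the sigmoid: g(x) * (1 - g(x))."""
    s = g(x)
    return s * (1.0 - s)

def forward(w_hidden, w_out, inputs):
    """w_hidden[k][j]: input k -> hidden j; w_out[j][i]: hidden j -> output i."""
    in_hidden = [sum(w_hidden[k][j] * inputs[k] for k in range(len(inputs)))
                 for j in range(len(w_hidden[0]))]
    a_hidden = [g(x) for x in in_hidden]
    in_out = [sum(w_out[j][i] * a_hidden[j] for j in range(len(a_hidden)))
              for i in range(len(w_out[0]))]
    return in_hidden, a_hidden, in_out, [g(x) for x in in_out]

def backprop_step(w_hidden, w_out, inputs, targets, alpha):
    """Return updated copies of both weight matrices after one example."""
    in_hidden, a_hidden, in_out, a_out = forward(w_hidden, w_out, inputs)
    # Output layer: Delta_i = Err_i * g'(in_i)
    delta_out = [(t - o) * g_prime(x) for t, o, x in zip(targets, a_out, in_out)]
    # Hidden layer: Delta_j = g'(in_j) * sum_i W_{j,i} * Delta_i (pre-update weights)
    delta_hidden = [g_prime(in_hidden[j]) *
                    sum(w_out[j][i] * delta_out[i] for i in range(len(delta_out)))
                    for j in range(len(a_hidden))]
    # W_{j,i} <- W_{j,i} + alpha * a_j * Delta_i
    new_w_out = [[w_out[j][i] + alpha * a_hidden[j] * delta_out[i]
                  for i in range(len(delta_out))] for j in range(len(a_hidden))]
    # W_{k,j} <- W_{k,j} + alpha * a_k * Delta_j, with a_k = I_k at the input layer
    new_w_hidden = [[w_hidden[k][j] + alpha * inputs[k] * delta_hidden[j]
                     for j in range(len(delta_hidden))] for k in range(len(inputs))]
    return new_w_hidden, new_w_out

def squared_error(w_hidden, w_out, inputs, targets):
    """Half the sum of squared output errors for one example."""
    a_out = forward(w_hidden, w_out, inputs)[3]
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, a_out))

# Hypothetical 3-input (first input fixed at -1), 2-hidden, 1-output network.
w_hidden = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
w_out = [[0.3], [-0.1]]
x, target = [-1, 1, 0], [1]
before = squared_error(w_hidden, w_out, x, target)
new_w_hidden, new_w_out = backprop_step(w_hidden, w_out, x, target, alpha=0.5)
after = squared_error(new_w_hidden, new_w_out, x, target)
```

With these update rules, each weight moves along the negative gradient of the half squared error, so for a small enough learning rate the error on the presented example goes down, as it does here from `before` to `after`.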
|
|
|
|
|
|
When updating a weight, also add the previous
update to that weight times a momentum constant m (0.0 <= m < 1.0). |
|
Possible to carry weights |
|
across plateaux in error surface |
|
through local minima to global minima |
|
through global minima to local minima (i.e. can
have undesirable effects as well). |
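The momentum rule can be sketched as a small helper that remembers the previous update. The name `momentum_update`, the plateau scenario, and the numbers are illustrative assumptions; `gradient_step` stands for the ordinary back-propagation update for that weight (learning rate times activation times error term).

```python
def momentum_update(weight, gradient_step, previous_update, m=0.9):
    """Add the plain update plus m times the previous update (0.0 <= m < 1.0)."""
    update = gradient_step + m * previous_update
    return weight + update, update

# On a plateau the plain update stays small and roughly constant.
# With momentum, successive updates build up toward gradient_step / (1 - m).
weight, update = 0.0, 0.0
updates = []
for _ in range(50):
    weight, update = momentum_update(weight, gradient_step=0.1,
                                     previous_update=update, m=0.5)
    updates.append(update)
```

In this toy run the effective step grows from 0.1 toward 0.2, which is how momentum carries weights across flat regions. The same accumulated step is what can carry weights past a minimum.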
|
|
|