CS 371: Introduction to Artificial Intelligence
Neural Networks

Machine Learning
Learning is such an important part of what we consider "intelligence" that it appears in one common definition:
intelligence: the ability to learn or understand or to deal with new or trying situations. (Webster's)
Intelligent agents make mistakes, but one might argue that they don't make the same mistakes perpetually.

What do agents learn?
If-then decision structures (decision trees)
Function approximation (neural networks)
Action (control) policy (reinforcement learning)
etc. → lots of things!
Learning agents have at least one adaptive component of their architecture.

Connectionism
Connectionism – intelligence bottom up
Small, simple components
Connected together in a large network
Give rise to complex (intelligent) behaviors.
Can complex behavior be learned from a simple process?
Our brief foray into artificial neural networks (ANNs) will limit itself to a few simple goals:

Goals
Understand
the perceptron, the basic unit of ANN computation (as the transistor is to circuits),
the perceptron learning rule,
the class of functions a perceptron can represent,
multilayer feed-forward networks,
the back-propagation learning algorithm, and
the momentum variant of back-propagation.
Experience the strengths/weaknesses of multilayer feed-forward methods through experimentation.

The Neuron

The Neuron (cont.)
Basic unit of brain computation
Dendrites
many, local to the cell body
sense input
Axon
one, reaching ~1 cm from the cell body
transmits output
Axons connect to dendrites through synapses

The Neuron (cont.)
Complex electrochemical process:
Synapses release chemicals…
Chemicals increase dendrite electrical potential…
When potential reaches a threshold, …
An electrical pulse (action potential) goes down axon to synapses, so …
Synapses release chemicals…

Computer & Brain – a comparison
Computers have a much faster clock speed.
Brains are much, much more parallel → more unit updates per second than a computer.
Brains are more adaptive → they grow into their tasks.
Brains exhibit graceful degradation: a gradual rather than sharp drop-off in performance as conditions worsen.

Motivation for Neural Network
Brain has many desirable characteristics that most computers lack
plasticity, self-adaptivity
massive parallelism
graceful degradation
What would be a simple computation unit from which to build a "computer brain"?

The Neural Network Unit
weighted sum of inputs: in_i = sum_j(W_j,i × a_j)
output from activation function: a_i = g(in_i)
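A minimal Python sketch of this unit computation (the function names, weights, and inputs are illustrative, not from the slides):

    # Sketch of a single unit: weighted sum of inputs, then an activation function g.
    def step0(x):                      # a simple 0/1 step activation at threshold 0
        return 1 if x >= 0 else 0

    def unit_output(weights, inputs, g):
        # in_i = sum_j(W_j,i * a_j);  a_i = g(in_i)
        in_i = sum(w * a for w, a in zip(weights, inputs))
        return g(in_i)

    # Illustrative 2-input unit (weights and inputs are made up):
    print(unit_output([0.5, -0.3], [1.0, 2.0], step0))   # step0(0.5*1.0 - 0.3*2.0) = step0(-0.1) = 0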

Activation Functions
(a) step function with threshold t (the threshold can be replaced by an extra input weight W_0,i = t on a fixed input a_0 = -1)
(c) sigmoid(x) = 1/(1 + e^(-x))
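A quick hedged check, with made-up weights, that folding the step threshold t into an extra weight on a fixed input a_0 = -1 (as noted in (a)) leaves the unit's behavior unchanged:

    # Sketch: a step unit with threshold t behaves like a threshold-0 unit
    # with an extra weight W_0 = t on a fixed input a_0 = -1.
    def step(x, t=0.0):
        return 1 if x >= t else 0

    weights = [0.7, 0.4]   # illustrative weights, not from the slides
    t = 0.5                # step threshold

    for a1, a2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        with_threshold = step(weights[0]*a1 + weights[1]*a2, t)
        with_bias = step(t*(-1) + weights[0]*a1 + weights[1]*a2, 0.0)
        assert with_threshold == with_bias
    print("threshold-as-weight check passed")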

Understanding What Units Compute
Suppose you have a 2-input unit with a step function and a fixed threshold t.
Let x, y be inputs.
What set of points on the x-y plane are at the unit's threshold? (Simplify equations.)
Answer: the line W_x,i × x + W_y,i × y = t
rewritten: y = -(W_x,i / W_y,i) × x + t / W_y,i

In-Class Exercise: Units as Logic Gates
For values 1, 0 corresponding to true, false, and a unit with a 0/1 step function, can you choose W_1,i, W_2,i, and t so as to compute:
AND?
OR?
IMPLIES?
EQUIVALENT?

In-Class Exercise: Units as Logic Gates
For values 1, 0 corresponding to true, false, and a unit with a 0/1 step function, can you choose W_1,i, W_2,i, and t so as to compute:
AND? (1, 1, 1.5)
OR? (1, 1, .5)
IMPLIES? (-1, 1, -.5)
EQUIVALENT? (NOT POSSIBLE – Why?)
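One hedged way to verify these settings is to run all four input pairs through a 0/1 step unit (Python sketch; the gate helper is made up):

    # Sketch: checking the gate weights above by enumerating all inputs (1 = true, 0 = false).
    def gate(w1, w2, t):
        return lambda a, b: 1 if w1*a + w2*b >= t else 0

    AND = gate(1, 1, 1.5)
    OR = gate(1, 1, 0.5)
    IMPLIES = gate(-1, 1, -0.5)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "IMPLIES:", IMPLIES(a, b))
    # EQUIVALENT (true exactly when a == b) has no such weights: its true points (0,0) and (1,1)
    # cannot be separated from (0,1) and (1,0) by a single line -- see the next slide.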

Linear Separability
One 2-input unit activates for all inputs on one side of a line
3 inputs → plane,  n inputs → hyperplane

3-Input Unit and Plane of Separation

Perceptrons
A perceptron has
input units I_j
input weights W_j
step activation function step_0 (threshold at 0)
output O
O = step_0(sum_j(W_j × I_j))

Perceptron Learning Rule
Suppose one randomizes the initial weights and has a set of desired (input, output) pairs.
Iterate:
Compute O from inputs
Compute error Err = T – O from correct output T
Adjust weights: W_j ← W_j + α × I_j × Err, where α is the learning rate (sketched below).
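A minimal Python sketch of this rule, trained on OR as an assumed example (the data, learning rate, and epoch count are illustrative):

    # Sketch: perceptron learning of OR; the fixed -1 input carries the threshold weight W_0.
    def step0(x):
        return 1 if x >= 0 else 0

    examples = [((-1, 0, 0), 0), ((-1, 0, 1), 1), ((-1, 1, 0), 1), ((-1, 1, 1), 1)]

    W = [0.0, 0.0, 0.0]
    alpha = 0.1
    for _ in range(25):                                    # iterate over the training set
        for I, T in examples:
            O = step0(sum(w * i for w, i in zip(W, I)))    # compute O from inputs
            Err = T - O                                    # error from the correct output T
            W = [w + alpha * i * Err for w, i in zip(W, I)]

    print(W)   # for a linearly separable target like OR, these weights converge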

Perceptron Learning
Perceptron learning is a gradient descent search through the space of possible weights.
Each training example provides an "error surface" over the weights.  The learning rule moves the weights downhill, with the learning rate α as the step size.
For linearly separable functions there are no local minima, and learning is guaranteed to converge if the learning rate α is not too high (too large a step can overshoot).
Summary: very effective, but only for the simple class of functions a perceptron can represent.

Network Learning Algorithm

Is There Hope?
Is there any hope for learning functions that are not linearly separable?
Yes, but a perceptron network isn't enough.
One needs at least one layer of "hidden" units between the inputs and the outputs to compute such functions.
With enough hidden units, any Boolean function can be computed, and any continuous function can be approximated (example below).
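For instance, a hedged sketch with hand-picked weights (not from the slides): one hidden layer of two step units computes EQUIVALENT, which no single unit can represent.

    # Sketch: EQUIVALENT (XNOR) from one hidden layer of two step units (hand-picked weights).
    def step(x, t):
        return 1 if x >= t else 0

    def equivalent(a, b):
        h1 = step(a + b, 1.5)        # hidden unit: a AND b
        h2 = step(-a - b, -0.5)      # hidden unit: (NOT a) AND (NOT b)
        return step(h1 + h2, 0.5)    # output unit: h1 OR h2

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, equivalent(a, b))   # prints 1 exactly when a == b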

Simple Multilayer Feed-Forward Network

Multilayer Feed-Forward Network with 1 Hidden Layer

Back-Propagation
Basic idea: supply the training inputs, let the computation feed forward, compute the error against the training outputs, and propagate the error backward for weight updates.
Start with the final layer:
Update the weights into that layer according to its output error, as with the perceptron learning rule.
Assign error to the units of the previous layer in proportion to the weights.
Repeat this process backward through the layers.

Back-Propagation (cont.)
Error computation makes use of the slope of the activation function, so we need to use continuous activation functions.
The sigmoid function is typical.
sigmoid(x) = 1/(1 + e^(-x))
sigmoid'(x) = sigmoid(x)(1 – sigmoid(x))
Error term: Δ_i = Err_i × g'(in_i)

Back-Propagation (cont.)
Updates to the output units:
W_j,i ← W_j,i + α × a_j × Δ_i
Computation of error for previous-layer units:
Δ_j ← g'(in_j) × sum_i(W_j,i × Δ_i)
The process continues with the previous layer:
W_k,j ← W_k,j + α × a_k × Δ_j
Δ_k ← g'(in_k) × sum_j(W_k,j × Δ_j)
Repeat until the input layer is reached (where a_k = I_k); see the sketch below.
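Putting these updates together, here is a hedged Python sketch of back-propagation with one hidden layer; the task (EQUIVALENT), network size, learning rate, and epoch count are all assumed for illustration.

    # Sketch: back-propagation for one hidden layer, following the Delta and weight updates above.
    import math, random

    def g(x):                                   # sigmoid activation
        return 1.0 / (1.0 + math.exp(-x))

    examples = [([0, 0], 1), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]   # EQUIVALENT

    random.seed(1)
    n_hid = 3
    W_kj = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(n_hid)]   # [-1, x1, x2] -> hidden
    W_ji = [random.uniform(-1, 1) for _ in range(n_hid + 1)]                   # [-1, h1..h3] -> output
    alpha = 0.5

    for _ in range(10000):
        for x, T in examples:
            I = [-1.0, float(x[0]), float(x[1])]              # the fixed -1 input carries each threshold
            a_j = [-1.0] + [g(sum(W_kj[j][k] * I[k] for k in range(3))) for j in range(n_hid)]
            a_i = g(sum(W_ji[j] * a_j[j] for j in range(n_hid + 1)))
            # Delta_i = Err_i * g'(in_i), using g'(in) = g(in)(1 - g(in)) = a(1 - a)
            D_i = (T - a_i) * a_i * (1.0 - a_i)
            # Delta_j = g'(in_j) * W_j,i * Delta_i for each hidden unit (skip the fixed -1 "unit")
            D_j = [a_j[j] * (1.0 - a_j[j]) * W_ji[j] * D_i for j in range(1, n_hid + 1)]
            # W_j,i <- W_j,i + alpha * a_j * Delta_i ;  W_k,j <- W_k,j + alpha * a_k * Delta_j
            W_ji = [W_ji[j] + alpha * a_j[j] * D_i for j in range(n_hid + 1)]
            for j in range(n_hid):
                for k in range(3):
                    W_kj[j][k] += alpha * I[k] * D_j[j]

    for x, T in examples:
        I = [-1.0, float(x[0]), float(x[1])]
        a_j = [-1.0] + [g(sum(W_kj[j][k] * I[k] for k in range(3))) for j in range(n_hid)]
        print(x, T, round(g(sum(W_ji[j] * a_j[j] for j in range(n_hid + 1))), 2))
    # Outputs should end up close to the targets; training can occasionally stall in a local minimum.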

Momentum
When updating a weight, also add the previous update to that weight, multiplied by a momentum constant m (0.0 <= m < 1.0); a sketch follows the list.
Momentum makes it possible to carry the weights
across plateaux in the error surface,
through local minima toward the global minimum, but also
through the global minimum into a local minimum (i.e., it can have undesirable effects as well).
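A hedged sketch of a single weight update with momentum (the names and numbers are illustrative):

    # Sketch: weight update with momentum; 'prev_update' remembers the last change to this weight.
    def update_weight(w, gradient_step, prev_update, m=0.9):
        # gradient_step is the plain back-propagation change (alpha * a * Delta);
        # m is the momentum constant, 0.0 <= m < 1.0.
        change = gradient_step + m * prev_update
        return w + change, change            # new weight, plus the update to remember next time

    w, prev = 0.2, 0.0
    for step in [0.05, 0.04, 0.0, 0.0]:      # illustrative gradient steps; the last two mimic a plateau
        w, prev = update_weight(w, step, prev)
        print(round(w, 4), round(prev, 4))   # the weight keeps moving even when the gradient step is 0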

Error Surface

Text Notation