CS 371: Introduction to Artificial Intelligence

Probabilistic Reasoning

Introduction to Uncertain Reasoning

Qualification problem in FOL

e.g. car doesn’t start

Problems:

no exceptions? ⇒ big rules!

no knowledge of the likelihood of each exception

no complete theory (e.g. medical science)

even given complete rules, sometimes we only have partial evidence

Probability to the Rescue!

Possible solution: probabilities

summarizes uncertainty

gives likelihood information, incomplete theory can be refined, can handle partial evidence, but…

rules can still be big ⇒ stay tuned for simplifying assumptions

How might an agent use probabilities to choose actions given percepts?

Decision-Theoretic Agent

Iterate:

Update evidence with current percept

Compute outcome probabilities for actions

Select action with maximum expected utility given probable outcomes

utility - the quality of being useful

decision theory = probability theory + utility theory
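The action-selection step of the loop above can be sketched in a few lines. This is only an illustration: the action names, outcome names, probabilities, and utilities are all hypothetical, and a real agent would compute the outcome probabilities from its evidence.

```python
def expected_utility(outcome_probs, utility):
    """Sum of P(outcome) * U(outcome) over the outcomes of one action."""
    return sum(p * utility[outcome] for outcome, p in outcome_probs.items())


def choose_action(action_models, utility):
    """Return the action whose expected utility is largest."""
    return max(action_models, key=lambda a: expected_utility(action_models[a], utility))


# Hypothetical utilities and outcome probabilities, for illustration only.
utility = {"dry": 10, "wet": -20, "dry_but_encumbered": 6}
action_models = {
    "take_umbrella": {"dry_but_encumbered": 1.0},
    "leave_umbrella": {"dry": 0.8, "wet": 0.2},
}

best = choose_action(action_models, utility)
print(best)
```

With these made-up numbers, leaving the umbrella has the best single outcome but the lower expected utility, which is the point of weighting utilities by probabilities.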

Prior Probabilities

unconditional or prior probabilities – probabilities without prior information (i.e. before evidence).

P(A) is the probability of A in the absence of other information.

Suppose we have a discrete random variable Weather that can take on 4 values: Sunny, Rainy, Cloudy, or Snowy.

How to form prior probabilities?

Forming Prior Probabilities

In absence of any information at all, we might say all outcomes are equally likely.

Better, however, to apply some knowledge to the choice of prior probabilities (e.g. weather statistics over many years).

P(Weather) = <0.7, 0.2, 0.08, 0.02>

(probability distribution over the random variable Weather, with values in the order Sunny, Rainy, Cloudy, Snowy)

What about low probability events that have never happened or happen too infrequently to have accurate statistics?

Where do Probabilities Come From?

Frequentist view – probabilities come from experimentation

Objectivist view – probabilities are real values that frequentists approximate

Subjectivist view – probabilities reflect an agent's degrees of belief

Conditional Probabilities

conditional or posterior probabilities – probabilities with prior information (i.e. after evidence)

P(A|B) is the probability of A given that all we know is B.

P(Weather=Rainy|Month=April)

Is P(B ⇒ A) equal to P(A|B)?

Product Rule: P(A ∧ B) = P(A|B) P(B)
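One way to explore the question about P(B ⇒ A) is to compute both quantities from a small joint distribution. The sketch below uses a hypothetical joint over two Boolean variables (the four numbers are made up) and obtains P(A|B) by rearranging the product rule.

```python
from fractions import Fraction

# Hypothetical joint distribution over (A, B); the values are illustrative only.
joint = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(2, 10),
    (False, True): Fraction(3, 10),
    (False, False): Fraction(4, 10),
}


def prob(event):
    """Probability of the set of worlds (a, b) for which event(a, b) holds."""
    return sum(p for (a, b), p in joint.items() if event(a, b))


p_b = prob(lambda a, b: b)
p_a_and_b = prob(lambda a, b: a and b)
p_a_given_b = p_a_and_b / p_b                     # product rule rearranged
p_b_implies_a = prob(lambda a, b: (not b) or a)   # material implication

print(p_a_given_b, p_b_implies_a)
```

The implication is true in every world where B is false, so its probability also counts those worlds, while P(A|B) ignores them entirely.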

Axioms of Probability

All probabilities are between 0 and 1.

Necessarily true and false propositions have probability 1 and 0, respectively.

The probability of a disjunction is given by P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

From these three axioms, all other properties of probabilities can be derived.
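As one example of such a derivation, here is a sketch of how the complement rule follows from the axioms, taking B = ¬A in the disjunction axiom and using the fact that A ∨ ¬A is necessarily true while A ∧ ¬A is necessarily false:

```latex
\begin{align*}
P(A \lor \lnot A) &= P(A) + P(\lnot A) - P(A \land \lnot A) \\
1 &= P(A) + P(\lnot A) - 0 \\
P(\lnot A) &= 1 - P(A)
\end{align*}
```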

Why Are These Axioms Reasonable?

de Finetti’s betting argument: Put your money where your beliefs are.

If agent 1 has a set of beliefs inconsistent with the axioms of probability, then there exists a betting strategy for agent 2 that guarantees that agent 1 will lose money.

Practical results have made an even more persuasive argument (e.g. Pathfinder medical diagnosis)
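A minimal sketch of the betting argument, under assumed numbers: suppose agent 1 believes P(A) = 3/5 and also P(¬A) = 3/5, which violates the axioms because the two must sum to 1. Agent 1 then considers 3/5 a fair price for a ticket paying 1 if A holds, and likewise for ¬A. If agent 2 sells both tickets, agent 1 loses money whichever way A turns out. The beliefs and the ticket setup are hypothetical.

```python
from fractions import Fraction

# Hypothetical incoherent beliefs: P(A) + P(not A) = 6/5 instead of 1.
belief_a = Fraction(3, 5)
belief_not_a = Fraction(3, 5)


def agent1_net(a_is_true):
    """Agent 1's net winnings after buying both tickets at its own 'fair' prices."""
    cost = belief_a + belief_not_a
    payout = (1 if a_is_true else 0) + (0 if a_is_true else 1)  # exactly one ticket pays 1
    return payout - cost


outcomes = {world: agent1_net(world) for world in (True, False)}
print(outcomes)
```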

Joint Probability Distribution

Atomic event - an assignment of values to all variables; a specific state of the world

For simplicity, we'll treat all variables as Boolean (e.g. P(A), P(¬A), P(A ∧ B))

Joint probability P(X1,X2,…,Xn) - a function mapping atomic events to their probabilities

Joint Probability Example

What's the probability of having a cavity given the evidence of a toothache?

Like a lookup table for probabilities: can easily have too many entries to be practical ⇒ motivation for conditional probabilities
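The cavity question can be answered directly from a joint table by summing the matching entries. The sketch below assumes an illustrative joint over (Cavity, Toothache); the four numbers are placeholders chosen for the example, not values from these notes.

```python
from fractions import Fraction

# Illustrative joint distribution over (Cavity, Toothache); assumed numbers.
joint = {
    (True, True): Fraction(4, 100),
    (True, False): Fraction(6, 100),
    (False, True): Fraction(1, 100),
    (False, False): Fraction(89, 100),
}
assert sum(joint.values()) == 1  # a joint distribution must sum to one

# P(Toothache): sum every atomic event in which Toothache is true.
p_toothache = sum(p for (cavity, toothache), p in joint.items() if toothache)

# P(Cavity | Toothache) = P(Cavity AND Toothache) / P(Toothache)
p_cavity_given_toothache = joint[(True, True)] / p_toothache
print(p_cavity_given_toothache)  # 4/5

# Number of entries in a joint table over n Boolean variables.
entries_for_20_variables = 2 ** 20
```

With only two Boolean variables the table has four entries, but the count doubles with every added variable, which is why the full table quickly becomes impractical.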

Bayes’ Rule

Bayes’ Rule underlies all modern AI systems for probabilistic inference

two forms of product rule:

P(A ∧ B) = P(A|B) P(B)

P(A ∧ B) = P(B|A) P(A)

Now use these two to form an equation for:

P(B|A) = P(A|B) P(B) / P(A)

Applying Bayes’ Rule

What's Bayes' Rule good for? Need three terms to compute one!

Often you only have the three and need the fourth.

Example:

M = patient has meningitis

S = patient has stiff neck

Applying Bayes’ Rule (cont.)

Given:

P(S|M) = 0.5

P(M) = 1/50000

P(S) = 1/20

What's the probability that a patient with a stiff neck has meningitis?

P(M|S) = P(S|M) P(M) / P(S)

= 0.5 * (1/50000) / (1/20)

= 0.5 * 20 / 50000 = 10/50000 = 1/5000
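The same calculation as a small function, using exact fractions so the result can be compared with the hand computation. The function name `bayes` and its argument names are just labels chosen for this sketch.

```python
from fractions import Fraction


def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) P(B) / P(A)."""
    return p_a_given_b * p_b / p_a


# A = stiff neck (S), B = meningitis (M), with the numbers given above.
p_m_given_s = bayes(
    p_a_given_b=Fraction(1, 2),   # P(S|M)
    p_b=Fraction(1, 50000),       # P(M)
    p_a=Fraction(1, 20),          # P(S)
)
print(p_m_given_s)  # 1/5000
```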

Relative Likelihood

Now suppose we don't know the probability of a stiff neck, but we do know:

the probability of whiplash P(W) = (1/1000)

the probability of a stiff neck given whiplash P(S|W) = 0.8

What is the relative likelihood of meningitis and whiplash given a stiff neck?

Write Bayes' Rule for each and write P(M|S)/P(W|S)

P(M|S)/P(W|S) = (P(S|M)P(M)/P(S)) / (P(S|W)P(W)/P(S))

                             = (P(S|M) P(M))/(P(S|W) P(W))

                             = (0.5*(1/50000))/(0.8*(1/1000))

                             = .00001 / .0008 = 1/80
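A short sketch of the same ratio in code. Because P(S) cancels, the function never needs it; the function and parameter names are hypothetical.

```python
from fractions import Fraction


def relative_likelihood(p_e_given_h1, p_h1, p_e_given_h2, p_h2):
    """Return P(h1|e) / P(h2|e); the shared denominator P(e) cancels."""
    return (p_e_given_h1 * p_h1) / (p_e_given_h2 * p_h2)


ratio = relative_likelihood(
    Fraction(1, 2), Fraction(1, 50000),   # P(S|M), P(M)
    Fraction(4, 5), Fraction(1, 1000),    # P(S|W), P(W)
)
print(ratio)  # 1/80
```

So, given a stiff neck, whiplash is 80 times more likely than meningitis under these numbers.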

Normalization (cont.)

The main point, however, is that 1/P(S) is a normalizing constant that allows conditional terms to sum to one.

P(M|S) = α P(S|M) P(M)

where α = 1/P(S) is a normalizing constant such that P(M|S) + P(¬M|S) = 1
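A minimal sketch of normalization in code. The notes do not give P(S|¬M); the value used here is an assumption, back-derived from the given P(S) = 1/20 so that the result agrees with the earlier calculation of P(M|S).

```python
from fractions import Fraction


def normalize(unnormalized):
    """Scale the values by alpha = 1 / (their sum) so that they sum to one."""
    alpha = 1 / sum(unnormalized.values())
    return {hypothesis: alpha * value for hypothesis, value in unnormalized.items()}


p_m = Fraction(1, 50000)
p_s_given_m = Fraction(1, 2)
# ASSUMPTION: not stated in the notes; chosen to be consistent with P(S) = 1/20.
p_s_given_not_m = Fraction(4999, 99998)

posterior = normalize({
    "M": p_s_given_m * p_m,
    "not M": p_s_given_not_m * (1 - p_m),
})
print(posterior["M"])  # 1/5000
```

P(S) never appears explicitly: it is recovered as the sum of the two unnormalized terms.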

Conditional Independence

What's the probability of my having a cavity given that I stubbed my toe?

Often, there is no direct causal link between two things:

direct: burglary → alarm; cavity → toothache; disease → symptom; defect → failure

indirect: burglary → alarm company calls; cavity → dentist called about toothache; disease → symptom noted; defect → failure caused by failure

Conditional Independence (cont.)

The size of a table for a joint probability distribution can easily become enormous (exponential in number of variables).

How can one represent a joint probability distribution more compactly?

Belief Networks

Assume variables are conditionally independent by default.

Only represent direct causal links (conditional dependence) between random variables.

Belief network or Bayesian network:

set of random variables (nodes)

set of directed links (edges) indicating direct influence of one variable on another.

a table for each variable, supplying conditional probabilities of the variable for each assignment of its parents

no directed cycles (network is a DAG)

Choosing Variables

From Cooper [1984]:
"Metastatic cancer is a possible cause of a brain tumor and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumor."

What are our variables?

What are the direct causal influences between them?

Identifying Direct Influences

Let:

A = Patient has metastatic cancer

B = Patient has increased total serum calcium

C = Patient has a brain tumor

D = Patient lapses occasionally into coma

E = Patient has a severe headache

What are the direct causal links between these variables?
"Metastatic cancer is a possible cause of a brain tumor and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumor."

Draw the belief net.
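One reading of the Cooper description gives the links A → B, A → C, B → D, C → D, and C → E. The sketch below encodes that reading as a parent list, checks that it is a DAG, and compares how many numbers the network needs with the size of the full joint table. The edge set is an interpretation of the quoted text, not something stated explicitly in it.

```python
from graphlib import TopologicalSorter

# Each variable maps to its parents (one interpretation of the description).
parents = {
    "A": [],          # metastatic cancer
    "B": ["A"],       # increased total serum calcium
    "C": ["A"],       # brain tumor
    "D": ["B", "C"],  # coma
    "E": ["C"],       # severe headache
}

# static_order() raises graphlib.CycleError if the graph has a directed cycle.
order = list(TopologicalSorter(parents).static_order())

# Boolean variables: one CPT row (one number) per assignment of the parents.
cpt_rows = {var: 2 ** len(ps) for var, ps in parents.items()}
network_numbers = sum(cpt_rows.values())
joint_entries = 2 ** len(parents)

print(order[0], network_numbers, joint_entries)
```

Even for five variables the network needs about a third as many numbers as the joint table, and the gap widens as variables are added.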

Conditional Probability Tables (CPTs)

Probabilistic Reasoning

From the joint probability distribution, we can answer any probability query.

From the conditional (in)dependence assumptions and CPTs of the belief network, we can compute the joint probability distribution.

Therefore, a belief network has the probabilistic information to answer any probability query.

How do we compute the joint probability distribution from the belief network?

Computing Joint Probabilities with CPTs

Denote our set of variables as X1, X2, …, Xn.

The joint probability distribution P(X1,…,Xn) can be thought of as a table with entries P(X1=x1,…,Xn=xn) or simply P(x1, …, xn) where x1,…,xn is a possible assignment to all variables.

Using CPTs,
P(x1, …, xn) = P(x1|ParentValues(x1)) * … * P(xn|ParentValues(xn))
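The product formula can be sketched as follows for a five-variable Boolean network with links A → B, A → C, B → D, C → D, C → E. Both the structure and the CPT numbers are placeholders assumed for this example; they are not taken from the notes.

```python
from itertools import product

parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("C",)}

# cpt[var][parent values] = P(var = True | parent values). Placeholder numbers.
cpt = {
    "A": {(): 0.2},
    "B": {(True,): 0.8, (False,): 0.2},
    "C": {(True,): 0.2, (False,): 0.05},
    "D": {(True, True): 0.8, (True, False): 0.8,
          (False, True): 0.8, (False, False): 0.05},
    "E": {(True,): 0.8, (False,): 0.6},
}


def joint_probability(assignment):
    """Multiply one CPT entry per variable: P(x_i | values of x_i's parents)."""
    prob = 1.0
    for var, ps in parents.items():
        p_true = cpt[var][tuple(assignment[p] for p in ps)]
        prob *= p_true if assignment[var] else 1.0 - p_true
    return prob


variables = list(parents)
all_true = joint_probability(dict.fromkeys(variables, True))

# Sanity check: the 2^5 atomic events should sum to one.
total = sum(
    joint_probability(dict(zip(variables, values)))
    for values in product([True, False], repeat=len(variables))
)
print(all_true, total)
```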

Joint Probability Computation Example

Markov Blanket

Suppose we want to know the probability of each variable's values given all other variable values.

Recall P(x1, …, xn) = P(x1|ParentValues(x1)) * … * P(xn|ParentValues(xn))

In computing P(x1, …, xi, …, xn), which of the terms in the above product involve xi?

How would you describe the variables which appear in those terms? (see example)

These neighboring variables are called Xi's Markov blanket.
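The terms involving xi are Xi's own CPT entry (which mentions Xi's parents) and the CPT entries of Xi's children (which mention the children and their other parents). A minimal sketch that collects those variables from a parent list, using an assumed five-variable network with links A → B, A → C, B → D, C → D, C → E:

```python
def markov_blanket(var, parents):
    """Parents, children, and the children's other parents of var."""
    children = [v for v, ps in parents.items() if var in ps]
    blanket = set(parents[var]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(var)  # var itself is a parent of each of its children
    return blanket


# Assumed example structure: each variable maps to its parents.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}

print(sorted(markov_blanket("B", parents)))  # ['A', 'C', 'D']
print(sorted(markov_blanket("E", parents)))  # ['C']
```

Note that C is in B's blanket only because both are parents of D.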

Markov Blanket (cont.)

Since all other terms in the product (from CPTs other than that of Xi and its children) do not include Xi,

they are constant relative to Xi, and

they can be replaced by a normalizing factor.

(Proof?)

(Worked example)
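A numerical sketch of this claim: compute P(Xi = true | all other variables) once by normalizing the full joint, and once using only the CPT of Xi and the CPTs of its children. The two should agree because every other factor is the same for both values of Xi and cancels in the normalization. The network structure and CPT numbers below are placeholders assumed for the example.

```python
parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("C",)}

# cpt[var][parent values] = P(var = True | parent values). Placeholder numbers.
cpt = {
    "A": {(): 0.2},
    "B": {(True,): 0.8, (False,): 0.2},
    "C": {(True,): 0.2, (False,): 0.05},
    "D": {(True, True): 0.8, (True, False): 0.8,
          (False, True): 0.8, (False, False): 0.05},
    "E": {(True,): 0.8, (False,): 0.6},
}


def local_term(var, world):
    """The single factor P(var's value | its parents' values) in one world."""
    p_true = cpt[var][tuple(world[p] for p in parents[var])]
    return p_true if world[var] else 1.0 - p_true


def conditional_full(var, assignment):
    """P(var = True | all others), normalizing the full product of CPT terms."""
    scores = {}
    for value in (True, False):
        world = {**assignment, var: value}
        score = 1.0
        for v in parents:
            score *= local_term(v, world)
        scores[value] = score
    return scores[True] / (scores[True] + scores[False])


def conditional_blanket(var, assignment):
    """Same quantity, using only the CPTs of var and of var's children."""
    children = [v for v, ps in parents.items() if var in ps]
    scores = {}
    for value in (True, False):
        world = {**assignment, var: value}
        score = local_term(var, world)
        for child in children:
            score *= local_term(child, world)
        scores[value] = score
    return scores[True] / (scores[True] + scores[False])


others = {"A": True, "B": False, "C": True, "D": True, "E": False}
full = conditional_full("C", others)
local = conditional_blanket("C", others)
print(full, local)
```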

Joint Probability Computation Example

Utility Theory

Utility of Money

Decision Networks

Value of Information



Normalization

Write Bayes' Rule for P(M|S)

Now write Bayes' Rule for P(¬M|S)

We know P(M|S) + P(¬M|S) = 1

Use these to write a new expression for P(S)

Substitute this expression in Bayes' Rule for P(M|S)

One does not need P(S) directly.
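Following those steps, one way the derivation can be written out is:

```latex
\begin{align*}
P(M \mid S) &= \frac{P(S \mid M)\,P(M)}{P(S)}, \qquad
P(\lnot M \mid S) = \frac{P(S \mid \lnot M)\,P(\lnot M)}{P(S)} \\
1 &= P(M \mid S) + P(\lnot M \mid S)
   = \frac{P(S \mid M)\,P(M) + P(S \mid \lnot M)\,P(\lnot M)}{P(S)} \\
P(S) &= P(S \mid M)\,P(M) + P(S \mid \lnot M)\,P(\lnot M) \\
P(M \mid S) &= \frac{P(S \mid M)\,P(M)}{P(S \mid M)\,P(M) + P(S \mid \lnot M)\,P(\lnot M)}
\end{align*}
```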

