CS 371: Introduction to Artificial Intelligence
 Probabilistic Reasoning

Introduction to Uncertain Reasoning
 Qualification problem in FOL (e.g. car doesn't start). Problems:
 - no exceptions? ⇒ big rules!
 - no knowledge of the likelihood of each exception
 - no complete theory (e.g. medical science)
 - even given complete rules, sometimes we only have partial evidence

Probability to the Rescue!
 Possible solution: probabilities.
 - summarize uncertainty and give likelihood information
 - an incomplete theory can be refined
 - can handle partial evidence
 - but... rules can still be big ⇒ stay tuned for simplifying assumptions
 How might an agent use probabilities to choose actions given percepts?

Decision-Theoretic Agent
 Iterate:
 - update evidence with the current percept
 - compute outcome probabilities for actions
 - select the action with maximum expected utility given the probable outcomes
 utility - the quality of being useful
 decision theory = probability theory + utility theory

Prior Probabilities
 Unconditional or prior probabilities - probabilities without prior information (i.e. before evidence). P(A) is the probability of A in the absence of other information. Suppose we have a discrete random variable Weather that can take on 4 values: Sunny, Rainy, Cloudy, or Snowy. How do we form prior probabilities?

Forming Prior Probabilities
 In the absence of any information at all, we might say all outcomes are equally likely. Better, however, to apply some knowledge to the choice of prior probabilities (e.g. weather statistics over many years). P(Weather) = <0.7, 0.2, 0.08, 0.02> (a probability distribution over the random variable Weather). What about low-probability events that have never happened, or that happen too infrequently to have accurate statistics?
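As a minimal sketch of forming priors from long-run statistics, the counts below are illustrative (chosen to reproduce the slide's distribution), not real weather data:

```python
# Hypothetical long-run weather counts; normalizing them yields a prior
# distribution P(Weather) over the four outcomes.
counts = {"Sunny": 700, "Rainy": 200, "Cloudy": 80, "Snowy": 20}
total = sum(counts.values())
prior = {w: n / total for w, n in counts.items()}

print(prior)  # {'Sunny': 0.7, 'Rainy': 0.2, 'Cloudy': 0.08, 'Snowy': 0.02}
# A valid distribution must sum to 1.
assert abs(sum(prior.values()) - 1.0) < 1e-9
```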

Where do Probabilities Come From?
 - Frequentist view - probabilities come from experimentation
 - Objectivist view - probabilities are real values that frequentists approximate
 - Subjectivist view - probabilities reflect an agent's degrees of belief

Conditional Probabilities
 Conditional or posterior probabilities - probabilities with prior information (i.e. after evidence). P(A|B) is the probability of A given that all we know is B. Example: P(Weather=Rainy|Month=April). Is P(B⇒A) equal to P(A|B)? Product Rule: P(A∧B) = P(A|B)P(B)

Axioms of Probability
 All probabilities are between 0 and 1. Necessarily true and false propositions have probability 1 and 0, respectively. The probability of a disjunction is given by P(A∨B) = P(A) + P(B) - P(A∧B). From these three axioms, all other properties of probabilities can be derived.
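The disjunction axiom can be checked numerically on a small joint distribution; the four entries below are made-up values that sum to 1:

```python
# P(A=a, B=b) for two Boolean variables; values are illustrative.
joint = {
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}
p_a = sum(p for (a, b), p in joint.items() if a)            # P(A)
p_b = sum(p for (a, b), p in joint.items() if b)            # P(B)
p_ab = joint[(True, True)]                                  # P(A ∧ B)
p_a_or_b = sum(p for (a, b), p in joint.items() if a or b)  # P(A ∨ B)

# The disjunction axiom: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
assert abs(p_a_or_b - (p_a + p_b - p_ab)) < 1e-9
```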

Why Are These Axioms Reasonable?
 de Finetti's betting argument: put your money where your beliefs are. If agent 1 has a set of beliefs inconsistent with the axioms of probability, then there exists a betting strategy for agent 2 that guarantees that agent 1 will lose money. Practical results have made an even more persuasive argument (e.g. Pathfinder medical diagnosis).

Joint Probability Distribution
 Atomic event - an assignment of values to all variables; a specific state of the world. For simplicity, we'll treat all variables as Boolean (e.g. P(A), P(¬A), P(A∧B)). Joint probability distribution P(X1,X2,...,Xn) - a function mapping atomic events to probabilities.

Joint Probability Example
 What's the probability of having a cavity given the evidence of a toothache? The joint distribution is like a lookup table for probabilities, but it can easily have too many entries to be practical ⇒ motivation for conditional probabilities.

Bayes’ Rule
 Bayes' Rule underlies all modern AI systems for probabilistic inference. Two forms of the product rule: P(A∧B) = ___ and P(A∧B) = ___. Now use these two to form an equation for: P(B|A) = ___

Bayes’ Rule
 Bayes' Rule underlies all modern AI systems for probabilistic inference. Two forms of the product rule: P(A∧B) = P(A|B) P(B) and P(A∧B) = P(B|A) P(A). Now use these two to form an equation for: P(B|A) = P(A|B) P(B) / P(A)

Applying Bayes’ Rule
 What's Bayes' Rule good for? Need three terms to compute one! Often you have exactly those three and need the fourth. Example: M = patient has meningitis, S = patient has stiff neck.

Applying Bayes’ Rule (cont.)
 Given: P(S|M) = 0.5, P(M) = 1/50000, P(S) = 1/20. What's the probability that a patient with a stiff neck has meningitis?

Applying Bayes’ Rule (cont.)
 Given: P(S|M) = 0.5, P(M) = 1/50000, P(S) = 1/20. What's the probability that a patient with a stiff neck has meningitis?
 P(M|S) = P(S|M) P(M) / P(S) = 0.5 * (1/50000) / (1/20) = 0.5 * 20 / 50000 = 10/50000 = 1/5000
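The same computation, carried out in exact arithmetic so the 1/5000 answer falls out directly:

```python
from fractions import Fraction

# The three given quantities from the slide.
p_s_given_m = Fraction(1, 2)      # P(S|M)
p_m = Fraction(1, 50000)          # P(M)
p_s = Fraction(1, 20)             # P(S)

# Bayes' Rule: P(M|S) = P(S|M) P(M) / P(S)
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # 1/5000
```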

Relative Likelihood
 Now suppose we don't know the probability of a stiff neck, but we do know: the probability of whiplash, P(W) = 1/1000, and the probability of a stiff neck given whiplash, P(S|W) = 0.8. What is the relative likelihood of meningitis and whiplash given a stiff neck? Write Bayes' Rule for each and form P(M|S)/P(W|S).

Relative Likelihood
 Now suppose we don't know the probability of a stiff neck, but we do know: the probability of whiplash, P(W) = 1/1000, and the probability of a stiff neck given whiplash, P(S|W) = 0.8. What is the relative likelihood of meningitis and whiplash given a stiff neck? Write Bayes' Rule for each and form P(M|S)/P(W|S).
 P(M|S)/P(W|S) = (P(S|M) P(M) / P(S)) / (P(S|W) P(W) / P(S)) = (P(S|M) P(M)) / (P(S|W) P(W)) = (0.5 * (1/50000)) / (0.8 * (1/1000)) = 0.00001 / 0.0008 = 1/80
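Note that P(S) cancels in the ratio, so the relative likelihood needs only the four given quantities:

```python
from fractions import Fraction

p_s_given_m, p_m = Fraction(1, 2), Fraction(1, 50000)   # meningitis terms
p_s_given_w, p_w = Fraction(4, 5), Fraction(1, 1000)    # whiplash terms

# P(M|S)/P(W|S) = (P(S|M) P(M)) / (P(S|W) P(W)); P(S) has cancelled.
ratio = (p_s_given_m * p_m) / (p_s_given_w * p_w)
print(ratio)  # 1/80
```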

Normalization
 Write Bayes' Rule for P(M|S). Now write Bayes' Rule for P(¬M|S). We know P(M|S) + P(¬M|S) = 1. Use these to write a new expression for P(S). Substitute this expression into Bayes' Rule for P(M|S). One does not need P(S) directly.

Normalization (cont.)
 The main point, however, is that 1/P(S) is a normalizing constant that allows the conditional terms to sum to one. P(M|S) = α P(S|M) P(M), where α = 1/P(S) is a normalizing constant such that P(M|S) + P(¬M|S) = 1.
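A sketch of normalization in code: compute both posteriors up to the constant α, then scale so they sum to 1. The value P(S|¬M) = 0.05 below is an assumed, illustrative number (it is not given on the slides):

```python
p_s_given_m, p_m = 0.5, 1 / 50000
p_s_given_not_m = 0.05            # assumption, for illustration only

# Unnormalized terms: proportional to P(M|S) and P(¬M|S) respectively.
unnorm = [p_s_given_m * p_m,
          p_s_given_not_m * (1 - p_m)]

alpha = 1 / sum(unnorm)           # α = 1/P(S), never computed directly
p_m_given_s, p_not_m_given_s = [alpha * u for u in unnorm]

# The normalized conditional terms sum to one.
assert abs(p_m_given_s + p_not_m_given_s - 1.0) < 1e-9
```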

Conditional Independence
 What's the probability of my having a cavity given that I stubbed my toe? Often there is no direct causal link between two things.
 Direct:
 - burglary → alarm
 - cavity → toothache
 - disease → symptom
 - defect → failure
 Indirect:
 - burglary → alarm company calls
 - cavity → dentist called about toothache
 - disease → symptom noted
 - defect → failure caused by defect

Conditional Independence (cont.)
 The size of a table for a joint probability distribution can easily become enormous (exponential in number of variables). How can one represent a joint probability distribution more compactly?

Belief Networks
 Assume variables are conditionally independent by default; only represent direct causal links (conditional dependence) between random variables. A belief network or Bayesian network has:
 - a set of random variables (nodes)
 - a set of directed links (edges) indicating direct influence of one variable on another
 - a table for each variable, supplying conditional probabilities of the variable for each assignment of values to its parents
 - no directed cycles (the network is a DAG)

Choosing Variables
 From Cooper [1984]: "Metastatic cancer is a possible cause of a brain tumor and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumor." What are our variables? What are the direct causal influences between them?

Identifying Direct Influences
 Let:
 A = Patient has metastatic cancer
 B = Patient has increased total serum calcium
 C = Patient has a brain tumor
 D = Patient lapses occasionally into coma
 E = Patient has a severe headache
 What are the direct causal links between these variables? "Metastatic cancer is a possible cause of a brain tumor and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumor." Draw the belief net.
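The structure described in the quote can be written as a parents map, one entry per variable (structure only; the text gives no CPT numbers, so none are invented here):

```python
# Each variable maps to the list of its parents (direct causal influences).
parents = {
    "A": [],          # metastatic cancer (no parents)
    "B": ["A"],       # increased total serum calcium <- cancer
    "C": ["A"],       # brain tumor <- cancer
    "D": ["B", "C"],  # coma <- serum calcium, brain tumor
    "E": ["C"],       # severe headache <- brain tumor
}

# Sanity check that the network is a DAG: no variable on its own ancestor path.
def check_acyclic(v, path=()):
    assert v not in path, f"cycle through {v}"
    for p in parents[v]:
        check_acyclic(p, path + (v,))

for v in parents:
    check_acyclic(v)
```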

Conditional Probability Tables (CPTs)

Probabilistic Reasoning
 From the joint probability distribution, we can answer any probability query. From the conditional (in)dependence assumptions and CPTs of the belief network, we can compute the joint probability distribution. Therefore, a belief network has the probabilistic information to answer any probability query. How do we compute the joint probability distribution from the belief network?

Computing Joint Probabilities with CPTs
 Denote our set of variables as X1, X2, ..., Xn. The joint probability distribution P(X1,...,Xn) can be thought of as a table with entries P(X1=x1,...,Xn=xn), or simply P(x1,...,xn), where x1,...,xn is a possible assignment to all variables. Using CPTs:
 P(x1,...,xn) = P(x1|ParentValues(X1)) * ... * P(xn|ParentValues(Xn))
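This product can be sketched on a tiny hypothetical two-variable network, Rain → WetGrass, with made-up CPT numbers:

```python
p_rain = {True: 0.2, False: 0.8}                    # CPT for Rain (no parents)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},  # CPT for WetGrass,
                    False: {True: 0.2, False: 0.8}} # indexed by parent value

def joint(rain, wet):
    # P(rain, wet) = P(rain) * P(wet | rain): one CPT factor per variable.
    return p_rain[rain] * p_wet_given_rain[rain][wet]

print(joint(True, True))  # 0.2 * 0.9 ≈ 0.18
# The joint entries over all atomic events sum to 1.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
assert abs(total - 1.0) < 1e-9
```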

Joint Probability Computation Example

Markov Blanket
 Suppose we want to know the probability of each variable's values given all other variables' values. Recall P(x1,...,xn) = P(x1|ParentValues(X1)) * ... * P(xn|ParentValues(Xn)). In computing P(x1,...,xi,...,xn), which of the terms in the above product involve xi? How would you describe the variables that appear in those terms? (see example) These neighboring variables - Xi's parents, its children, and its children's other parents - are called Xi's Markov blanket.

Markov Blanket (cont.)
 Since all other terms in the product (from CPTs other than that of Xi and its children) do not include Xi, they are constant relative to Xi, and they can be replaced by a normalizing factor. (Proof?) (Worked example)
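A small sketch of this idea for the chain A → B → C, with made-up CPT numbers: P(B | a, c) is computed from only the factors that mention B (its own CPT and its child's), with everything else absorbed into the normalizing constant.

```python
# CPTs indexed as p[parent_value][own_value]; numbers are illustrative.
p_b_given_a = {True: {True: 0.7, False: 0.3},
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.8, False: 0.2},
               False: {True: 0.4, False: 0.6}}

def p_b_given_rest(a, c):
    # Only P(b|a) and P(c|b) involve b; P(a) is constant in b and cancels.
    unnorm = {b: p_b_given_a[a][b] * p_c_given_b[b][c] for b in (True, False)}
    alpha = 1 / sum(unnorm.values())    # normalizing factor
    return {b: alpha * u for b, u in unnorm.items()}

dist = p_b_given_rest(True, True)
assert abs(sum(dist.values()) - 1.0) < 1e-9
```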

Joint Probability Computation Example

Utility Theory

Utility of Money

Decision Networks

Value of Information