Until now we have seen some basic properties of probability. In this part of the course we will learn how to account for partial information when calculating probabilities.
Suppose that you are playing a game with two dice and you want to bet that the first die will show a 6. A friend of yours rolls the dice and tells you that the sum of the two dice is 7. What is the probability that the first die shows a 6, given that the sum of the two dice is 7? Would this additional information help you decide whether to bet on the first die showing a 6?
This is an example of a conditional probability problem.
Definition 7.1 Let A and B be two events with P(B) > 0. The conditional probability of A given B is defined as
P(A|B) = \frac{P(A \cap B)}{P(B)}
We read P(A | B) as “the probability of A given B”.
As with the probability laws we have seen before, when the outcomes in the sample space are equally likely, we simply count the outcomes of A that are also in B and divide by the number of outcomes in B.
In the dice example there are 6^2 = 36 equally likely outcomes in the sample space, each a pair (first die, second die). Let A be the event that the first die shows a 6 and let B be the event that the sum of the two dice is 7:
B = \{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)\}
You now know that one of the outcomes in B has occurred (otherwise the sum would not be equal to 7). There are 6 outcomes in B and exactly one of them is also in A, so the intersection of A and B is
A \cap B = \{(6, 1)\}
Therefore, the conditional probability of A given B is 1/6.
What about the unconditional probability of A?
P(A) = \frac{6}{36} = \frac{1}{6}
In this case the conditional probability of A given B equals the unconditional probability of A, meaning that knowing the sum was 7 does not help you decide whether to bet on the first die showing a 6 (it does not change the odds). We say that the two events are independent (under P).
Suppose, instead, that your friend tells you that the sum of the two dice is 10. What is the probability that the first die shows a 6 given that the sum of the two dice is 10? Let us denote the event that the sum is 10 by C.
In this case the event C consists of the following elements:
C = \{(4, 6), (5, 5), (6, 4)\}
The event A is the same as before, but now we are looking for the outcomes that are in both A and C. There is only one such outcome:
A \cap C = \{(6, 4)\}
Therefore, the conditional probability of A given C is
P(A|C) = \frac{\text{number of elements in } A \cap C}{\text{number of elements in } C} = \frac{1}{3}
So this time the knowledge that the sum is 10 is informative.
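Conditional probabilities with equally likely outcomes are easy to verify by brute-force enumeration. The following is a minimal sketch (not part of the original example) that lists all 36 outcomes and counts them directly:

# Enumerate the 36 equally likely outcomes of rolling two dice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# A: the first die shows 6, B: the sum is 7, C: the sum is 10
A = {o for o in outcomes if o[0] == 6}
B = {o for o in outcomes if o[0] + o[1] == 7}
C = {o for o in outcomes if o[0] + o[1] == 10}

# Conditional probabilities by counting outcomes
print(len(A & B) / len(B))  # P(A|B) = 1/6, the same as P(A)
print(len(A & C) / len(C))  # P(A|C) = 1/3, larger than P(A)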
7.1 Independence of Two Events
Definition 7.2 (Independence of Two Events) Two events A and B are said to be independent if
P(A \cap B) = P(A) \cdot P(B)
Definition 7.2 is equivalent to the conditions in the next exercise, but it has the advantage of being symmetric in A and B and of not requiring the conditioning event to have positive probability:
Exercise 7.1 (Independence and Conditional Probability) Show that, when P(B) > 0 and P(A) > 0 respectively, the definition of independence in Definition 7.2 is equivalent to each of the following conditions:
P(A|B) = P(A)
P(B|A) = P(B)
Exercise 7.2 (Implications of Independence) Show that if A and B are independent, then the following statements are true:
A and B^c are independent
A^c and B are independent
A^c and B^c are independent
Solution (click to expand)
From the definition of independence we have
P(A \cap B) = P(A) \cdot P(B)
A and B^c are independent
If A and B are independent, then it must be the case that
P(A \cap B^c) = P(A) \cdot P(B^c)
To show this, we use the fact that B and B^c are disjoint and that B \cup B^c = \Omega. This means that A = (A \cap B) \cup (A \cap B^c), where the two sets on the right are disjoint, so by additivity
P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A) \cdot P(B) = P(A) \cdot (1 - P(B)) = P(A) \cdot P(B^c)
This is exactly the condition of independence for A and B^c. The other two statements follow in the same way, exchanging the roles of A and B (and applying the first result twice for A^c and B^c).
Exercise 7.3 (Disjoint Events and Independence) Let A and B be two events such that P(A) > 0 and P(B) > 0. Show that if A and B are disjoint, then they are not independent.
7.2 Mutual Independence
The definition of independence can be extended to more than two events. However, simply requiring that P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C) does not ensure that A, B, and C are pairwise independent in the sense of Definition 7.2. Conversely, if the events are only pairwise independent, there is no guarantee that the probability of their intersection equals the product of their probabilities.
Definition 7.3 (Mutual Independence) Events A_1, A_2, \ldots, A_n are said to be mutually independent if for every subset I \subseteq \{1, 2, \ldots, n\} we have
P\left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} P(A_i)
Exercise 7.4 (Mutual Independence and Pairwise Independence) Consider a sample space with 9 equally likely outcomes, each consisting of a triple (i, j, k) where i, j, k \in \{1, 2, 3\}.
Consider the following events A_i = \{\text{i-th place of the outcome is 1}\}. Are the events A_1, A_2, and A_3 pairwise independent? Are they mutually independent?
7.3 Bayes’ Theorem
We have already seen that the conditional probability of A given B is defined as
P(A|B) = \frac{P(A \cap B)}{P(B)}
In the same way, P(B|A) = P(A \cap B) / P(A), so that P(A \cap B) = P(B|A) \cdot P(A). Substituting this into the definition of P(A|B) gives
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
This last equation is known as Bayes’ Theorem, and it turns out that it holds in a more general form than the one derived here.
Theorem 7.1 (Bayes’ Theorem) Let A and B be two events such that P(A) > 0 and P(B) > 0. Then
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
Furthermore, let A_1, A_2, \ldots, A_n be a partition of the sample space \Omega. A partition is a collection of disjoint events whose union is the entire sample space:
A_i \cap A_j = \emptyset \quad \text{for all } i \neq j \quad \text{and} \quad \bigcup_{i=1}^n A_i = \Omega
If P(A_i) > 0 for every i, then for each k
P(A_k|B) = \frac{P(B|A_k) \cdot P(A_k)}{\sum_{i=1}^n P(B|A_i) \cdot P(A_i)}
To understand the derivation of Bayes’ Theorem in this general form, first see that it holds for a very simple partition. The collection of events A and A^c is a partition of the sample space. In this case, Bayes’ Theorem becomes
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|A^c) \cdot P(A^c)}
Exercise 7.5 (Total Probability and Complements) Let A and B be some events such that P(A) > 0 and P(B) > 0. Show that
P(B) = P(B|A) \cdot P(A) + P(B|A^c) \cdot P(A^c)
Note that the sets A and A^c form a partition of the sample space because A \cap A^c = \emptyset and A \cup A^c = \Omega.
Theorem 7.2 (Total Probability Theorem) Let A_1, A_2, \ldots, A_n be a partition of the sample space \Omega. For any event B such that P(B) > 0, we have
P(B) = \sum_{i=1}^n P(B|A_i) \cdot P(A_i)
Proof. Because B, like any event, is a subset of \Omega, the intersection of B with the sample space is B itself:
B = B \cap \Omega
Because the A_i are a partition of the sample space, their union is the sample space:
\Omega = \bigcup_{i=1}^n A_i
In this way we can represent B as the union of the intersections of B with the A_i:
B = B \cap \Omega = B \cap \left( \bigcup_{i=1}^n A_i \right) = \bigcup_{i=1}^n (B \cap A_i)
The last step follows from the distributive property of the intersection over the union. The sets B \cap A_i are disjoint because the A_i are disjoint, so we can use the additivity of the probability measure and the definition of conditional probability to write
P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B|A_i) \cdot P(A_i)
Exercise 7.6 (Roadside Drug Tests) Imagine a city called Virtue City with 10,000 drivers on the road on any given day. Out of these drivers, 100 use drugs while driving.
The police department introduces a new test for narcotics that has a 95 percent probability of correctly identifying a person who has used narcotics (true positive). There is a one percent probability of falsely identifying a person who has not used narcotics as having used them (false positive).
The police stop a driver, his test comes back positive, and the driver’s car is confiscated. The police claim that there is a 95 percent probability that the driver is a drug user. The driver objects and says that the probability of him being a drug user is much lower. Who is right?
Approach this problem in two ways:
Counting
How many people will have a positive test result if all 10,000 drivers are tested?
What proportion of the positive tests belong to actual drug users?
Apply Bayes’ Theorem
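As a cross-check, here is a minimal sketch of both approaches in code, using only the numbers stated in the exercise (10,000 drivers, 100 drug users, a 95 percent true positive rate, and a 1 percent false positive rate):

# Counting: expected test results if all 10,000 drivers are tested
users, non_users = 100, 9_900
true_positives = 0.95 * users        # drug users correctly flagged
false_positives = 0.01 * non_users   # non-users incorrectly flagged

# Proportion of positive tests that belong to actual drug users
print(true_positives / (true_positives + false_positives))

# Bayes' Theorem, with the Total Probability Theorem in the denominator
p_user = users / (users + non_users)
p_pos_given_user, p_pos_given_non_user = 0.95, 0.01
p_user_given_pos = (p_pos_given_user * p_user) / (
    p_pos_given_user * p_user + p_pos_given_non_user * (1 - p_user)
)
print(p_user_given_pos)  # roughly 0.49, far below the claimed 0.95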
Exercise 7.7 (Spam Filtering) If you have ever used email, then you have probably encountered spam emails. Spam emails are unsolicited messages that are commonly sent in bulk to many recipients. Most email service providers try to protect their users from spam by filtering out these messages. One way to do this is to train an algorithm to recognize spam emails based on the content of the email.
Suppose that the filter receives an email containing the word “Inheritance”. Suppose that 5 percent of spam emails contain this word, while it is only present in 0.1 percent of non-spam emails. Suppose that 80 percent of the emails that the filter receives are spam.
What is the probability that an email containing the word “Inheritance” is spam?
Exercise 7.8 (Decomposition of the Total Probability) Prove the following statements:
P(B) = 1 \implies P(A|B) = P(A) for any event A.
A \subset B \implies P(B|A) = 1 and P(A|B) = P(A)/P(B)
If A \cap B = \emptyset, then
P(A | A \cup B) = \frac{P(A)}{P(A) + P(B)}
Theorem 7.3 (Chain Rule) Let A, B, and C be events such that P(C) > 0 and P(B \cap C) > 0. Then
P(A \cap B \cap C) = P(A|B\cap C) P(B|C) P(C)
Proof (click to expand)
The proof requires nothing more than the definition of conditional probability. Let D = B \cap C. Then we can write the intersection of the three events as
P(A \cap B \cap C) = P(A \cap D) = P(A|D) P(D) = P(A|B \cap C) P(B \cap C)
Apply the definition of conditional probability to the last term:
P(A|B \cap C) P(B \cap C) = P(A|B \cap C) P(B|C) P(C)
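The same argument, applied repeatedly, extends the chain rule to any finite number of events (with the analogous positivity conditions):
P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1) \cdot P(A_2|A_1) \cdot P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap A_2 \cap \ldots \cap A_{n-1})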
7.4 An Application of the Chain Rule
Recent advances in natural language processing have made possible applications such as chatbots that can interact with users in a natural way. Earlier probabilistic language models were based on the probability of a sentence.
Consider a simple sentence
“I am happy for you”.
One way to estimate the probability of this sentence is to count how many times it appears in a large corpus of text. However, this approach is very limited because long sequences of words are unlikely to appear in any corpus more than once and may not appear at all.
One way to work around this limitation is to express the probability of a sentence (joint occurrence of all the words in the sentence) as the product of the conditional probabilities of each word given the previous words using the chain rule.
The conditional probabilities become much easier to estimate if we simplify further and condition each word only on the word immediately before it. We then simply need to count how many times each word appears after the previous word in a large corpus of text.
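For the example sentence, the chain rule in its general form gives
P(\text{I am happy for you}) = P(\text{I}) \cdot P(\text{am}|\text{I}) \cdot P(\text{happy}|\text{I am}) \cdot P(\text{for}|\text{I am happy}) \cdot P(\text{you}|\text{I am happy for})
and the simplification of conditioning only on the previous word replaces this product by P(\text{I}) \cdot P(\text{am}|\text{I}) \cdot P(\text{happy}|\text{am}) \cdot P(\text{for}|\text{happy}) \cdot P(\text{you}|\text{for}).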
%pip install nltk

import nltk

nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.corpus import gutenberg
from nltk.probability import FreqDist
from nltk.util import ngrams

# Import Alice in Wonderland from the Gutenberg corpus
alice = gutenberg.raw('carroll-alice.txt')
# Use FreqDist to find the frequency of the specific sequence
sequence = "Alice said hello"

# Create n-grams from the tokens
sequence_tokens = nltk.word_tokenize(sequence)
tokens = nltk.word_tokenize(alice)
ngrams_list = list(ngrams(tokens, len(sequence_tokens)))

fdist = FreqDist(ngrams_list)
sequence_frequency = fdist[tuple(sequence_tokens)]

print(f"The sequence '{sequence}' appears {sequence_frequency} times in the text.")
The sequence 'Alice said hello' appears 0 times in the text.
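Estimating a conditional probability for a single word given the previous word is much easier. A minimal sketch, reusing the tokens variable from the cell above, counts bigrams and unigrams to estimate, for example, P(\text{said}|\text{Alice}):

# Count bigrams (pairs of consecutive tokens) and single tokens
bigram_fdist = FreqDist(ngrams(tokens, 2))
unigram_fdist = FreqDist(tokens)

# Estimate P("said" | "Alice") as count("Alice said") / count("Alice")
p_said_given_alice = bigram_fdist[("Alice", "said")] / unigram_fdist["Alice"]
print(f"Estimated P('said' | 'Alice') = {p_said_given_alice:.4f}")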
Exercise 7.9 (Quality Control) A factory produces light bulbs. To ensure the quality of the bulbs shipped to customers, the factory has a quality control system that checks a sample of the bulbs before they are shipped. Consider a shipment of 100 bulbs. You take a sample (without replacement) of 5 bulbs and stop the shipment if at least one of the sampled bulbs is defective. Suppose that a shipment contains 20 defective bulbs. What is the probability that the shipment will be stopped?
# Simulate a sample from a shipment of 100 light bulbs
import numpy as np

sample_size = 5
repetitions = 10

# 0 represents a good light bulb, 1 represents a defective light bulb
shipment = np.array([0] * 80 + [1] * 20)

print("The shipment of light bulbs")
print(shipment)
print("\n")

rejections = 0

print("Some of the selected samples")
for i in range(repetitions):
    sample = np.random.choice(shipment, size=sample_size, replace=False)
    # Print the sample once every 1000 repetitions
    if i % 1000 == 0:
        print(sample)
    if np.sum(sample) > 0:
        rejections += 1

print("\n")
print(f"The proportion of shipments rejected is {rejections / repetitions}")
We need to compute the probability that at least one of the sampled bulbs is defective, given that there are 20 defective bulbs in the shipment of 100 bulbs. We can define the event A as the event that at least one of the sampled bulbs is defective.
A = \text{at least one defective bulb in the sample}
Let A_i, for i = 1, \ldots, 5, be the event that the i-th bulb in the sample is defective. Then we have
A = A_1 \cup A_2 \cup \ldots \cup A_{5}
It is easy to compute the probability of the complement of A using de Morgan’s law:
A^c = A_1^c \cap A_2^c \cap \ldots \cap A_5^c
that is, the event that none of the sampled bulbs is defective. By the chain rule,
P(A^c) = \frac{80}{100} \cdot \frac{79}{99} \cdot \frac{78}{98} \cdot \frac{77}{97} \cdot \frac{76}{96} \approx 0.319
so the probability that the shipment is stopped is P(A) = 1 - P(A^c) \approx 0.681.
# Compute the probability of a shipment being rejected
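# A minimal sketch of the exact calculation (assuming, as in the exercise, 20 defective
# bulbs out of 100 and a sample of 5 drawn without replacement)
from math import prod

# P(A^c): no defective bulb in the sample, computed with the chain rule
p_no_defective = prod((80 - i) / (100 - i) for i in range(5))

print(f"The probability that the shipment is stopped is {1 - p_no_defective:.4f}")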
Exercise 7.10 (A Bifurcation in the Road) A tourist stands at a bifurcation in the road and does not know which is the correct way to go. He asks two passersby. Each passerby points him in the correct direction with probability 0.7, independently of the other. We want to propose a strategy for the tourist to decide which way to go.
Strategy 1: If the two passersby point in the same direction, the tourist follows that direction. If they point in different directions, the tourist chooses randomly.
Strategy 2: The tourist decides to follow the direction of the first passerby only.
Strategy 3: The tourist decides at random.
Which strategy would you recommend? Decide this by calculating the probability of the tourist going in the correct direction for each strategy.
Simulate the three strategies and estimate the probability of the tourist going in the correct direction. For the sake of the simulation, assume that the correct direction is the first one.
import pandas as pd
import numpy as np

# Setup the simulation
repetitions = 10

# Simulate the directions given by the two passersby and the tourist's random choice
df = pd.DataFrame({
    "passerby1": np.random.choice([1, 2], size=repetitions, p=[0.7, 0.3]),
    "passerby2": np.random.choice([1, 2], size=repetitions, p=[0.7, 0.3]),
    "random_choice": np.random.choice([1, 2], size=repetitions, p=[0.5, 0.5])
})

# Check if the two passersby agree on the direction. If yes, take the direction of the
# first passerby (both are equal). If not, take the random direction
df["strategy1"] = np.where(
    df["passerby1"] == df["passerby2"],
    df["passerby1"],
    df["random_choice"]
)

df.head()
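The cell above only builds the simulated choices. A minimal sketch of the final estimation step (assuming, as stated in the exercise, that direction 1 is the correct one):

# Estimate the probability of taking the correct direction (direction 1) under each strategy
print("Strategy 1 (follow agreement, otherwise random):", (df["strategy1"] == 1).mean())
print("Strategy 2 (follow the first passerby):", (df["passerby1"] == 1).mean())
print("Strategy 3 (choose at random):", (df["random_choice"] == 1).mean())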
Let us consider the scenarios leading to the correct choice under Strategy 1, assuming the passersby are independent and each points in the right direction with probability p and in the wrong direction with probability 1 - p.
The passersby agree and both are correct: probability p^2.
The passersby disagree, the first one is right, and the tourist’s random choice follows the first one: probability p(1 - p)/2.
The passersby disagree, the second one is right, and the tourist’s random choice follows the second one: probability (1 - p)p/2.
These events are mutually exclusive, so the probability of the tourist going in the correct direction under Strategy 1 is
p^2 + \frac{p(1 - p)}{2} + \frac{(1 - p)p}{2} = p^2 + p(1 - p) = p
With p = 0.7 this equals 0.7, the same as the probability of success under Strategy 2 (following the first passerby alone), while Strategy 3 succeeds with probability 0.5. Strategies 1 and 2 are therefore equally good, and both are better than choosing at random.
Exercise 7.11 (The Monty Hall Problem) Imagine a game with three doors. Behind one of the doors is a car and behind the other two are goats. The game proceeds as follows.
First you choose a door.
The host, who knows what is behind each door, opens one of the other two doors to reveal a goat.
You are then given the opportunity to switch to the other unopened door or stay with your original choice.
Which strategy has a higher probability of winning the car? Should you switch or should you stay with your original choice? Compute the probabilities of winning the car under each strategy.
# Simulation of the Monty Hall problem

# Setup the simulation
repetitions = 10

# Simulate the choices of the player
df = pd.DataFrame({
    "first_choice": np.random.choice([1, 2, 3], size=repetitions),
})

df["opened_door"] = np.where(
    # If the player chose door 1 (the door with the car)
    (df["first_choice"] == 1),
    # , then the opened door is randomly chosen from doors 2 and 3
    np.random.choice([2, 3], size=repetitions),
    # Otherwise
    np.where(
        # If the player chose door 2
        (df["first_choice"] == 2),
        # , then the opened door is 3 (the host does not open the door with the car)
        3,
        # Otherwise the opened door is 2
        2
    )
)

df["switch_choice"] = np.where(
    # If the opened door was 2 and the player chose 1
    (df["opened_door"] == 2) & (df["first_choice"] == 1),
    # , then the switched choice is 3
    3,
    # Otherwise
    np.where(
        # If the opened door was 3 and the player chose 1
        (df["opened_door"] == 3) & (df["first_choice"] == 1),
        # , then the switched choice is 2
        2,
        # Otherwise the switched choice is 1
        1
    )
)

df.head(n=20)
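The simulation above records the first choice and the choice after switching but does not yet estimate the winning probabilities. A minimal sketch of that last step (in this simulation the car is behind door 1, as the comments in the code assume):

# Estimate the probability of winning the car under each strategy
print("Staying with the first choice:", (df["first_choice"] == 1).mean())
print("Switching doors:", (df["switch_choice"] == 1).mean())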
Under the first strategy of no switching, the probability of winning the car is 1/3 because the car is equally likely to be behind any of the three doors.
Under the second strategy of switching, the probability of winning the car is 2/3. To see this, consider the following:
The probability of the car being behind the door you chose is 1/3.
The probability of the car being behind one of the other two doors is 2/3.
The host will always open a door with a goat behind it. Therefore, the car is behind one of the other two doors, so the probability of winning the car by switching is 2/3.
Exercise 7.12 (The Executive’s Dilemma) Consider the following problem: A struggling company needs to reduce its executive workforce from three executives to only one. The three executives are Alice, Bob, and Charlie. The company president decides to leave the decision of whom to keep to chance.
The president does not want to tell the executives who is remaining until the decision is announced. Alice insists, however, and asks the president to tell her at least the name of one person who is sacked. The president reminds Alice that the name of the person remaining will only be revealed at the official announcement, but he tells her that Bob is sacked.
Alice is happy because she thinks that the probability of her remaining given this information is 1/2 instead of 1/3 (without the information). Is Alice’s reasoning correct?
The probability of Alice remaining is 1/3 regardless of the information given by the president. The reason lies in the way the president chooses what to reveal. No matter who remains, at least one of Bob and Charlie is sacked, so the president can always name one of them; his answer therefore carries no information about Alice’s own fate.
As the president chooses the person to remain at random, the probability of Alice remaining is 1/3.
Note that the events “Alice remains” and “Alice is sacked” are a partition (mutually exclusive and exhaustive) of the sample space.
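To make this precise, assume (as a modelling choice, not stated in the exercise) that when both Bob and Charlie are sacked the president names each of them with probability 1/2. Using this partition and the Total Probability Theorem,
P(\text{names Bob}) = P(\text{names Bob}|\text{Alice remains}) \cdot P(\text{Alice remains}) + P(\text{names Bob}|\text{Alice is sacked}) \cdot P(\text{Alice is sacked}) = \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{2}
so that, by Bayes’ Theorem,
P(\text{Alice remains}|\text{names Bob}) = \frac{P(\text{names Bob}|\text{Alice remains}) \cdot P(\text{Alice remains})}{P(\text{names Bob})} = \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3}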
It is crucial to understand that the mechanism by which the president reveals the name of a sacked person matters. If the question the president answered had been “What will happen to Bob?” and the answer was “Bob is sacked”, then Alice should calculate the conditional probability of her remaining given the event that Bob is sacked. In this case the probability would be 1/2.
P(\text{Alice remains} | \text{Bob is sacked}) = \frac{P(\text{Alice remains} \cap \text{Bob is sacked})}{P(\text{Bob is sacked})} = \frac{1/3}{2/3} = \frac{1}{2}
Here P(\text{Alice remains} \cap \text{Bob is sacked}) = P(\text{Alice remains}) = 1/3, because if Alice remains then Bob is certainly sacked, and P(\text{Bob is sacked}) = 2/3, because Bob is sacked whenever he is not the one chosen to remain.