Until now we have seen some basic properties of probability. In this part of the course we will learn how to account for partial information when calculating probabilities.
Suppose that you are playing a game with two dice and you want to bet that the first die will show a 6. A friend of yours rolls the dice and tells you that the sum of the two dice is 7. What is the probability that the first die shows a 6, given that the sum of the two dice is 7? Would this additional information help you decide whether to bet on the first die showing a 6?
This is an example of a conditional probability problem.
Definition 7.1 Let A and B be two events with P(B) > 0. The conditional probability of A given B is defined as
P(A|B) = \frac{P(A \cap B)}{P(B)}
We read P(A | B) as “the probability of A given B”.
As with the probability laws we have seen before, when the outcomes in the sample space are equally likely, we simply count the outcomes of A that are also in B and divide by the number of outcomes in B.
In the dice example there are 6^2 = 36 equally likely outcomes in the sample space, each a pair (first die, second die). Let A be the event that the first die shows a 6 and let B be the event that the sum of the two dice is 7:
B = \{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)\}
You now know that one of the outcomes in B has occurred (otherwise the sum would not be equal to 7). There are 6 outcomes in B and exactly one of them is also in A, so the intersection of A and B is
A \cap B = \{(6, 1)\}
Therefore, the conditional probability of A given B is 1/6.
What about the unconditional probability of A?
P(A) = \frac{6}{36} = \frac{1}{6}
In this case the conditional probability of A given B equals the unconditional probability of A, meaning that knowing the sum was 7 does not help you decide whether to bet on the first die showing a 6 (it does not change the odds). We say that the two events are independent (under P).
Suppose, instead, that your friend tells you that the sum of the two dice is 10. What is the probability that the first die shows a 6 given that the sum of the two dice is 10? Let us denote the event that the sum is 10 by C.
In this case the event C consists of the following elements:
C = \{(4, 6), (5, 5), (6, 4)\}
The event A is the same as before, but now we are looking for the outcomes that are in both A and C. There is only one such outcome:
A \cap C = \{(6, 4)\}
Therefore, the conditional probability of A given C is
P(A|C) = \frac{\text{number of elements in } A \cap C}{\text{number of elements in } C} = \frac{1}{3}
So this time the knowledge that the sum is 10 is informative.
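Conditional probabilities with equally likely outcomes are easy to verify by brute-force enumeration. The following is a minimal sketch (not part of the original example) that lists all 36 outcomes and counts them directly:

# Enumerate the 36 equally likely outcomes of rolling two dice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# A: the first die shows 6, B: the sum is 7, C: the sum is 10
A = {o for o in outcomes if o[0] == 6}
B = {o for o in outcomes if o[0] + o[1] == 7}
C = {o for o in outcomes if o[0] + o[1] == 10}

# Conditional probabilities by counting outcomes
print(len(A & B) / len(B))  # P(A|B) = 1/6, the same as P(A)
print(len(A & C) / len(C))  # P(A|C) = 1/3, larger than P(A)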
7.1 Independence of Two Events
Definition 7.2 (Independence of Two Events) Two events A and B are said to be independent if
P(A \cap B) = P(A) \cdot P(B)
Definition 7.2 is equivalent to the conditions in the next exercise, but it has the advantage of being symmetric in A and B and of not requiring the conditioning event to have positive probability:
Exercise 7.1 (Independence and Conditional Probability) Show that, when P(B) > 0 and P(A) > 0 respectively, the definition of independence in Definition 7.2 is equivalent to each of the following conditions:
P(A|B) = P(A)
P(B|A) = P(B)
Exercise 7.2 (Implications of Independence) Show that if A and B are independent, then the following statements are true:
A and B^c are independent
A^c and B are independent
A^c and B^c are independent
Solution (click to expand)
From the definition of independence we have
P(A \cap B) = P(A) \cdot P(B)
A and B^c are independent
If A and B are independent, then it must be the case that
P(A \cap B^c) = P(A) \cdot P(B^c)
To show this, we use the fact that B and B^c are disjoint and that B \cup B^c = \Omega. This means that A = (A \cap B) \cup (A \cap B^c), where the two sets on the right are disjoint, so by additivity
P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A) \cdot P(B) = P(A) \cdot (1 - P(B)) = P(A) \cdot P(B^c)
This is exactly the condition of independence for A and B^c. The other two statements follow in the same way, exchanging the roles of A and B (and applying the first result twice for A^c and B^c).
Exercise 7.3 (Disjoint Events and Independence) Let A and B be two events such that P(A) > 0 and P(B) > 0. Show that if A and B are disjoint, then they are not independent.
7.2 Mutual Independence
The definition of independence can be extended to more than two events. However, simply requiring that P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C) does not ensure that A, B, and C are pairwise independent in the sense of Definition 7.2. Conversely, if the events are only pairwise independent, there is no guarantee that the probability of their intersection equals the product of their probabilities.
Definition 7.3 (Mutual Independence) Events A_1, A_2, \ldots, A_n are said to be mutually independent if for every subset I \subseteq \{1, 2, \ldots, n\} we have
P\left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} P(A_i)
Exercise 7.4 (Mutual Independence and Pairwise Independence) Consider a sample space with 9 equally likely outcomes, each consisting of a triple (i, j, k) where i, j, k \in \{1, 2, 3\}.
Consider the following events A_i = \{\text{i-th place of the outcome is 1}\}. Are the events A_1, A_2, and A_3 pairwise independent? Are they mutually independent?
7.3 Bayes’ Theorem
We have already seen that the conditional probability of A given B is defined as
P(A|B) = \frac{P(A \cap B)}{P(B)}
In the same way, P(B|A) = P(A \cap B) / P(A), so that P(A \cap B) = P(B|A) \cdot P(A). Substituting this into the definition of P(A|B) gives
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
This last equation is known as Bayes’ Theorem, and it turns out that it holds in a more general form than the one derived here.
Theorem 7.1 (Bayes’ Theorem) Let A and B be two events such that P(A) > 0 and P(B) > 0. Then
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
Furthermore, let A_1, A_2, \ldots, A_n be a partition of the sample space \Omega. A partition is a collection of disjoint events whose union is the entire sample space:
A_i \cap A_j = \emptyset \quad \text{for all } i \neq j \quad \text{and} \quad \bigcup_{i=1}^n A_i = \Omega
If P(A_i) > 0 for every i, then for each k
P(A_k|B) = \frac{P(B|A_k) \cdot P(A_k)}{\sum_{i=1}^n P(B|A_i) \cdot P(A_i)}
To understand the derivation of Bayes’ Theorem in this general form, first see that it holds for a very simple partition. The collection of events A and A^c is a partition of the sample space. In this case, Bayes’ Theorem becomes
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|A^c) \cdot P(A^c)}
Exercise 7.5 (Total Probability and Complements) Let A and B be some events such that P(A) > 0 and P(B) > 0. Show that
P(B) = P(B|A) \cdot P(A) + P(B|A^c) \cdot P(A^c)
Note that the sets A and A^c form a partition of the sample space because A \cap A^c = \emptyset and A \cup A^c = \Omega.
Theorem 7.2 (Total Probability Theorem) Let A_1, A_2, \ldots, A_n be a partition of the sample space \Omega. For any event B such that P(B) > 0, we have
P(B) = \sum_{i=1}^n P(B|A_i) \cdot P(A_i)
Proof. Because B, like any event, is a subset of \Omega, the intersection of B with the sample space is B itself:
B = B \cap \Omega
Because the A_i are a partition of the sample space, their union is the sample space:
\Omega = \bigcup_{i=1}^n A_i
In this way we can represent B as the union of the intersections of B with the A_i:
B = B \cap \Omega = B \cap \left( \bigcup_{i=1}^n A_i \right) = \bigcup_{i=1}^n (B \cap A_i)
The last step follows from the distributive property of the intersection over the union. The sets B \cap A_i are disjoint because the A_i are disjoint, so we can use the additivity of the probability measure and the definition of conditional probability to write
P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B|A_i) \cdot P(A_i)
Exercise 7.6 (Roadside Drug Tests) Imagine a city called Virtue City with 10,000 drivers on the road on any given day. Out of these drivers, 100 use drugs while driving.
The police department introduces a new test for narcotics that has a 95 percent probability of correctly identifying a person who has used narcotics (true positive). There is a one percent probability of falsely identifying a person who has not used narcotics as having used them (false positive).
The police stop a driver, his test comes back positive, and the driver’s car is confiscated. The police claim that there is a 95 percent probability that the driver is a drug user. The driver objects and says that the probability of him being a drug user is much lower. Who is right?
Approach this problem in two ways:
Counting
How many people will have a positive test result if all 10,000 drivers are tested?
What proportion of the positive tests belong to actual drug users?
Apply Bayes’ Theorem
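As a cross-check, here is a minimal sketch of both approaches in code, using only the numbers stated in the exercise (10,000 drivers, 100 drug users, a 95 percent true positive rate, and a 1 percent false positive rate):

# Counting: expected test results if all 10,000 drivers are tested
users, non_users = 100, 9_900
true_positives = 0.95 * users        # drug users correctly flagged
false_positives = 0.01 * non_users   # non-users incorrectly flagged

# Proportion of positive tests that belong to actual drug users
print(true_positives / (true_positives + false_positives))

# Bayes' Theorem, with the Total Probability Theorem in the denominator
p_user = users / (users + non_users)
p_pos_given_user, p_pos_given_non_user = 0.95, 0.01
p_user_given_pos = (p_pos_given_user * p_user) / (
    p_pos_given_user * p_user + p_pos_given_non_user * (1 - p_user)
)
print(p_user_given_pos)  # roughly 0.49, far below the claimed 0.95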
Exercise 7.7 (Spam Filtering) If you have ever used email, then you have probably encountered spam emails. Spam emails are unsolicited messages that are commonly sent in bulk to many recipients. Most email service providers try to protect their users from spam by filtering out these messages. One way to do this is to train an algorithm to recognize spam emails based on the content of the email.
Suppose that the filter receives an email containing the word “Inheritance”. Suppose that 5 percent of spam emails contain this word, while it is only present in 0.1 percent of non-spam emails. Suppose that 80 percent of the emails that the filter receives are spam.
What is the probability that an email containing the word “Inheritance” is spam?
Exercise 7.8 (Decomposition of the Total Probability) Prove the following statements:
P(B) = 1 \implies P(A|B) = P(A) for any event A.
A \subset B \implies P(B|A) = 1 and P(A|B) = P(A)/P(B)
If A \cap B = \emptyset, then
P(A | A \cup B) = \frac{P(A)}{P(A) + P(B)}
Theorem 7.3 (Chain Rule) Let A, B, and C be events such that P(C) > 0 and P(B \cap C) > 0. Then
P(A \cap B \cap C) = P(A|B\cap C) P(B|C) P(C)
Proof (click to expand)
The proof requires nothing more than the definition of conditional probability. Let D = B \cap C. Then we can write the intersection of the three events as
P(A \cap B \cap C) = P(A \cap D) = P(A|D) P(D) = P(A|B \cap C) P(B \cap C)
Apply the definition of conditional probability to the last term:
P(A|B \cap C) P(B \cap C) = P(A|B \cap C) P(B|C) P(C)
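The same argument, applied repeatedly, extends the chain rule to any finite number of events (with the analogous positivity conditions):
P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1) \cdot P(A_2|A_1) \cdot P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap A_2 \cap \ldots \cap A_{n-1})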
7.4 An Application of the Chain Rule
Recent advances in natural language processing have made possible applications such as chatbots that can interact with users in a natural way. Earlier probabilistic language models were based on the probability of a sentence.
Consider a simple sentence
“I am happy for you”.
One way to estimate the probability of this sentence is to count how many times it appears in a large corpus of text. However, this approach is very limited because long sequences of words are unlikely to appear in any corpus more than once and may not appear at all.
One way to work around this limitation is to express the probability of a sentence (joint occurrence of all the words in the sentence) as the product of the conditional probabilities of each word given the previous words using the chain rule.
The conditional probabilities become much easier to estimate if we simplify further and condition each word only on the word immediately before it. We then simply need to count how many times each word appears after the previous word in a large corpus of text.
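For the example sentence, the chain rule in its general form gives
P(\text{I am happy for you}) = P(\text{I}) \cdot P(\text{am}|\text{I}) \cdot P(\text{happy}|\text{I am}) \cdot P(\text{for}|\text{I am happy}) \cdot P(\text{you}|\text{I am happy for})
and the simplification of conditioning only on the previous word replaces this product by P(\text{I}) \cdot P(\text{am}|\text{I}) \cdot P(\text{happy}|\text{am}) \cdot P(\text{for}|\text{happy}) \cdot P(\text{you}|\text{for}).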
%pip install nltk

import nltk

nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.corpus import gutenberg
from nltk.probability import FreqDist
from nltk.util import ngrams

# Import Alice in Wonderland from the Gutenberg corpus
alice = gutenberg.raw('carroll-alice.txt')
# Use FreqDist to find the frequency of the specific sequence
sequence = "Alice said hello"

# Create n-grams from the tokens
sequence_tokens = nltk.word_tokenize(sequence)
tokens = nltk.word_tokenize(alice)
ngrams_list = list(ngrams(tokens, len(sequence_tokens)))

fdist = FreqDist(ngrams_list)
sequence_frequency = fdist[tuple(sequence_tokens)]

print(f"The sequence '{sequence}' appears {sequence_frequency} times in the text.")
The sequence 'Alice said hello' appears 0 times in the text.
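Estimating a conditional probability for a single word given the previous word is much easier. A minimal sketch, reusing the tokens variable from the cell above, counts bigrams and unigrams to estimate, for example, P(\text{said}|\text{Alice}):

# Count bigrams (pairs of consecutive tokens) and single tokens
bigram_fdist = FreqDist(ngrams(tokens, 2))
unigram_fdist = FreqDist(tokens)

# Estimate P("said" | "Alice") as count("Alice said") / count("Alice")
p_said_given_alice = bigram_fdist[("Alice", "said")] / unigram_fdist["Alice"]
print(f"Estimated P('said' | 'Alice') = {p_said_given_alice:.4f}")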
Exercise 7.9 (Quality Control) A factory produces light bulbs. To ensure the quality of the bulbs shipped to customers, the factory has a quality control system that checks a sample of the bulbs before they are shipped. Consider a shipment of 100 bulbs. You take a sample (without replacement) of 5 bulbs and stop the shipment if at least one of the sampled bulbs is defective. Suppose that a shipment contains 20 defective bulbs. What is the probability that the shipment will be stopped?
# Simulate a sample from a shipment of 100 light bulbs
import numpy as np

sample_size = 5
repetitions = 10

# 0 represents a good light bulb, 1 represents a defective light bulb
shipment = np.array([0] * 80 + [1] * 20)

print("The shipment of light bulbs")
print(shipment)
print("\n")

rejections = 0

print("Some of the selected samples")
for i in range(repetitions):
    sample = np.random.choice(shipment, size=sample_size, replace=False)
    # Print the sample once every 1000 repetitions
    if i % 1000 == 0:
        print(sample)
    if np.sum(sample) > 0:
        rejections += 1

print("\n")
print(f"The proportion of shipments rejected is {rejections / repetitions}")
We need to compute the probability that at least one of the sampled bulbs is defective, given that there are 20 defective bulbs in the shipment of 100 bulbs. We can define the event A as the event that at least one of the sampled bulbs is defective.
A = \text{at least one defective bulb in the sample}
Let A_i, for i = 1, \ldots, 5, be the event that the i-th bulb in the sample is defective. Then we have
A = A_1 \cup A_2 \cup \ldots \cup A_{5}
It is easy to compute the probability of the complement of A using de Morgan’s law:
A^c = A_1^c \cap A_2^c \cap \ldots \cap A_5^c
that is, the event that none of the sampled bulbs is defective. By the chain rule,
P(A^c) = \frac{80}{100} \cdot \frac{79}{99} \cdot \frac{78}{98} \cdot \frac{77}{97} \cdot \frac{76}{96} \approx 0.319
so the probability that the shipment is stopped is P(A) = 1 - P(A^c) \approx 0.681.
# Compute the probability of a shipment being rejected
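# A minimal sketch of the exact calculation (assuming, as in the exercise, 20 defective
# bulbs out of 100 and a sample of 5 drawn without replacement)
from math import prod

# P(A^c): no defective bulb in the sample, computed with the chain rule
p_no_defective = prod((80 - i) / (100 - i) for i in range(5))

print(f"The probability that the shipment is stopped is {1 - p_no_defective:.4f}")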
Exercise 7.10 (A Bifurcation in the Road) A tourist stands at a bifurcation in the road and does not know which is the correct way to go. He asks two passersby. Each passerby points him in the correct direction with probability 0.7, independently of the other. We want to propose a strategy for the tourist to decide which way to go.
Strategy 1: If the two passersby point in the same direction, the tourist follows that direction. If they point in different directions, the tourist chooses randomly.
Strategy 2: The tourist decides to follow the direction of the first passerby only.
Strategy 3: The tourist decides at random.
Which strategy would you recommend? Decide this by calculating the probability of the tourist going in the correct direction for each strategy.
Simulate the three strategies and estimate the probability of the tourist going in the correct direction. For the sake of the simulation, assume that the correct direction is the first one.
import pandas as pd
import numpy as np

# Setup the simulation
repetitions = 10

# Simulate the directions given by the two passersby and the tourist's random choice
df = pd.DataFrame({
    "passerby1": np.random.choice([1, 2], size=repetitions, p=[0.7, 0.3]),
    "passerby2": np.random.choice([1, 2], size=repetitions, p=[0.7, 0.3]),
    "random_choice": np.random.choice([1, 2], size=repetitions, p=[0.5, 0.5])
})

# Check if the two passersby agree on the direction. If yes, take the direction of the
# first passerby (both are equal). If not, take the random direction
df["strategy1"] = np.where(
    df["passerby1"] == df["passerby2"],
    df["passerby1"],
    df["random_choice"]
)

df.head()
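The cell above only builds the simulated choices. A minimal sketch of the final estimation step (assuming, as stated in the exercise, that direction 1 is the correct one):

# Estimate the probability of taking the correct direction (direction 1) under each strategy
print("Strategy 1 (follow agreement, otherwise random):", (df["strategy1"] == 1).mean())
print("Strategy 2 (follow the first passerby):", (df["passerby1"] == 1).mean())
print("Strategy 3 (choose at random):", (df["random_choice"] == 1).mean())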
Let us consider the scenarios leading to the correct choice under Strategy 1, assuming the passersby are independent and each points in the right direction with probability p and in the wrong direction with probability 1 - p.
The passersby agree and both are correct: probability p^2.
The passersby disagree, the first one is right, and the tourist’s random choice follows the first one: probability p(1 - p)/2.
The passersby disagree, the second one is right, and the tourist’s random choice follows the second one: probability (1 - p)p/2.
These events are mutually exclusive, so the probability of the tourist going in the correct direction under Strategy 1 is
p^2 + \frac{p(1 - p)}{2} + \frac{(1 - p)p}{2} = p^2 + p(1 - p) = p
With p = 0.7 this equals 0.7, the same as the probability of success under Strategy 2 (following the first passerby alone), while Strategy 3 succeeds with probability 0.5. Strategies 1 and 2 are therefore equally good, and both are better than choosing at random.
Exercise 7.11 (The Monty Hall Problem) Imagine a game with three doors. Behind one of the doors is a car and behind the other two are goats. The game proceeds as follows.
First you choose a door.
The host, who knows what is behind each door, opens one of the other two doors to reveal a goat.
You are then given the opportunity to switch to the other unopened door or stay with your original choice.
Which strategy has a higher probability of winning the car? Should you switch or should you stay with your original choice? Compute the probabilities of winning the car under each strategy.
# Simulation of the Monty Hall problem

# Setup the simulation
repetitions = 10

# Simulate the choices of the player
df = pd.DataFrame({
    "first_choice": np.random.choice([1, 2, 3], size=repetitions),
})

df["opened_door"] = np.where(
    # If the player chose door 1 (the door with the car)
    (df["first_choice"] == 1),
    # , then the opened door is randomly chosen from doors 2 and 3
    np.random.choice([2, 3], size=repetitions),
    # Otherwise
    np.where(
        # If the player chose door 2
        (df["first_choice"] == 2),
        # , then the opened door is 3 (the host does not open the door with the car)
        3,
        # Otherwise the opened door is 2
        2
    )
)

df["switch_choice"] = np.where(
    # If the opened door was 2 and the player chose 1
    (df["opened_door"] == 2) & (df["first_choice"] == 1),
    # , then the switched choice is 3
    3,
    # Otherwise
    np.where(
        # If the opened door was 3 and the player chose 1
        (df["opened_door"] == 3) & (df["first_choice"] == 1),
        # , then the switched choice is 2
        2,
        # Otherwise the switched choice is 1
        1
    )
)

df.head(n=20)
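The simulation above records the first choice and the choice after switching but does not yet estimate the winning probabilities. A minimal sketch of that last step (in this simulation the car is behind door 1, as the comments in the code assume):

# Estimate the probability of winning the car under each strategy
print("Staying with the first choice:", (df["first_choice"] == 1).mean())
print("Switching doors:", (df["switch_choice"] == 1).mean())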
Under the first strategy of no switching, the probability of winning the car is 1/3 because the car is equally likely to be behind any of the three doors.
Under the second strategy of switching, the probability of winning the car is 2/3. To see this, consider the following:
The probability of the car being behind the door you chose is 1/3.
The probability of the car being behind one of the other two doors is 2/3.
The host will always open a door with a goat behind it. Therefore, the car is behind one of the other two doors, so the probability of winning the car by switching is 2/3.
Exercise 7.12 (The Executive’s Dilemma) Consider the following problem: A struggling company needs to reduce its executive workforce from three executives to only one. The three executives are Alice, Bob, and Charlie. The company president decides to leave the decision of whom to keep to chance.
The president does not want to tell the executives who is remaining until the decision is announced. Alice insists, however, and asks the president to tell her at least the name of one person who is sacked. The president reminds Alice that the name of the person remaining will only be revealed at the official announcement, but he tells her that Bob is sacked.
Alice is happy because she thinks that the probability of her remaining given this information is 1/2 instead of 1/3 (without the information). Is Alice’s reasoning correct?
The probability of Alice remaining is 1/3 regardless of the information given by the president. The reason lies in the way the president chooses what to reveal. No matter who remains, at least one of Bob and Charlie is sacked, so the president can always name one of them; his answer therefore carries no information about Alice’s own fate.
As the president chooses the person to remain at random, the probability of Alice remaining is 1/3.
Note that the events “Alice remains” and “Alice is sacked” are a partition (mutually exclusive and exhaustive) of the sample space.
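To make this precise, assume (as a modelling choice, not stated in the exercise) that when both Bob and Charlie are sacked the president names each of them with probability 1/2. Using this partition and the Total Probability Theorem,
P(\text{names Bob}) = P(\text{names Bob}|\text{Alice remains}) \cdot P(\text{Alice remains}) + P(\text{names Bob}|\text{Alice is sacked}) \cdot P(\text{Alice is sacked}) = \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{2}
so that, by Bayes’ Theorem,
P(\text{Alice remains}|\text{names Bob}) = \frac{P(\text{names Bob}|\text{Alice remains}) \cdot P(\text{Alice remains})}{P(\text{names Bob})} = \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3}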
It is crucial to understand that the mechanism by which the president reveals the name of a sacked person matters. If the question the president answered had been “What will happen to Bob?” and the answer was “Bob is sacked”, then Alice should calculate the conditional probability of her remaining given the event that Bob is sacked. In this case the probability would be 1/2.
P(\text{Alice remains} | \text{Bob is sacked}) = \frac{P(\text{Alice remains} \cap \text{Bob is sacked})}{P(\text{Bob is sacked})} = \frac{1/3}{2/3} = \frac{1}{2}
Here P(\text{Alice remains} \cap \text{Bob is sacked}) = P(\text{Alice remains}) = 1/3, because if Alice remains then Bob is certainly sacked, and P(\text{Bob is sacked}) = 2/3, because Bob is sacked whenever he is not the one chosen to remain.