# Introduction to basic probability

An overview of fundamental probability concepts

• Events are colored $\textcolor{green}{this}$ for emphasis

# Terminology

• Experiment: a process with a well-defined set of possible outcomes that can be infinitely repeated. An experiment is deterministic if there’s only one possible outcome, and random otherwise.
• Outcome: the result of an experiment
• Sample space: the set of all possible outcomes of an experiment, commonly denoted $S$ or $\Omega$, sometimes referred to as the universe
• Event: a subset of an experiment’s possible outcomes (i.e. the sample space). An event occurs if the experiment’s outcome is an element of the event. Events are the entities to which probabilities are assigned, as they include any possible combination of outcomes about which we might want to reason.
• Elementary event: an event containing a single outcome. Also known as an atomic event, sample point, or singleton. This is the formalized notion of a single outcome in terms of events.
• Complementary event: the complement of some event $E$, denoted $E^C$, is the event defined by the set of outcomes in the difference between the universe and event $E$: $E^C = \Omega \backslash E$
• Trials: sequence of experiments that are all independent and identical with the same sample space

## Example

Here is a simple example to provide context for the terminology introduced above:

Consider the process of rolling a single six-sided die. This process is a random experiment with a known set of possible outcomes (the six possible values that can end up on the top face of the die). This set of possible results is called the experiment’s sample space. If we let an integer represent the number of dots (pips) present on the top face of the die, then this experiment’s sample space can be written explicitly as $\Omega = \{1,2,3,4,5,6\}$. Perhaps we now want to reason about the probability of rolling an even value. This situation can be considered an event, call it $E$, that represents the set of outcomes $\{2,4,6\}$. Following the definition of events, we have $E \subseteq \Omega$, and $E$ is said to occur when one of the values $2$, $4$, or $6$ is rolled. The complementary event of $E$ is the event of rolling an odd number: $E^C = \Omega \backslash E = \{1,2,3,4,5,6\} \backslash \{2,4,6\} = \{1,3,5\}$.
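The die example maps naturally onto set operations. The snippet below is a minimal sketch using Python's built-in sets; the names `omega`, `E`, `E_complement`, and `occurs` are illustrative choices, not part of any library.

```python
# Sample space for one roll of a six-sided die.
omega = {1, 2, 3, 4, 5, 6}

# Event E: an even value is rolled.
E = {outcome for outcome in omega if outcome % 2 == 0}

# Complementary event: every outcome in omega that is not in E.
E_complement = omega - E


def occurs(event, outcome):
    """An event occurs when the observed outcome is one of its elements."""
    return outcome in event


print(E <= omega)            # True: an event is a subset of the sample space
print(sorted(E_complement))  # [1, 3, 5]
print(occurs(E, 4))          # True: rolling a 4 means E occurred
```

Representing events as sets keeps the terminology concrete: "subset", "complement", and "occurs" each become a single set operation.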

# Axioms

How exactly do we define and interpret the probability of an event? This question is much more involved than it may first seem (see probability interpretations, probability measure). For now we’ll begin with standard axioms that describe expected qualities and behaviors about probability that reinforce our intuition.

1. The probability of an event is a real number between $0$ and $1$:

$0 \le P(\textcolor{green}{A}) \le 1 , \forall \textcolor{green}{A} \in F$

where $F$ is a set of events. This is a basic statement about the numerical values we assign to probabilities; $0$ to indicate the event cannot occur, and $1$ to indicate certainty in the event’s occurrence.

2. The probability of at least one outcome occurring is $1$. That is, the probability of the event that is the sample space occurring is $1$:

$P(\Omega) = 1$

Note: The sample space is an event just like any other collection of outcomes. Thus, the “sample space event” occurs when any outcome is observed, as the sample space includes every possible outcome.

3. If $A_1, A_2, \dots$ are mutually disjoint events, then

$P\Big(\bigcup_{i=1}^{\infty}{\textcolor{green}{A}_i}\Big) = \sum_{i=1}^{\infty}{P(\textcolor{green}{A}_i)}$

This just says that when we combine disjoint events (i.e. take their union), the probability of the newly combined event is the sum of the individual events’ probabilities. Here disjoint events refers to events that do not share any outcomes, and thus cannot occur jointly.
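To make the axioms concrete, they can be checked exhaustively on a small sample space. The sketch below assumes a fair six-sided die, so that the probability of an event is its size divided by six; that uniform assignment and the helper name `prob` are assumptions made for illustration. Only the finite case of axiom 3 (two disjoint events) is checked.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})


def prob(event):
    """Probability under an assumed fair die: |event| / |omega|."""
    return Fraction(len(event & omega), len(omega))


# Every event is a subset of omega; enumerate all 2^6 = 64 of them.
all_events = [
    frozenset(c)
    for r in range(len(omega) + 1)
    for c in combinations(sorted(omega), r)
]

# Axiom 1: every probability lies between 0 and 1.
axiom1 = all(0 <= prob(a) <= 1 for a in all_events)

# Axiom 2: the sample space itself has probability 1.
axiom2 = prob(omega) == 1

# Axiom 3 (two-event case): probabilities of disjoint events add.
axiom3 = all(
    prob(a | b) == prob(a) + prob(b)
    for a in all_events
    for b in all_events
    if not (a & b)
)

print(axiom1, axiom2, axiom3)  # True True True
```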

# Consequences of the Axioms

The axioms of a system represent the fundamental building blocks from which every other theorem or statement can be derived. The following statements represent commonly used rules that can be obtained using only the three statements above. As with the axioms, these rules align well with our intuition.

1. Complementary events:

$P(A^C) = 1-P(A)$

where $\textcolor{green}{A}^C$ is the complement of event $\textcolor{green}{A}$. This rule follows from the mutual exclusivity of events $A$ and $A^C$ (by the definition of complementary events): axiom 3 gives $P(A) + P(A^C) = P(A \cup A^C)$, and since $A \cup A^C = \Omega$, axiom 2 gives $P(A \cup A^C) = P(\Omega) = 1$.

2. Monotonicity:

$\text{If } A \subseteq B \text{, then } P(A) \le P(B)$

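This statement makes intuitive sense: when one event $A$ contains only a subset of the outcomes of another event $B$, then we expect that $A$ must be no more likely to occur than $B$. Formally, $B$ is the disjoint union of $A$ and $B \backslash A$, so axiom 3 gives $P(B) = P(A) + P(B \backslash A) \ge P(A)$, since $P(B \backslash A) \ge 0$ by axiom 1.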

3. Probability of the union of events:

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

This says that the probability of the union of events $A$ and $B$ is the probability of $A$ plus the probability of $B$, minus their joint probability of occurring. Why does this make sense, and how do we get here from just the axioms?

Firstly, we know that $P(A \cup B) = P(A) + P(B)$ when events $A$ and $B$ are disjoint, following directly from axiom 3. But when $A$ and $B$ aren’t disjoint, they necessarily share some outcomes. That is, $A \cap B \neq \emptyset$. Visually, overlapping events might look as follows:

[Figure: Venn diagram of two overlapping events $A$ and $B$]

where the colored mass in this diagram represents the entirety of $A \cup B$. Since the events share some outcomes, the sum $P(A) + P(B)$ will count the probability of overlapping outcomes twice, once for each of the events’ probabilities. This can be seen visually:

[Figure: the sum $P(A) + P(B)$, with the overlap $A \cap B$ shaded twice]

Here we can see the dark green slice representing $A \cap B$ is counted twice in the final sum. In order to resolve this, we must subtract off one occurrence of the overlapping outcomes so that we only count them once in the final sum. Since the overlapping outcomes are given by the event $A \cap B$, we are left with the final equation $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. The following diagram puts everything together:

[Figure: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ assembled from the diagrams above]

We’ve arrived at the justification for this equation mainly by intuition. If this approach doesn’t quite satisfy you, a slightly more formal proof writes out the visual steps above more explicitly: split $A \cup B$ into the disjoint pieces $A \backslash B$, $A \cap B$, and $B \backslash A$, then apply axiom 3.

Note that the equation $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ is often referred to as the sum rule, and can be generalized to any number of events $A_1, \dots, A_n$ (the inclusion-exclusion principle) as follows:

$P\Big(\bigcup^n_{i=1}{A_i}\Big) = \sum_{r=1}^n{(-1)^{r+1}\sum_{i_1<\cdots< i_r}{P(A_{i_1} \cap \cdots \cap A_{i_r})}}$
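The three rules above, including the general formula for $n$ events, can be spot-checked numerically. The following is a minimal sketch assuming a fair die; the events `A`, `B`, `C` and the helper names `prob` and `union_prob` are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))


def prob(event):
    """Probability under an assumed fair die."""
    return Fraction(len(event), len(omega))


def union_prob(events):
    """P(A_1 u ... u A_n) via the alternating sum over all r-fold intersections."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)
        for group in combinations(events, r):
            total += sign * prob(frozenset.intersection(*group))
    return total


A = frozenset({2, 4, 6})  # even
B = frozenset({2, 3, 5})  # prime
C = frozenset({1, 2})     # at most two

# Complement rule: P(A^C) = 1 - P(A).
print(prob(omega - A) == 1 - prob(A))                  # True
# Monotonicity: {2} is a subset of A.
print(prob(frozenset({2})) <= prob(A))                 # True
# Sum rule for two events.
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))  # True
# General formula for three events.
print(union_prob([A, B, C]) == prob(A | B | C))        # True
```

Using exact fractions rather than floats avoids rounding noise, so the equalities can be compared directly.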

# Conditional Probability

Conditional probabilities allow us to reason about the likelihood of events given some prior information (i.e. the prior occurrence of another event). It is often the case that this prior information will restrict the sample space or impact our beliefs in some way, leading to an updated probability for the event we care about. We often write conditional probabilities using the vertical bar “$|$” such that $P(A|B)$ stands for “the conditional probability of $A$ given $B$”. The mathematical definition, valid whenever $P(B) > 0$, looks as follows:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

It’s worth expanding on the intuition behind this definition. In the numerator we see $P(A \cap B)$, which is the probability that both events $A$ and $B$ occur. In the denominator we have $P(B)$, which of course is just the probability of event $B$ occurring. By dividing these two values, we are effectively redefining the universe to the outcomes that make up event $B$. That is, since we know $B$ has occurred, the possible outcomes it contains define a reduced sample space within which to consider event $A$. We can interpret this sample space reduction visually as follows:

On the left, we see two overlapping events $A$ and $B$ defined in a standard sample space $\Omega$. Here the fraction of colored area to the area of the black outer rectangle represents the probability $P(A \cup B)$. The reduction of the sample space when thinking about $P(A|B)$ can be seen by the transformation to the right diagram; knowing about $B$’s occurrence allows us to define it as the new sample space, seen by the darker black border. The colored area given by $A \cap B$ represents the remaining sliver of $A$ that’s in the reduced sample space, and the conditional probability $P(A|B)$ is the fraction of colored area to area of the outer black border.

## Conditional Probability Example

Here we extend the example above to better illustrate the idea of conditional probability and its formulation. Take the same instance as before, where we’ve rolled a die and wish to reason about the probability of having rolled an even value. We’ll call this event $E$, representing the set of outcomes $\{2,4,6\}$. Under the assumption of a fair die, it’s easy to see that the probability of $E$ is $\frac{1}{2}$ (all outcomes are equally likely, and our event represents three of the six possible outcomes). Now suppose that, before we’ve observed the outcome of the roll, we’ve been given the information that the value rolled is a prime number. That is, the event $C = \{2,3,5\}$ has occurred. What then can we say about the probability of the event $E$?

Here we’re asking about the conditional probability $P(E|C)$. Since event $C$ has occurred, its set of outcomes $\{2,3,5\}$ makes up a new sample space within which to consider the event $E$. That is, we no longer care about the original sample space $\Omega = \{1,2,3,4,5,6\}$ now that we know $C$ has occurred, since it tells us the outcome must be one of the values $2, 3$, or $5$. This gives us new insight into the likelihood of event $E$, and with this context we can reevaluate what it means to roll an even value. The number $2$ is the only even value among the three possible prime values $\{2,3,5\}$, and thus $P(E|C) = \frac{1}{3}$. The definition gives the same answer: $P(E|C) = \frac{P(E \cap C)}{P(C)} = \frac{1/6}{1/2} = \frac{1}{3}$.
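The same calculation can be written directly from the definition $P(A|B) = P(A \cap B) / P(B)$. This is a small sketch assuming a fair die; `prob` and `cond_prob` are illustrative helper names.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))


def prob(event):
    """Probability under an assumed fair die."""
    return Fraction(len(event), len(omega))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); undefined when P(b) is zero."""
    if prob(b) == 0:
        raise ValueError("conditioning event must have positive probability")
    return prob(a & b) / prob(b)


E = frozenset({2, 4, 6})  # even value
C = frozenset({2, 3, 5})  # prime value

print(prob(E))          # 1/2
print(cond_prob(E, C))  # 1/3
```

Learning that the roll is prime lowers the probability of an even value from $\frac{1}{2}$ to $\frac{1}{3}$.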

## $P(\cdot \vert F)$ is a probability

An important observation is that conditional probabilities can be treated just like normal probabilities. The form $P(\cdot|F)$ satisfies all three axioms just like $P(\cdot)$:

1. The probability of an event $E$ given $F$ is a real number between $0$ and $1$:

$0 \le P(E | F) \le 1$

We can prove this by expanding the conditional probability

$0 \le \frac{P(E \cap F)}{P(F)} \le 1$

The left inequality is simple: no probability can be less than zero, so the ratio of $P(E \cap F) \ge 0$ to $P(F) > 0$ is non-negative. The right inequality follows from the fact that $E \cap F \subseteq F$, so by monotonicity $P(E \cap F) \le P(F)$, implying $P(E \cap F) / P(F)$ can be at most one.

2. The probability of the sample space occurring is still one when conditioning on some event $F$; at least one outcome will still occur:

$P(\Omega|F) = 1$

Proving this is fairly straightforward. Expanding the conditional probability, we have

$P(\Omega|F) = \frac{P(\Omega \cap F)}{P(F)} = \frac{P(F)}{P(F)} = 1$

following directly from the definition of the sample space, implying $F \subseteq \Omega$ for any event $F$.

3. For mutually exclusive events $E_1, E_2, \dots$ and some event $F$ being conditioned on,

$P\Big(\bigcup_{i=1}^{\infty}{E_i} \,\Big|\, F\Big) = \sum_{i=1}^{\infty}{P(E_i|F)}$

I’ll leave out the proof for how this statement matches the third axiom, but it’s fairly straightforward to convince yourself why it might hold given some expansion of the conditional probability.
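The three properties above can be spot-checked on the die example by conditioning on a fixed event. The sketch below assumes a fair die and conditions on the prime event $\{2,3,5\}$; it checks only the two-event case of the third property, and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))
F = frozenset({2, 3, 5})  # conditioning event (the primes)


def prob(event):
    """Probability under an assumed fair die."""
    return Fraction(len(event), len(omega))


def cond_prob(event, given):
    """P(event | given), assuming P(given) > 0."""
    return prob(event & given) / prob(given)


all_events = [
    frozenset(c)
    for r in range(len(omega) + 1)
    for c in combinations(sorted(omega), r)
]

# Property 1: conditional probabilities lie between 0 and 1.
axiom1 = all(0 <= cond_prob(e, F) <= 1 for e in all_events)
# Property 2: the sample space still has conditional probability 1.
axiom2 = cond_prob(omega, F) == 1
# Property 3 (two-event case): disjoint events add, conditional on F.
axiom3 = all(
    cond_prob(a | b, F) == cond_prob(a, F) + cond_prob(b, F)
    for a in all_events
    for b in all_events
    if not (a & b)
)
print(axiom1, axiom2, axiom3)  # True True True

# A derived rule carries over too: the complement rule, conditional on F.
E = frozenset({2, 4, 6})
print(cond_prob(omega - E, F) == 1 - cond_prob(E, F))  # True
```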

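Why is this important? Because $P(\cdot|F)$ satisfies the axioms, every rule derived from them carries over to conditional probabilities. For example, $P(A^C|F) = 1 - P(A|F)$ and $P(A \cup B|F) = P(A|F) + P(B|F) - P(A \cap B|F)$.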

# Joint Probability

Joint probability gives the likelihood of two (or more) events occurring together. We can arrive at a definition by simply rearranging the conditional probability formula and solving for the intersection of two events. Multiplying both sides of $P(A|B) = \frac{P(A \cap B)}{P(B)}$ by $P(B)$ gives $P(A \cap B) = P(A|B)P(B)$. Since $A \cap B = B \cap A$, the same step applied to $P(B|A) = \frac{P(A \cap B)}{P(A)}$ gives $P(A \cap B) = P(B|A)P(A)$. Putting the two together:

$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$

This definition makes intuitive sense: when considering the likelihood of both events $A$ and $B$ occurring, we can consider having found out about the events occurring in an ordered fashion.

We might first hear about event $A$’s occurrence, in which case we’d just want to know $P(A)$, unconditional on any other event. Upon finding out about event $B$’s occurrence, instead of finding $P(B)$, we must look for $P(B|A)$ since we already know about $A$’s occurrence. This is equivalent to interpreting the joint probability as “the probability of event $A$ times the probability of $B$ occurring given that $A$ has occurred”. The same goes for the other way around. An important thing to note here is that we’re not claiming that one event actually occurs before another; for all we know they occur at exactly the same time (e.g. picking a playing card that is both a red suit and a king). Instead, we’re reasoning about the joint probability of multiple events as if we found out about each event’s occurrence in some chronological order, in which case it makes more intuitive sense to think about conditioning events on each other. Note also that under this interpretation it shouldn’t matter which event we decide to reason about first, since our observation does not necessarily reflect the order in which the actual events occurred. This indifference is reflected in the formula given above; the joint probability is the same, regardless of which event we use to condition.

This can be generalized to an arbitrary number of events using the chain rule.

## Multiplication Principle (Chain Rule)

This principle allows us to compute the probability of the intersection of an arbitrary number of events:

$P(E_1 \cap \cdots \cap E_n) = P(E_1)P(E_2|E_1)\cdots P(E_n|E_1,\dots,E_{n-1})$

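This follows from repeatedly applying the two-event formula $P(A \cap B) = P(A)P(B|A)$. Taking $A = E_1 \cap \cdots \cap E_{n-1}$ and $B = E_n$ gives $P(E_1 \cap \cdots \cap E_n) = P(E_1 \cap \cdots \cap E_{n-1})P(E_n|E_1,\dots,E_{n-1})$, and applying the same step to $P(E_1 \cap \cdots \cap E_{n-1})$ peels off one factor at a time until only $P(E_1)$ remains. Here $P(E_n|E_1,\dots,E_{n-1})$ is shorthand for conditioning on the intersection $E_1 \cap \cdots \cap E_{n-1}$.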
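The chain rule can be sketched as a loop that multiplies one conditional factor per event while tracking the intersection of the events seen so far. This assumes a fair die; the three events and the function name `chain_rule` are hypothetical, chosen only to illustrate the formula.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))


def prob(event):
    """Probability under an assumed fair die."""
    return Fraction(len(event), len(omega))


def chain_rule(events):
    """P(E_1 and ... and E_n) = P(E_1) P(E_2|E_1) ... P(E_n|E_1,...,E_{n-1})."""
    product = Fraction(1)
    seen = omega  # intersection of the events handled so far
    for event in events:
        if prob(seen) == 0:
            return Fraction(0)  # the running intersection is already impossible
        product *= prob(event & seen) / prob(seen)  # P(event | seen)
        seen = seen & event
    return product


events = [
    frozenset({2, 4, 6}),  # even
    frozenset({2, 3, 5}),  # prime
    frozenset({1, 2}),     # at most two
]

print(chain_rule(events))                                           # 1/6
print(chain_rule(events) == prob(frozenset.intersection(*events)))  # True
```

Starting with `seen = omega` makes the first factor $P(E_1|\Omega) = P(E_1)$, so the loop needs no special case for the first event.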

## Joint Probability Example

Again extending the die example, we can further see how the joint probability might be applied. Here we’ll reuse the events introduced earlier: $E = \{2,4,6\}$ for rolling an even value, $C = \{2,3,5\}$ for rolling a prime value. Suppose now we want to reason about the probability of $E$ and $C$ occurring jointly. In this case we can just use the formula given above. Notice, however, that we need to decide on how we’re going to calculate the probabilities: we can use either form $P(E|C)P(C)$ or $P(C|E)P(E)$. We’ll compute the joint using both forms for the sake of completeness and to show their equivalence in this example.

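We’ll start with $P(E|C)P(C)$. $P(E) = P(C) = \frac{1}{2}$ is clear under the assumption of a fair die; we need only consider the fraction of outcomes each event includes out of the whole sample space ($\frac{3}{6}$). $P(E|C) = \frac{1}{3}$ was computed in the conditional probability example above, so $P(E \cap C) = P(E|C)P(C) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$. For the other form, $2$ is the only prime among the even values $\{2,4,6\}$, so $P(C|E) = \frac{1}{3}$ and $P(C|E)P(E) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$. Both agree with counting directly: $E \cap C = \{2\}$ is one of six equally likely outcomes.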
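Both factorizations of the joint probability for this example can be evaluated side by side. This is a minimal sketch assuming a fair die, with illustrative helper names.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))


def prob(event):
    """Probability under an assumed fair die."""
    return Fraction(len(event), len(omega))


def cond_prob(a, b):
    """P(a | b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)


E = frozenset({2, 4, 6})  # even value
C = frozenset({2, 3, 5})  # prime value

joint_direct = prob(E & C)               # count outcomes in the intersection
joint_via_C = cond_prob(E, C) * prob(C)  # P(E|C) P(C)
joint_via_E = cond_prob(C, E) * prob(E)  # P(C|E) P(E)

print(joint_direct, joint_via_C, joint_via_E)  # 1/6 1/6 1/6
```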

# Independence

Two events $A$ and $B$ are said to be independent if

$P(A \cap B) = P(A)P(B)$

That is, the probability of both events occurring is exactly the product of their individual probabilities. Comparing this with the standard definition of the joint probability between two events, we see this implies that $P(A|B) = P(A)$ and $P(B|A) = P(B)$ (whenever the conditioning event has nonzero probability). Intuitively, this just says that the likelihood of event $A$’s occurrence is not impacted by $B$’s occurrence, and vice versa. Independence is typically denoted $A \perp B$.

It’s important to note the difference between independent events and disjoint events (also called mutually exclusive events). If events $A$ and $B$ are disjoint, we have that $P(A|B) = P(B|A) = 0$. That is, when one event is given to have occurred, the conditional probability of the other event must be zero, since the events cannot occur together. This of course implies that $P(A \text{ and } B) = 0$. This differs from independent events, where the occurrence of one event does not affect the probability of the other. That is, $P(A|B) = P(A)$, since knowing that $B$ has occurred tells us nothing about whether or not $A$ has occurred. As a result, $P(A \text{ and } B) = P(A)P(B)$ by the formula above. In particular, two disjoint events that each have nonzero probability can never be independent.
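The contrast between independent and disjoint events shows up clearly on the die. In the sketch below (fair die assumed), the event `D` of rolling a 1 or 2 is a hypothetical event introduced only for illustration, as are the helper names.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))


def prob(event):
    """Probability under an assumed fair die."""
    return Fraction(len(event), len(omega))


def independent(a, b):
    """True when P(a and b) = P(a) P(b)."""
    return prob(a & b) == prob(a) * prob(b)


def disjoint(a, b):
    """True when the events share no outcomes."""
    return not (a & b)


E = frozenset({2, 4, 6})  # even value
D = frozenset({1, 2})     # hypothetical event: roll a 1 or a 2
O = frozenset({1, 3, 5})  # odd value

# E and D overlap in {2}, and 1/6 = 1/2 * 1/3, so they are independent.
print(independent(E, D), disjoint(E, D))  # True False
# E and O cannot occur together, so they are disjoint but not independent.
print(independent(E, O), disjoint(E, O))  # False True
```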

# Marginal Probability

The marginal probability is perhaps the most familiar type of probability; it’s just the probability of a single event unconditioned on any others. However, we often don’t know the probability of a single event on its own, but instead have an array of joint or conditional probabilities related to the event. In this case, we can utilize some of the basic properties of probability to compute $P(A)$ by marginalizing over some other event $B$. For example, we can write an event $A$ in the following way

$A = (A \cap B) \cup (A \cap B^C)$

since $B$ and its complement $B^C$ together cover the entire sample space. Because $B$ and $B^C$ are also disjoint, the two events being unioned are disjoint, and thus by axiom 3 we can write $P(A)$ as

$P(A) = P(A \cap B) + P(A \cap B^C)$

$P(A) = P(A|B)P(B) + P(A|B^C)P(B^C)$

This $P(A)$ is called a marginal probability, constructed from conditional probabilities involving another event $B$. This process is called marginalization, where we marginalize over event $B$ and its complement to arrive at $P(A)$ (the result is sometimes also called the law of total probability). This can be generalized: when we have $n$ mutually exclusive events $B_i$ for $i = 1, \dots, n$ whose union is the entire sample space, we can write $A$ and its probability as

$A = (A \cap B_1) \cup \cdots \cup (A \cap B_n)$

$P(A) = \sum_{i=1}^n{P(A \cap B_i)} = \sum_{i=1}^n{P(A|B_i)P(B_i)}$

The intuition here is that the individual intersections give us little disjoint pieces of the event $A$, and the fact that all $B_i$ cover the entire sample space means we can cover all of $A$. Thus, from axiom 3 we can simply sum up all of these probabilities to arrive at $P(A)$. From the perspective of the conditional probabilities, each is “weighted” according to the probability of the event on which it is conditioned.
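Marginalization over a partition can be sketched as a weighted sum of conditional probabilities. The example assumes a fair die and a hypothetical three-piece partition of the sample space; the names `partition` and `total_probability` are illustrative.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))


def prob(event):
    """Probability under an assumed fair die."""
    return Fraction(len(event), len(omega))


def cond_prob(a, b):
    """P(a | b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)


def total_probability(a, partition):
    """P(a) as the sum over i of P(a | B_i) P(B_i) for a partition B_1..B_n."""
    return sum(cond_prob(a, b) * prob(b) for b in partition)


A = frozenset({2, 4, 6})  # even value

# Hypothetical partition: mutually exclusive pieces whose union is omega.
partition = [frozenset({1}), frozenset({2, 3}), frozenset({4, 5, 6})]

print([str(cond_prob(A, b)) for b in partition])  # ['0', '1/2', '2/3']
print(total_probability(A, partition))            # 1/2
print(total_probability(A, partition) == prob(A)) # True
```

Each conditional probability is weighted by the probability of its piece: $0 \cdot \frac{1}{6} + \frac{1}{2} \cdot \frac{1}{3} + \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{2}$.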

# Conclusion

Probability is an incredibly powerful tool for thinking about the world and evaluating uncertainty. This article barely scratches the surface of the field of probability theory and what can be done with it.

Thank you for reading this article. If you found this introductory guide helpful, please share! We hope the slightly more informal and visual take on the material proved useful in grasping the basic concepts. There are many great resources on probability theory that explore introductory material in a much more formal and complete manner. We encourage you to leave any feedback below, including typo fixes, incorrect notation, topic suggestions, etc.