Probability is a fundamental mathematical paradigm for reasoning about the likelihood of events. It is used across a wide variety of fields for evaluating uncertainty, making decisions, and more. Probability provides an interface to the real world, where we’re constantly faced with uncertainty. The aim of this article is to provide an introductory guide to probability and its notation, for readers ranging from complete beginners to experienced practitioners who need a review of the basics. This article intends to expand on the intuition of the basics and provide explicit examples where possible. Note: this article assumes basic familiarity with set theory.
- Events are colored for emphasis
- Experiment: a process with a well-defined set of possible outcomes that can be infinitely repeated. An experiment is deterministic if there’s only one possible outcome, and random otherwise.
- Outcome: the result of an experiment
- Sample space: the set of all possible outcomes of an experiment, commonly denoted $\Omega$ or $S$, sometimes referred to as the universe
- Event: a subset of an experiment’s possible outcomes (i.e. the sample space). An event occurs if the experiment’s outcome is an element of the event. Events are the entities to which probabilities are assigned, as they include any possible combination of outcomes about which we might want to reason.
- Elementary event: an event containing a single outcome. Also known as an atomic event, sample point, or singleton. This is the formalized notion of a single outcome in terms of events.
- Complementary event: the complement of some event $A$, denoted $A^c$, is the event defined by the set of outcomes in the difference between the universe and event $A$: $A^c = \Omega \setminus A$
- Trials: sequence of experiments that are all independent and identical with the same sample space
Here is a simple example to provide context for the terminology introduced above:
Consider the process of rolling a single six-sided die. This process is a random experiment with a known set of possible outcomes (the six possible values that can end up on the top face of the die). This set of possible results is called the experiment’s sample space. If we let an integer represent the number of dots (pips) present on the top face of the die, then this experiment’s sample space can be written explicitly as $\Omega = \{1, 2, 3, 4, 5, 6\}$. Perhaps we now want to reason about the probability of rolling an even value. This situation can be considered an event, call it $A$, that represents the set of outcomes $\{2, 4, 6\}$. Following the definition of events, we have $A \subseteq \Omega$, and $A$ is said to occur when one of the values $2$, $4$, or $6$ is rolled. The complementary event of $A$ is the event of rolling an odd number: $A^c = \{1, 3, 5\}$.
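The die example maps directly onto set operations. The snippet below is a minimal sketch; the names `sample_space`, `even`, `odd`, and `occurs` are illustrative choices, not notation from this article.

```python
# Sample space for one roll of a six-sided die.
sample_space = frozenset(range(1, 7))

# Event: the roll is even. An event is just a subset of the sample space.
even = frozenset(o for o in sample_space if o % 2 == 0)

# Complementary event: every outcome of the sample space that is not in `even`.
odd = sample_space - even


def occurs(event, outcome):
    """An event occurs when the observed outcome is one of its elements."""
    return outcome in event


print(sorted(even))                      # [2, 4, 6]
print(sorted(odd))                       # [1, 3, 5]
print(occurs(even, 4), occurs(even, 5))  # True False
```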
How exactly do we define and interpret the probability of an event? This question is much more involved than it may first seem (see probability interpretations, probability measure). For now we’ll begin with standard axioms that describe expected qualities and behaviors about probability that reinforce our intuition.
The probability of an event is a real number between $0$ and $1$:

$$0 \le P(A) \le 1 \quad \text{for all } A \in \mathcal{F}$$

where $\mathcal{F}$ is a set of events. This is a basic statement about the numerical values we assign to probabilities: $0$ indicates the event cannot occur, and $1$ indicates certainty in the event’s occurrence.
The probability of at least one outcome occurring is $1$. That is, the probability of the event that is the sample space occurring is $1$:

$$P(\Omega) = 1$$
Note: The sample space is an event just like any other collection of outcomes. Thus, the “sample space event” occurs when any outcome is observed, as the sample space includes every possible outcome.
If $A_1, A_2, \ldots$ are mutually disjoint events, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

This just says that when we combine disjoint events (i.e. take their union), the probability of the newly combined event is the sum of the individual events’ probabilities. Here disjoint events refers to events that do not share any outcomes, and thus cannot occur jointly.
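The three axioms can be checked by brute force on a small example. The sketch below assumes a fair six-sided die, so the probability of an event is the fraction of outcomes it contains; `prob` and `all_events` are illustrative names, and the third axiom is only checked for pairs of disjoint events.

```python
from fractions import Fraction
from itertools import combinations

sample_space = frozenset(range(1, 7))


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


# Every subset of the sample space is an event (2**6 = 64 of them).
all_events = [
    frozenset(c)
    for r in range(len(sample_space) + 1)
    for c in combinations(sorted(sample_space), r)
]

# Axiom 1: every probability lies between 0 and 1.
axiom1 = all(0 <= prob(e) <= 1 for e in all_events)

# Axiom 2: the sample space itself has probability 1.
axiom2 = prob(sample_space) == 1

# Axiom 3: probabilities of disjoint events add up under union.
axiom3 = all(
    prob(a | b) == prob(a) + prob(b)
    for a in all_events
    for b in all_events
    if not (a & b)
)

print(axiom1, axiom2, axiom3)  # True True True
```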
Consequences of the Axioms
The axioms of a system represent the fundamental building blocks from which every other theorem or statement can be derived. The following statements represent commonly used rules that can be obtained using only the three statements above. As with the axioms, these rules align well with our intuition.
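Probability of the complement:

$$P(A^c) = 1 - P(A)$$

where $A^c$ is the complement of event $A$. This rule follows from the mutual exclusivity of events $A$ and $A^c$ (by the definition of complementary events) and, as a result, from axiom 3: since $A \cup A^c = \Omega$, we have $P(A) + P(A^c) = P(\Omega) = 1$.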
Monotonicity:

$$\text{if } A \subseteq B \text{ then } P(A) \le P(B)$$

This statement makes intuitive sense: when one event $A$ contains only a subset of the outcomes of another event $B$, then we expect that $A$ must be no more likely to occur than $B$.
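Both the complement rule, $P(A^c) = 1 - P(A)$, and the subset rule (monotonicity) can be confirmed exhaustively for a fair die. This is a small sketch under that assumption; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

sample_space = frozenset(range(1, 7))


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


all_events = [
    frozenset(c)
    for r in range(len(sample_space) + 1)
    for c in combinations(sorted(sample_space), r)
]

# Complement rule: P(A^c) = 1 - P(A) for every event A.
complement_rule = all(prob(sample_space - a) == 1 - prob(a) for a in all_events)

# Monotonicity: whenever A is a subset of B, P(A) <= P(B).
monotonicity = all(
    prob(a) <= prob(b)
    for a in all_events
    for b in all_events
    if a <= b  # `<=` on frozensets tests the subset relation
)

print(complement_rule, monotonicity)  # True True
```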
Probability of the union of events:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
This says that the probability of the union of events $A$ and $B$ is the probability of $A$ plus the probability of $B$, minus their joint probability of occurring. Why does this make sense, and how do we get here from just the axioms?
Firstly, we know that $P(A \cup B) = P(A) + P(B)$ when events $A$ and $B$ are disjoint, following directly from axiom 3. But when $A$ and $B$ aren’t disjoint, they necessarily share some outcomes. That is, $A \cap B \neq \emptyset$. Visually, overlapping events might look as follows:

[Figure: two overlapping events $A$ and $B$]

where the colored mass in this diagram represents the entirety of $A \cup B$. Since the events share some outcomes, the sum $P(A) + P(B)$ will count the probability of overlapping outcomes twice, once for each of the events’ probabilities. This can be seen visually:

[Figure: the sum $P(A) + P(B)$, with the overlap counted twice]

Here we can see the dark green slice representing $A \cap B$ is counted twice in the final sum. In order to resolve this, we must subtract off one occurrence of the overlapping outcomes so that we only count them once in the final sum. Since the overlapping outcomes are given by the event $A \cap B$, we are left with the final equation $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. The following diagram puts everything together:

[Figure: $P(A) + P(B) - P(A \cap B)$ gives $P(A \cup B)$]

We’ve arrived at the justification for this equation mainly by intuition. If this approach doesn’t quite satisfy you, a slightly more formal proof can be seen here (by essentially writing out the visual steps above more explicitly). Note that the equation is often referred to as the sum rule, and can be generalized to any number of events as follows:

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$$
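The double counting is easy to see numerically. As a sketch, assume a fair die and take two overlapping events, "even" and "prime" (illustrative names); the naive sum overshoots by exactly the probability of the shared outcome.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))
even = frozenset({2, 4, 6})
prime = frozenset({2, 3, 5})


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


# Naive sum counts the shared outcome 2 twice.
naive = prob(even) + prob(prime)

# Subtracting the overlap once removes the double count.
corrected = naive - prob(even & prime)

# Probability of the union computed directly from the set.
direct = prob(even | prime)

print(naive, corrected, direct)  # 1 5/6 5/6
```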
Conditional probabilities allow us to reason about the likelihood of events given some prior information (i.e. the prior occurrence of another event). It is often the case that this prior information will restrict the sample space or impact our beliefs in some way, leading to an updated probability for the event we care about. We often write conditional probabilities using the vertical bar, such that $P(A \mid B)$ stands for “the conditional probability of $A$ given $B$”. The mathematical definition, valid whenever $P(B) > 0$, looks as follows:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
It’s worth expanding on the intuition behind this definition. In the numerator we see $P(A \cap B)$, which is the probability that both events $A$ and $B$ occur. In the denominator we have $P(B)$, which of course is just the probability of event $B$ occurring. By dividing these two values, we are effectively redefining the universe to the outcomes that make up event $B$. That is, since we know $B$ has occurred, the possible outcomes it contains define a reduced sample space within which to consider event $A$. We can interpret this sample space reduction visually as follows:
On the left, we see two overlapping events $A$ and $B$ defined in a standard sample space $\Omega$. Here the fraction of colored area to the area of the black outer rectangle represents the probability $P(A \cap B)$. The reduction of the sample space when thinking about $P(A \mid B)$ can be seen by the transformation to the right diagram; knowing about $B$’s occurrence allows us to define it as the new sample space, seen by the darker black border. The colored area given by $A \cap B$ represents the remaining sliver of $A$ that’s in the reduced sample space, and the conditional probability $P(A \mid B)$ is the fraction of colored area to area of the outer black border.
Conditional Probability Example
Here we extend the example above to better illustrate the idea of conditional probability and its formulation. Take the same instance as before, where we’ve rolled a die and wish to reason about the probability of having rolled an even value. We’ll call this event $A$, representing the set of outcomes $\{2, 4, 6\}$. Under the assumption of a fair die, it’s easy to see that the probability of $A$ is $\frac{1}{2}$ (all outcomes are equally likely, and our event represents three of the six possible outcomes). Now suppose that, before we’ve observed the outcome of the roll, we’ve been given the information that the value rolled is a prime number. That is, the event $B = \{2, 3, 5\}$ has occurred. What then can we say about the probability of the event $A$?
Here we’re asking about the conditional probability $P(A \mid B)$. Since event $B$ has occurred, its set of outcomes makes up a new sample space within which to consider the event $A$. That is, we no longer care about the original sample space now that we know $B$ has occurred, since it tells us the outcome must be one of the values $2$, $3$, or $5$. This gives us new insight into the likelihood of event $A$, and with this context we can reevaluate what it means to roll an even value. The number $2$ is the only even value among the three possible prime values $\{2, 3, 5\}$, and thus $P(A \mid B) = \frac{1}{3}$.
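The same calculation can be done two ways: with the ratio definition of conditional probability, or by counting inside the reduced sample space. The sketch below assumes a fair die; `cond_prob` is an illustrative helper, not an established API.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))
even = frozenset({2, 4, 6})
prime = frozenset({2, 3, 5})


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); undefined when P(b) is zero."""
    if prob(b) == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return prob(a & b) / prob(b)


# Ratio definition.
by_definition = cond_prob(even, prime)

# Counting inside the reduced sample space {2, 3, 5}.
by_counting = Fraction(len(even & prime), len(prime))

print(by_definition, by_counting)  # 1/3 1/3
```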
$P(\cdot \mid B)$ is a probability
An important observation is that conditional probabilities can be treated just like normal probabilities. The form $P(\cdot \mid B)$ satisfies all three axioms just like $P(\cdot)$:
The probability of an event $A$ given $B$ is a real number between $0$ and $1$:

$$0 \le P(A \mid B) \le 1$$

We can prove this by expanding the conditional probability:

$$0 \le \frac{P(A \cap B)}{P(B)} \le 1$$

The left side of the inequality is simple (no probability can be less than zero). The right side follows from the fact that $A \cap B \subseteq B$, and by monotonicity $P(A \cap B) \le P(B)$, implying the ratio can be at most one.
The probability of the sample space occurring is still one when conditioning on some event $B$; at least one outcome will still occur:

$$P(\Omega \mid B) = 1$$

Proving this is fairly straightforward. Expanding the conditional probability, we have

$$P(\Omega \mid B) = \frac{P(\Omega \cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1$$

following directly from the definition of the sample space, which implies $\Omega \cap B = B$ for any event $B$.
For mutually exclusive events $A_1, A_2, \ldots$ and some event $B$ being conditioned on,

$$P\left(\bigcup_{i} A_i \,\Big|\, B\right) = \sum_{i} P(A_i \mid B)$$
I’ll leave out the proof for how this statement matches the third axiom, but it’s fairly straightforward to convince yourself why it might hold given some expansion of the conditional probability.
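As a sanity check rather than a proof, the three properties can be verified exhaustively for a fair die when conditioning on the event of rolling a prime. The names below are illustrative, and additivity is only checked for pairs of disjoint events.

```python
from fractions import Fraction
from itertools import combinations

sample_space = frozenset(range(1, 7))
prime = frozenset({2, 3, 5})  # the event being conditioned on


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)


all_events = [
    frozenset(c)
    for r in range(len(sample_space) + 1)
    for c in combinations(sorted(sample_space), r)
]

# Conditional probabilities stay between 0 and 1.
bounded = all(0 <= cond_prob(a, prime) <= 1 for a in all_events)

# The sample space still has conditional probability 1.
total = cond_prob(sample_space, prime) == 1

# Conditional probabilities of disjoint events add up under union.
additive = all(
    cond_prob(a1 | a2, prime) == cond_prob(a1, prime) + cond_prob(a2, prime)
    for a1 in all_events
    for a2 in all_events
    if not (a1 & a2)
)

print(bounded, total, additive)  # True True True
```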
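Why is this important? Because $P(\cdot \mid B)$ satisfies the axioms, every rule derived from them carries over to conditional probabilities that share the same conditioning event. For example, the complement rule becomes $P(A^c \mid B) = 1 - P(A \mid B)$.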
Joint probability gives the likelihood of two (or more) events occurring. We can arrive at a definition by simply rearranging the conditional probability formula and solving for the intersection of two events:

$$P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)$$

where here we’ve used the fact that $P(A \cap B)$ appears in the definition of conditional probability for both $P(A \mid B)$ and $P(B \mid A)$: multiplying $P(A \mid B) = P(A \cap B) / P(B)$ by $P(B)$ gives the first form, and multiplying $P(B \mid A) = P(A \cap B) / P(A)$ by $P(A)$ gives the second. This definition makes intuitive sense: when considering the likelihood of both events $A$ and $B$ occurring, we can consider having found out about the events occurring in an ordered fashion.
We might first hear about event $B$’s occurrence, in which case we’d just want to know $P(B)$ unconditional on any other event. Upon finding out about event $A$’s occurrence, instead of finding $P(A)$, we must instead look for $P(A \mid B)$ since we already know about $B$’s occurrence. This is roughly equivalent to interpreting the joint probability as “the probability of event $B$ times the probability of $A$ occurring given that $B$ has occurred”. The same goes for the other way around. An important thing to note here is that we’re not claiming that one event actually occurs before another; for all we know they occur at exactly the same time (e.g. picking a playing card that is both a red suit and a king). Instead, we’re reasoning about the joint probability of multiple events as if we found out about each event’s occurrence in some chronological order, in which case it makes more intuitive sense to think about conditioning events on each other. Note also that under this interpretation it shouldn’t matter which event we decide to reason about first, since our observation does not necessarily reflect the order in which the actual events occurred. This indifference is reflected in the formula given above; the joint probability is the same, regardless of which event we use to condition.
This can be generalized to an arbitrary number of events using the chain rule.
Multiplication Principle (Chain Rule)
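This principle allows us to compute the probability of the intersection of an arbitrary number of events:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1})$$

This follows from applying the two-event formula repeatedly. Treating $A_1 \cap \cdots \cap A_{n-1}$ as a single event gives $P(A_1 \cap \cdots \cap A_n) = P(A_n \mid A_1 \cap \cdots \cap A_{n-1})\,P(A_1 \cap \cdots \cap A_{n-1})$, and expanding the remaining intersection in the same way until only $P(A_1)$ is left yields the product above.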
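The chain rule can be sketched as a loop that conditions each event on the intersection of all events handled so far. This assumes a fair die and that every conditioning intersection has nonzero probability; `chain_rule` and the third event `at_most_three` are illustrative additions.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))
even = frozenset({2, 4, 6})
prime = frozenset({2, 3, 5})
at_most_three = frozenset({1, 2, 3})  # hypothetical third event


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)


def chain_rule(events):
    """Multiply P(A1), P(A2 | A1), P(A3 | A1 and A2), and so on."""
    result = Fraction(1)
    seen = sample_space  # intersection of the events handled so far
    for event in events:
        # Conditioning on the whole sample space gives the plain P(A1).
        result *= cond_prob(event, seen)
        seen = seen & event
    return result


events = [even, prime, at_most_three]
via_chain = chain_rule(events)                  # 1/2 * 1/3 * 1
direct = prob(even & prime & at_most_three)     # P({2})

print(via_chain, direct)  # 1/6 1/6
```

The order of the events in the list changes the individual factors but not the product, mirroring the indifference to ordering discussed for two events.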
Joint Probability Example
Again extending the die example, we can further see how the joint probability might be applied. Here we’ll reuse the events introduced earlier: $A$ for rolling an even value, $B$ for rolling a prime value. Suppose now we want to reason about the probability of $A$ and $B$ occurring jointly. In this case we can just use the formula given above. Notice, however, that we need to decide on how we’re going to calculate the probabilities: we can use either form, $P(A \mid B)P(B)$ or $P(B \mid A)P(A)$. We’ll compute the joint using both forms for the sake of completeness and to show their equivalence in this example.
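We’ll start with $P(A \mid B)P(B)$. $P(B) = \frac{1}{2}$ is clear under the assumption of a fair die: we need only consider the fraction of outcomes each event includes out of the whole sample space ($\frac{3}{6}$). Since $2$ is the only even value among the primes $\{2, 3, 5\}$, we have $P(A \mid B) = \frac{1}{3}$, and so $P(A \mid B)P(B) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$. For the other form, $P(A) = \frac{1}{2}$ by the same counting argument, and $P(B \mid A) = \frac{1}{3}$ because $2$ is the only prime among the even values $\{2, 4, 6\}$, giving $P(B \mid A)P(A) = \frac{1}{6}$ as well. Both agree with counting directly: $A \cap B = \{2\}$ is one of six equally likely outcomes.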
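Both forms of the joint probability can be computed side by side and compared against a direct count of the intersection. A minimal sketch, assuming a fair die and illustrative helper names:

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))
even = frozenset({2, 4, 6})   # event A
prime = frozenset({2, 3, 5})  # event B


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)


# Form 1: P(A | B) * P(B)
form_1 = cond_prob(even, prime) * prob(prime)

# Form 2: P(B | A) * P(A)
form_2 = cond_prob(prime, even) * prob(even)

# Direct count of the intersection {2}.
direct = prob(even & prime)

print(form_1, form_2, direct)  # 1/6 1/6 1/6
```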
Two events $A$ and $B$ are said to be independent if

$$P(A \cap B) = P(A)P(B)$$
That is, the probability of both events occurring is exactly the product of their individual probabilities. Comparing this with the standard definition of the joint probability between two events, we see this implies that $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. Intuitively, this just says that the likelihood of event $A$’s occurrence is not impacted by $B$’s occurrence, and vice versa. Independence is typically denoted $A \perp B$.
It’s important to note the difference between independent events and disjoint events (also called mutually exclusive events). If events $A$ and $B$ are disjoint, we have that $P(A \mid B) = P(B \mid A) = 0$. That is, when one event is given to have occurred, the conditional probability of the other event must be zero, since the events cannot occur together. This of course implies that $P(A \cap B) = 0$. This differs from independent events, which are events whose occurrence does not affect the other’s. That is, $P(A \mid B) = P(A)$, since knowing that $B$ has occurred tells us nothing about whether or not $A$ has occurred. As a result, $P(A \cap B) = P(A)P(B)$ by the formula above.
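The contrast between independent and disjoint events shows up clearly on the die. In the sketch below (fair die assumed), `at_most_two` is a hypothetical event chosen because it happens to be independent of `even`; the other names are likewise illustrative.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))
even = frozenset({2, 4, 6})
odd = sample_space - even
prime = frozenset({2, 3, 5})
at_most_two = frozenset({1, 2})  # hypothetical event for illustration


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


def independent(a, b):
    """Product definition: P(a and b) == P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)


# P({2}) = 1/6 equals 1/2 * 1/3, so these are independent.
print(independent(even, at_most_two))  # True

# P({2}) = 1/6 differs from 1/2 * 1/2, so these are dependent.
print(independent(even, prime))        # False

# Disjoint events with nonzero probability are never independent:
# P(even and odd) = 0, but P(even) * P(odd) = 1/4.
print(independent(even, odd))          # False
```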
The marginal probability is perhaps the most familiar type of probability; it’s just the probability of a single event unconditioned on any others. However, we often don’t know the probability of a single event on its own, but instead have an array of joint or conditional probabilities related to the event. In this case, we can utilize some of the basic properties of probability to compute $P(A)$ by marginalizing over some other event $B$. For example, we can write an event $A$ in the following way

$$A = (A \cap B) \cup (A \cap B^c)$$

since $B$ and its complement $B^c$ together cover the whole sample space and are necessarily disjoint. This implies that the two events being unioned are disjoint, and thus we can write $P(A)$ as

$$P(A) = P(A \cap B) + P(A \cap B^c) = P(A \mid B)P(B) + P(A \mid B^c)P(B^c)$$
This $P(A)$ is called a marginal probability, constructed from conditional probabilities involving another event $B$. This process is called marginalization, where we marginalize over events $B$ and its complement to arrive at $P(A)$ (sometimes also called the law of total probability). This can be generalized: when we have $n$ mutually exclusive events $B_i$ for $i = 1, \ldots, n$ whose union is the entire sample space, we can write $A$ as

$$A = \bigcup_{i=1}^{n} (A \cap B_i)$$

and therefore

$$P(A) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i)P(B_i)$$

The intuition here is that the individual intersections give us little disjoint pieces of the event $A$, and the fact that all $B_i$ cover the entire sample space means we can cover all of $A$. Thus, from Axiom 3 we can simply sum up all of these probabilities to arrive at $P(A)$. From the perspective of the conditional probabilities, each is “weighted” according to the probability of the event on which it is conditioned.
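Marginalization can be sketched as a weighted sum over a partition of the sample space. The example assumes a fair die; the partition and the function name `total_probability` are illustrative choices, picked so the weights differ from one another.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))
prime = frozenset({2, 3, 5})  # the event A whose marginal we want

# A hypothetical partition: disjoint events whose union is the sample space.
partition = [frozenset({1}), frozenset({2, 3}), frozenset({4, 5, 6})]


def prob(event):
    """Fair-die assumption: favourable outcomes divided by total outcomes."""
    return Fraction(len(event), len(sample_space))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)


def total_probability(a, parts):
    """Sum of P(a | B_i) * P(B_i) over the partition."""
    return sum(cond_prob(a, b) * prob(b) for b in parts)


# Each term is a conditional probability weighted by P(B_i):
# 0 * 1/6  +  1 * 1/3  +  1/3 * 1/2
marginal = total_probability(prime, partition)

print(marginal, prob(prime))  # 1/2 1/2
```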
Probability is an incredibly powerful tool for thinking about the world and evaluating uncertainty. This article barely scratches the surface of the field of probability theory and what can be done with it.
Thank you for reading this article. If you found this introductory guide helpful, please share! We hope the slightly more informal and visual take on the material proved useful in grasping the basic concepts. There are many great resources on probability theory that explore introductory material in a much more formal and complete manner. We encourage you to leave any feedback below, including typo fixes, incorrect notation, topic suggestions, etc.