The Elementary Theory of Probability

Elementary Set Theory

We present some elementary set theory to facilitate our discussion of probability theory and, ultimately, Markov chains.

Informally, a set is a collection of objects, for example, the set of numbers between $1$ and $6$, written $\{1,2,3,4,5,6\}$. We can define the following operations on sets:

  1. The union of two sets $A$ and $B$, written $A \cup B$, is the set of elements in either set.
  2. The intersection of two sets $A$ and $B$, written $A \cap B$, is the set of elements common to both sets. Note that union and intersection distribute over each other: \begin{eqnarray*} \left(A \cup B\right) \cap C = \left(A \cap C\right) \cup \left(B \cap C\right)\\ \left(A \cap B\right) \cup C = \left(A \cup C\right) \cap \left(B \cup C\right) \end{eqnarray*} Note that this is different from the relationship between addition and multiplication, multiplication distributes over addition, that is, $\left(a + b\right)*c = a*c + b*c$ but addition doesn't distribution over multiplication, that is, $a*b + c \neq (a+c)*(b+c)$. There is symmetry in the operations of union and intersection that doesn't exist for addition and multiplication.
  3. The difference of two sets $A$ and $B$, written $A - B$ is the set of elements in $A$ which are not in $B$.
  4. The Cartesian product of two sets $S$ and $S'$, written $S \times S'$ is the set of pairs consisting of one element from $S$ and one element from $S'$. The pair consisting of $s\in S$ and $s'\in S'$ is written $(s,s')$ so that the Cartesian product of $S$ and $S'$ can be written as: \[ S \times S' = \left\{\left(s,s'\right): s \in S, s' \in S'\right\} \]
  5. For a (possibly infinite) sequence of sets $A_1, A_2, \ldots$, we can take the union, intersection or Cartesian product which we write as $\cup_i A_i$, $\cap_i A_i$ and $\prod_i A_i$ respectively.
  6. The number of elements in a set $S$ is written $\left|S\right|$.

If $S$ and $S'$ are finite, the number of elements of $S \times S'$ is the product of the number of elements in $S$ and the number of elements in $S'$: \[ \left| S \times S' \right| = \left| S \right| \left| S' \right| \] If $S_1 = S_2 = S$ then we sometimes write $S^2 = S \times S$ and we have $\left|S \times S\right| = \left|S^2\right| = \left|S\right|^2$. The size of the the number of pairs in a pairwise Cartesian product can be extended to arbitrary Cartesian products: \[ \left| \prod_{i=1}^n S_i\right| = \prod_{i=1}^n \left|S_i\right| \] If $S_1 = S_2 = \ldots = S_n = S$ then we similarly sometimes write $S^n = \prod_{i=1}^n S$ and we have $\left|\prod_{i=1}^n S\right| = \left|S^n\right| = \left|S\right|^n$.

Example: Pairs of Die Rolls

Consider the set of possible labels that come up on a roll of a standard $6$-sided die, $S = \left\{1,2,3,4,5,6\right\}$. The set of pairs of labels from $2$ rolls is given by: \[ S\times S = \left\{(1,1), (1,2), \ldots, (1,6), (2,1), (2,2), \ldots, (2,6), \ldots, (6,1), (6,2), \ldots, (6,6)\right\} \] The number of such pairs is $\left|S\times S\right| = \left|S\right|^2 = 6^2 = 36$.

Example: Many Die Rolls

Now consider $n$ rolls of a standard 6-sided die. We have $\left| \prod_{i=1}^n S \right| = \left| S \right|^n = 6^n$.

Elementary Probability Theory

Many of the intuitive elements of probability theory date back to antiquity among various cultures in analyzing games of chance as well as cryptography. A good place to mark the start of the mathematical theory of probability is with the correspondence between the mathematicians Pierre de Fermat and Blaise Pascal in 1654 which was codified in a book by Cardano in 1657. This correspondence began by analyzing whether one should bet even money on whether double 6's come up in 24 rolls of a pair of dice. The intuitive calculation can be performed as follows. There is a $1$ in $6$ chance that a $6$ occurs on a single roll of a single die. Therefore, there is a $1$ in $36$ chance that a roll of a pair of dice yields double 6's. The probability that double 6's are not rolled on a pair of dice is therefore $1 - 1/36 = 35/36$. Hence, the probability that double 6's are not rolled in 24 rolls of a pair of dice is $\left(\frac{35}{36}\right)^{24} \approx 50.9\%$. The mathematical theory of probability will allow us to go through the details of why this calculation is done this way.

A point mass function on a set $S = \{s_1,s_2,\ldots,s_n\}$ is a function $p:S\rightarrow [0,1]$, where we write $p_{s_i}$ for $p\left(s_i\right)$, that sums to $1$, that is, $\sum_i p_{s_i} = 1$. Given a point mass function, the probability distribution, $\mathbb{P}$ is the function on subsets of $S$ such that, for any subset $S' \subseteq S$: \[ \mathbb{P}\left(S'\right) = \sum_{s \in S'} p_s \] A probability distribution, $\mathbb{P}$, has the following properties:

  1. Positivity: For any $S'\subseteq S$, we have that $\mathbb{P}\left(S'\right)\geq 0$
  2. Additivity: If $S', S'' \subseteq S$ and $S' \cap S'' = \emptyset$, that is, $S'$ and $S''$ are disjoint sets, then $\mathbb{P}\left( S' \cup S''\right) = \mathbb{P}\left(S'\right) + \mathbb{P}\left(S''\right)$
  3. Total probability: $\mathbb{P}(S) = 1$

The additivity property above can be extended to any finite number of disjoint sets, that is, assume that $S_1, S_2, \ldots, S_n \subseteq S$ are disjoint, that is, $S_i \cap S_j = \emptyset$ for $i\neq j$. First, note that $\cup_{i=1}^{k-1} S_i$ and $S_k$ are disjoint for any $k\in\left\{2,\ldots,n\right\}$ since: \[ \left(\cup_{i=1}^{k-1} S_i\right) \cap S_k = \cup_{i=1}^{k-1} (S_i \cap S_k) = \cup_{i=1}^{k-1} \emptyset = \emptyset \] Now: \begin{eqnarray*} \mathbb{P}\left(\cup_{i=1}^n S_i\right) & = & \mathbb{P}\left(\cup_{i=1}^{n-1} S_i \cup S_n\right) = \mathbb{P}\left(\cup_{i=1}^{n-1} S_i\right) + \mathbb{P}\left(S_n\right)\\ & = & \mathbb{P}\left(\cup_{i=1}^{n-2} S_i \cup S_{n-1}\right) + \mathbb{P}\left(S_n\right) = \mathbb{P}\left(\cup_{i=1}^{n-2} S_i\right) + \mathbb{P}\left(S_{n-1}\right) + \mathbb{P}\left(S_n\right) = \ldots = \sum_i \mathbb{P}\left(S_i\right) \end{eqnarray*} Using this, it can be seen that if $\mathbb{P}$ is a function on the subsets of a finite set $S$ which has the properties above then the function $p_s' = \mathbb{P}\left(\left\{s\right\}\right)$ is the unique point mass function consistent with $\mathbb{P}$. Since, in this context, there is a one to one correspondence between point mass functions and probability distributions, we will sometimes call the point mass function $p_s$ a probability distribution. In the more advanced theory of probability, the focus is on the probability distribution for reasons we'll see later. The underlying space $S$ is referred to as the probability space though we will need to add a bit more mathematical structure to this space for the advanced theory.

Example: A Single Die

Consider a toss of a single die with sides labeled $1$ through $6$. The underlying set $S = \{1,2,3,4,5,6\}$. Assuming the die is fair, the probability distribution assigns an equal weight of $\frac{1}{6}$ to each side coming up, that is, $p_i = \frac{1}{6}$ for $i\in S$. Hence, for any subset $S' \subseteq S$, we have $\mathbb{P}\left(S'\right) = \frac{|S'|}{6}$.

Example: A Pair of Dice

We now consider the probability of rolling double 6's on a pair of dice. The underlying set $S = \left\{1,2,3,4,5,6\right\}^2$. We assume that the probability mass function $p_{(i,j)} = p_i * p_j$ where $p_i$ is the probability of rolling $i$ on a single die. We discuss this assumption in greater detail in a subsequent section. Hence, $p_{(6,6)} = p_6 p_6 = \frac{1}{6}^2 = \frac{1}{36}$.

Example: 24 Rolls of a Pair of Dice

Let us now consider the probability of rolling at least one pair of double 6's in 24 rolls of a pair of dice. The underlying set $S = \left(\left\{1,2,3,4,5,6\right\}^2\right)^{24}$. The last example of the previous section was predicated on the assumption that $p_{(i,j)} = p_i * p_j$. We make the equivalent assumption for this example, which is: \begin{eqnarray}\label{psindependence} p_{\left(s_{1,1},s_{1,2}\right), \left(s_{2,1}, s_{2,2}\right), \ldots, \left(s_{24,1},s_{24,2}\right)} = \prod_{j=1}^{24} p_{\left(s_{j,1},s_{j,2}\right)} \end{eqnarray} We'll discuss this assumption in detail later. Now, to calculate the probability: \begin{eqnarray*} \lefteqn{\mathbb{P}\left( \left\{ s \in \left(\left\{1,2,3,4,5,6\right\}^2\right)^{24} : s_{j,1} = 6 \mbox{ and } s_{j,2} = 6 \mbox{ for some } j\right\} \right)}\\ & = & \mathbb{P}\left( S - \left\{ s \in \left(\left\{1,2,3,4,5,6\right\}^2\right)^{24} : s_{j,1} \neq 6 \mbox{ or } s_{j,2} \neq 6 \mbox{ for all } j\right\} \right)\\ & = & \mathbb{P}\left( S \right) - \mathbb{P}\left(\left\{ s \in \left(\left\{1,2,3,4,5,6\right\}^2\right)^{24} : s_{j,1} \neq 6 \mbox{ or } s_{j,2} \neq 6 \mbox{ for all } j\right\} \right)\\ & = & 1 - \prod_{j=1}^{24}\mathbb{P}\left(\left\{ s \in \left(\left\{1,2,3,4,5,6\right\}^2\right)^{24} : s_{j,1} \neq 6 \mbox{ or } s_{j,2} \neq 6\right\} \right)\\ & = & 1 - \prod_{j=1}^{24}\left(1 - \mathbb{P}\left(\left\{s \in \left(\left\{1,2,3,4,5,6\right\}^2\right)^{24} : s_{j,1} = 6 \mbox{ and } s_{j,2} = 6\right\} \right)\right) = 1 - \left(1 - \left(\frac{1}{6}\right)^2\right)^{24} = 1 - \left(\frac{35}{36}\right)^{24} \end{eqnarray*} where $s_{j,i}$ denotes the ith die of the pair of dice in the jth roll.

Random Variables

In the last example of the previous section, the manipulation of the probability space became unwieldy. In this seciton we introduce the notion of a random variable which will make such calculations more intuitive and less unwieldy. A random variable is a function, $X$ from a probability space $S$ to the real numbers, that is, $X:S\rightarrow\mathbb{R}$. In more advanced theory an additional technical condition, called measurability, is added to this definition but this will not concern us in this course. In some treatments of elementary probability theory, joint distributions are introduced in place of random variables. However, using a probability space and random variables allows us to talk about a potentially infinite number of random variables that can be closely related to each other as we will see as the course goes on.

We will use capital roman letters such as $X, Y, \ldots$ to denote random variables as is standard. Random variables allow us to more easily specify sets from the probability space. In particular, instead of writing $\left\{ s : X(s) = x \right\}$, we will write $X = x$. As a further simplification of notation, we sometimes write $\mathbb{P}\left(A\cap B\right)$ as $\mathbb{P}\left(A,B\right)$ so that $\mathbb{P}\left(\left(X=x\right)\cap\left(Y=y\right)\right)$ can be written as $\mathbb{P}\left(X=x,Y=y\right)$. In order to demonstrate these notations, we reiterate the last example of the previous section using random variables.

Example: 24 Rolls of a Pair of Dice

Let $X_{j,i}$ denote the random variable corresponding ith die of the pair of dice in the jth roll of the 24 rolls of pairs of dice. In the notation from the previous example, $X_{j,i}(s) = s_{j,i}$. We will assume that: \begin{eqnarray}\label{rvindependence} \mathbb{P}\left( X_{j,i} = x, X_{j',i'} = x' \right) = \mathbb{P}\left( X_{j,i} = x\right) \mathbb{P}\left( X_{j',i'} = x' \right) \end{eqnarray} if $i,j$ and $i',j'$ are distinct, that is, either $i\neq i'$ or $j\neq j'$. We will discuss this assumption in greater detail in the next section. We have: \begin{eqnarray*} \lefteqn{\mathbb{P}\left( X_{j,1} = 6 \mbox{ and } X_{j,2} = 6 \mbox{ for some $j$ } \right)}\\ & = & 1 - \mathbb{P}\left( X_{j,1} \neq 6 \mbox{ or } X_{j,2} \neq 6 \mbox{ for all $j$ } \right)\\ & = & 1 - \prod_{j=1}^{24} \mathbb{P}\left( X_{j,1} \neq 6 \mbox{ or } X_{j,2} \neq 6 \right)\\ & = & 1 - \prod_{j=1}^{24} \left( 1 - \mathbb{P}\left( X_{j,1} = 6 \mbox{ and } X_{j,2} = 6 \right)\right)\\ & = & 1 - \prod_{j=1}^{24} \left( 1 - \mathbb{P}\left( X_{j,1} = 6 \right) \mathbb{P}\left( X_{j,2} = 6 \right)\right)\\ & = & 1 - \prod_{j=1}^{24} \left( 1 - \left(\frac{1}{6}\right)^2 \right) = 1 - \prod_{j=1}^{24} \frac{35}{36} = 1 - \left(\frac{35}{36}\right)^{24} \end{eqnarray*}

Conditional Probability and Independence

It is often important to know what the probability of some set $A$ given that we have observed that some set $B$ has occured. Using the notation of the previous example, we might wish to know the probability that $X_{1,1} = 6$ given that we know that $X_{1,1}$ is even. The is referred to as conditional probability, written $\mathbb{P}\left(\left.A\right|B\right)$, and is defined as: \begin{eqnarray}\label{condprob} \mathbb{P}\left(\left.A\right|B\right) = \frac{\mathbb{P}\left(A \cap B\right)}{\mathbb{P}\left(B\right)} \end{eqnarray} whenever $\mathbb{P}\left(B\right)>0$. In the example given above, the probability that $X_{1,1} = 6$ given that $X_{1,1}$ is even could be written as $\mathbb{P}\left(\left.X_{1,1}=6\right|X_{1,1} = 0 \mbox{ mod $2$}\right)$. Note that for sets $A_1, A_2, \ldots, A_n$, conditional probabilities have a telescoping property: \begin{eqnarray*} \mathbb{P}\left(A_1, A_2, \ldots, A_n\right) & = & \mathbb{P}\left(A_1, A_2, \ldots, A_{n-1}\right) \frac{\mathbb{P}\left(A_1, A_2, \ldots, A_n\right)}{\mathbb{P}\left(A_1, A_2, \ldots, A_{n-1}\right)}\\ & = & \mathbb{P}\left(A_1, A_2, \ldots, A_{n-1}\right) \mathbb{P}\left(\left.A_n\right|A_1, A_2, \ldots, A_{n-1}\right)\\ & = & \mathbb{P}\left(A_1, A_2, \ldots, A_{n-2}\right) \mathbb{P}\left(\left.A_{n-1}\right|A_1, A_2, \ldots, A_{n-2}\right)\mathbb{P}\left(\left.A_n\right|A_1, A_2, \ldots, A_{n-1}\right)\\\\ & = & \ldots = \mathbb{P}\left(A_1\right)\prod_{i=2}\mathbb{P}\left(\left.A_i\right|A_1,\ldots,A_{i-1}\right) \end{eqnarray*}

We now discuss the assumptions we made in (\ref{psindependence}) and (\ref{rvindependence}). When one sequentially rolls a die twice, one feels that the die rolls should have nothing to do with each other, that is, the fact that the first die roll comes out as say $6$ should have no bearing on the probability that the second die roll comes out as $6$. Another way of stating this is that the conditional probability that the second die roll equals $6$ given that the first die roll equals $6$ is the same as the probability that the second die roll equals $6$: \begin{eqnarray*} \mathbb{P}\left(\left.X_{1,2} = 6\right|X_{1,1} = 6\right) = \mathbb{P}\left(X_{1,2} = 6\right) \end{eqnarray*} In probability theory, we say that the die rolls are independent. Sets $A$ and $B$ are independent if: \begin{eqnarray}\label{independentcond} \mathbb{P}\left(\left.A\right|B\right) = \mathbb{P}\left(A\right) \end{eqnarray} which, given the definition of conditional probability, (\ref{condprob}), can be rewritten as: \begin{eqnarray*} \frac{\mathbb{P}\left(A \cap B\right)}{\mathbb{P}\left(B\right)} = \mathbb{P}\left(A\right)\\ \end{eqnarray*} or, if $\mathbb{P}\left(B\right) \neq 0$: \begin{eqnarray}\label{independent} \mathbb{P}\left(A \cap B\right) = \mathbb{P}\left( A \right) \mathbb{P}\left( B \right) \end{eqnarray} In fact, (\ref{independent}) is typically used as the definition of independence. It only differs from (\ref{independentcond}) when $\mathbb{P}\left(B\right) = 0$. Using (\ref{independent}), $A$ would always be independent of $B$ when $\mathbb{P}\left(B\right)=0$ whereas using (\ref{independentcond}), the independence of $A$ and $B$ would be undefined.

Independence can be extended to random variables. We say that random variables $X$ and $Y$ are independent if $X=x$ and $Y=y$ are independent for each $x$ and $y$. Note that these definitions need to be modified in more advanced probability theory.

Example: Independence Without Uniformity

In all of the examples we have given so far, which have involved fair dice or coin flips, each of the random variables had the same probability of each possible outcome. In a fair die roll, the probability of each of the $6$ outcomes is $\frac{1}{6}$. We now give an example of independent random variables which do not have this property. Consider random variables $X$ and $Y$ which each take values in $\{0,1\}$ governed by the point mass function given by the following table:


From the table we can see that: \begin{eqnarray*} \mathbb{P}\left( X = 0, Y = 0 \right) = \frac{3}{15} = \frac{9}{15} \times \frac{5}{15} = \mathbb{P}\left( X = 0 \right) \mathbb{P}\left( Y = 0 \right) \end{eqnarray*} so that $X=0$ and $Y=0$ are independent. Similarly, all the other outcomes for the random variables $X$ and $Y$ can be shown to be independent so that the random variables $X$ and $Y$ are themselves independent.

Example: Pairwise Independence Does Not Imply Triplet-wise

We have defined independence in terms of pairs of sets or random variables. However, if sets $A$, $B$ and $C$ are each pairwise independent, that does not imply that combinations of these sets are independent. Consider the example of $3$ random variables, $X$, $Y$ and $Z$, each of which takes values in $\{0,1\}$ governed the point mass function given by the following table:


The sum of each row of this table is $\frac{1}{4}$. Therefore: \begin{eqnarray*} \mathbb{P}\left( X = x, Y = y \right) = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = \mathbb{P}\left( X = x \right)\mathbb{P}\left( Y = y \right) \end{eqnarray*} so that $X$ and $Y$ are independent. The corresponding sums for $X$ and $Z$ and for $Y$ and $Z$ also total $\frac{1}{4}$ so that $X$ and $Z$ are independent and $Y$ and $Z$ are independent. However: \begin{eqnarray}\label{nonnway} \mathbb{P}\left( X = 0, Y = 0, Z = 0 \right) = \frac{3}{16} \neq \frac{1}{8} = \frac{1}{4} \times \frac{1}{2} = \mathbb{P}\left( X = 0, Y = 0 \right) \mathbb{P}\left( Z = 0 \right) \end{eqnarray}

$n$-way Independence

The previous example demonstrates that pairwise independence of $n$ random variables doesn't imply that sets of outcomes from a subset of the random variables are independent of sets of outcomes from the remaining variables. In order to ensure this, we need the variables to be $n$-way independent. A set of random variables $X_1, X_2, \ldots, X_n$ are independent if for every set of outcomes of these variables, $x_1, x_2, \ldots, x_n$, we have: \begin{eqnarray} \mathbb{P}\left( X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \right) = \prod_{i=1}^n \mathbb{P}\left( X_i = x_i \right)\label{independence} \end{eqnarray} It can be seen that under this condition, situations like (\ref{nonnway}) would be impossible.

$n$-way independence of sets doesn't imply $2$-way independence

We consider the equivalent of (\ref{independence}) for sets $A_1, A_2, \ldots, A_n$: \begin{eqnarray*} \mathbb{P}\left(A_1, A_2, \ldots, A_n\right) = \prod_{i=1}^n \mathbb{P}\left(A_i\right) \end{eqnarray*} We consider an example from Shiryaev of rolling a pair of dice. Let $X_1$ be the outcome of the first die and $X_2$ the second. We consider the following events: \begin{eqnarray*} A = \left\{X_1\in\left\{1,2,5\right\}\right\}\\ B = \left\{X_1\in\left\{4,5,6\right\}\right\}\\ C = \left\{X_1 + X_2 = 9\right\}\\ \end{eqnarray*} The probabilities for this example are given by:

$\not\in C$$\in C$
$\not\in A$$\not\in B$$\frac{5}{36}$$\frac{1}{36}$
$\not\in A$$\in B$$\frac{5}{18}$$\frac{1}{18}$
$\in A$$\not\in B$$\frac{1}{3}$$0$
$\in A$$\in B$$\frac{5}{36}$$\frac{1}{36}$

From this, we can see that: \begin{eqnarray*} \mathbb{P}\left(A, B, C\right) = \frac{1}{36} = \frac{1}{2} \frac{1}{2} \frac{1}{9} = \mathbb{P}\left(A\right) \mathbb{P}\left(B\right) \mathbb{P}\left(C\right) \end{eqnarray*} However: \begin{eqnarray*} \mathbb{P}\left(A, B\right) = \frac{1}{6} \neq \frac{1}{4} = \frac{1}{2} \frac{1}{2} = \mathbb{P}\left(A\right) \mathbb{P}\left(B\right) \end{eqnarray*} In fact, we also have: \begin{eqnarray*} \mathbb{P}\left(A, C\right) \neq \mathbb{P}\left(A\right) \mathbb{P}\left(C\right)\\ \mathbb{P}\left(B, C\right) \neq \mathbb{P}\left(B\right) \mathbb{P}\left(C\right)\\ \end{eqnarray*} However, if we assume that independence holds not only for the sets $A,B,C$ but also for any combinations of the complements of these sets, $S-A, S-B, S-C$, and the sets themselves, that is: \begin{eqnarray*} \mathbb{P}\left(S - A, S - B, S - C\right) & = & \mathbb{P}\left(S - A\right)\mathbb{P}\left(S - B\right)\mathbb{P}\left(S - C\right)\\ \mathbb{P}\left(S - A, S - B, C\right) & = & \mathbb{P}\left(S - A\right)\mathbb{P}\left(S - B\right)\mathbb{P}\left(C\right)\\ \mathbb{P}\left(S - A, B, S - C\right) & = & \mathbb{P}\left(S - A\right)\mathbb{P}\left(B\right)\mathbb{P}\left(S - C\right)\\ \mathbb{P}\left(S - A, B, C\right) & = & \mathbb{P}\left(S - A\right)\mathbb{P}\left(B\right)\mathbb{P}\left(C\right)\\ \mathbb{P}\left(S - A, S - B, S - C\right) & = & \mathbb{P}\left(A\right)\mathbb{P}\left(S - B\right)\mathbb{P}\left(S - C\right)\\ \mathbb{P}\left(A, S - B, C\right) & = & \mathbb{P}\left(A\right)\mathbb{P}\left(S - B\right)\mathbb{P}\left(C\right)\\ \mathbb{P}\left(A, B, S - C\right) & = & \mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\mathbb{P}\left(S - C\right)\\ \mathbb{P}\left(A, B, C\right) & = & \mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\mathbb{P}\left(C\right)\\ \end{eqnarray*} If these were true, then, for example: \begin{eqnarray*} \mathbb{P}\left(A, B\right) & = & \mathbb{P}\left(A,B,S-C\right) + \mathbb{P}\left(A,B,C\right)\\ & = & \mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\mathbb{P}\left(S-C\right) + \mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\mathbb{P}\left(C\right)\\ & = & \mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\left(1 - \mathbb{P}\left(C\right)\right) + \mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\mathbb{P}\left(C\right)\\ & = & \mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\left(1 - \mathbb{P}\left(C\right) + \mathbb{P}\left(C\right)\right)\\ & = & \mathbb{P}\left(A\right)\mathbb{P}\left(B\right) \end{eqnarray*}

Conditional Independence

Sometimes, sets or random variables might not be independent but will be independent given some other random variable. We say that a set $A$ in conditionally independent of a set $B$ given a set $C$ if: \begin{eqnarray*} \mathbb{P}\left(\left. A \cap B \right| C \right) = \mathbb{P}\left(\left. A \right| C\right) \mathbb{P}\left(\left. B\right| C\right) \end{eqnarray*} Note that, if the probabilities are non-zero, this can also be stated as: \begin{eqnarray*} \mathbb{P}\left(\left.A\right|B \cap C\right) = \mathbb{P}\left(\left.A\right|C\right) \end{eqnarray*} The notion of conditional independence carries over to random variables and multiple random variables in the same way that basic independence does.

Determining Independence

We use (\ref{independent}) as the mathematical definition of independence. There is also an intuitive notion of independence: one feels that the result of one die roll or coin flip should not affect the probability of outcomes on other such rolls or flips. Whether a random variable is independent of another random variable in practice is not covered by the mathematical definition. One needs to argue from the physics of rolling dice or flipping coins in order to determine that random variables are truly independent. In some cases, one might believe one has independent replications of random variables $X$ and $Y$ which can be used to help determine if (\ref{independent}) holds. However, in practice, it is impossible to have certainty that two random variables are independent. For example, if one were to test whether every other sample from a pseudo-random number generator on a computer where independent of the other samples, almost every conceivable statistical test would indicate that they were independent. However, they are not truly independent. If one knows they rule, one can often determine the next sample completely from the previous sample. In fact, one could mathematically prove that a statistical test for independence can't be computed. However, this is not the topic of this course. The topic of this course is Markov chains, the earliest mathematical divergence from independence.

Identical Distribution and the Law of Large Numbers

The law of large numbers is a basic result of advanced probability theory which formalizes the intuition that the frequency of occurence of a certain event, say at die coming up as $6$, will eventually converge to the probability of that event. We formalize this by considering an infinite sequence of random variables, $X_1, X_2, \ldots$. We consider the frequency, that is, the percentage of times, that $X_i = x$ for some value $x$. We would like to show that this converges to the probability that $X_i = x$. To add meaning to this, we need that $\mathbb{P}\left(X_i =x\right) = \mathbb{P}\left(X_j = x\right)$ for all $i$ and $j$. When this occurs for every $x$, we say that $X_1, X_2, \ldots$ have identical distributions. However, the frequencies need not converge to the probabilities for random variables with identical distributions. For example, if $X_1 = X_2 = X_3 = \ldots$ then the frequency will be $1$ for whichever value happened to occur for $X_1$ and $0$ otherwise which need not equal the probability. If, however, the random variables are independent and identically distributed (typically written IID), then the frequency is guaranteed to converge to the probability.


  1. A book that provides a good introduction to probability theory which begins at an elementary level and covers some modern advanced material is:

    Venkatesh, Santosh S. The Theory of Probability Explorations and Applications. Cambridge University Press, 2013.

  2. A more advanced introduction is:

    Billingsley, Patrick. Probability and Measure. Wiley, 2012.

  3. Another advanced book is:

    Shiryayev, A. N. Probability. Springer-Verlag, 1984.