Probability Densities and Expected Values

Densities

All points have probability $0$

Consider a random variable which can take values in a continuum, such as a spinner in a board game. One expects that each possible angle at which the spinner could land, say between $0$ and $2\pi$, is equally likely. What is the probability of a particular value? Suppose it is some $p>0$. Recall that the probability of a finite union of disjoint sets is equal to the sum of the probabilities of the individual sets. Since the probability of a set is at least as much as the probability of any subset, we have: \begin{eqnarray*} \mathbb{P}\left([0,2\pi)\right) \geq \sum_{i=0}^{n-1} \mathbb{P}\left(\left\{\frac{2\pi i}{n}\right\}\right) = \sum_{i=0}^{n-1} p = n p \end{eqnarray*} for any $n$. If we choose $n = \left\lceil\frac{1}{p} + 1\right\rceil$, where $\lceil x\rceil$ is the smallest integer greater than or equal to $x$, we get that $\mathbb{P}\left([0,2\pi)\right) \geq \left\lceil\frac{1}{p} + 1\right\rceil p \geq \left(\frac{1}{p} + 1\right) p = 1 + p$, which is a contradiction since no probability can exceed $1$. Hence, it can't be true that $p>0$. The probability of each point is $0$. A similar analysis will hold for any probability distribution, not necessarily uniformly distributed, for which the probability

Probability densities as probability distributions

Let us now suppose that we have a function $f:\mathbb{R}\rightarrow [0,\infty)$ such that $\int_{-\infty}^\infty f(x) dx = 1$. For any interval $[a,b)$ with $a < b$, we define a probability distribution $\mathbb{P}$ by: \begin{eqnarray*} \mathbb{P}\left([a,b)\right) = \int_a^b f(x) dx \end{eqnarray*} Furthermore, we extend $\mathbb{P}$ to unions of disjoint intervals $[a_1,a_2), [a_2, a_3), \ldots, [a_{n-1},a_n)$ with $a_i < a_{i+1}$ via: \begin{eqnarray*} \mathbb{P}\left(\cup_{i=1}^{n-1} [a_i,a_{i+1})\right) = \sum_{i=1}^{n-1} \mathbb{P}\left([a_i,a_{i+1})\right) \end{eqnarray*} It can be seen that $\mathbb{P}$ as defined above satisfies the axioms for a probability distribution. The function $f$ described above is referred to as a probability density and, as with point mass functions, represents an important method to define a probability distribution. Note that, for a density function $f$ that is continuous at $x$, we have: \begin{eqnarray*} \mathbb{P}\left([x-\epsilon,x+\epsilon)\right) = \int_{x-\epsilon}^{x+\epsilon} f(t) dt \approx f(x) \left(x + \epsilon - \left( x - \epsilon\right) \right) = 2\epsilon f(x) \end{eqnarray*} Since $\left\{x\right\}$ is a subset of $[x-\epsilon,x+\epsilon)$ and $\epsilon$ can be chosen arbitrarily small, we have $\mathbb{P}\left(\left\{x\right\}\right) = 0$. All points have $0$ probability for a probability distribution defined by a probability density.

Example: Uniform density

For the example of the board game spinner given at the beginning of the section, the density is uniform, that is, the same at all points in some interval $[a,b]$ and $0$ elsewhere: \begin{eqnarray*} f(x) = \left\{\begin{array}{ll} c & \mbox{if $x\in[a,b]$}\\ 0 & \mbox{otherwise}\\ \end{array}\right. \end{eqnarray*} Since a density has to integrate to $1$, we have: \begin{eqnarray*} \int_{-\infty}^\infty f(x) dx = \int_a^b c dx = c(b - a) = 1 \end{eqnarray*} So that $c = \frac{1}{b-a}$.
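As a small illustration of the uniform density, the following Python sketch computes the probability that the spinner lands in a given arc. The function names `uniform_density` and `uniform_probability` are illustrative choices, not notation from the text, and the spinner is assumed to use $a = 0$ and $b = 2\pi$.

```python
import math

def uniform_density(x, a, b):
    """Uniform density on [a, b]: the constant c = 1 / (b - a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_probability(lo, hi, a, b):
    """P([lo, hi)) under the uniform density on [a, b].

    The integral of a constant is (length of overlap) * c, so we clip
    [lo, hi) to [a, b] and divide the remaining length by b - a.
    """
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)

# Assumed spinner: angles between 0 and 2*pi.
quarter_turn = uniform_probability(0.0, math.pi / 2, 0.0, 2 * math.pi)
print(quarter_turn)
```

An arc covering a quarter of the circle receives a quarter of the probability, regardless of where the arc starts.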

Example: Normal density

One of the most important probability distributions in all of probability theory is the normal distribution. The standard normal distribution, which has mean $0$ and standard deviation $1$, is defined by the density given by: \begin{eqnarray*} f(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right) \end{eqnarray*} As we will later present, the central limit theorem is a fundamental result which shows that sums of well-behaved IID random variables, when properly shifted and scaled, converge to a normal distribution.
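The definition of interval probabilities as integrals of a density can be checked numerically for the normal density above. The sketch below uses a simple midpoint rule; the helper names `normal_density` and `interval_probability`, the step count, and the truncation of the real line to $[-10, 10]$ are assumptions made for illustration only.

```python
import math

def normal_density(x):
    """The normal density from the text: exp(-x^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def interval_probability(f, a, b, steps=100_000):
    """Approximate P([a, b)) = integral of f over [a, b) with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

# The density should integrate to (approximately) 1; the tails beyond
# [-10, 10] are negligibly small, so we truncate there.
total = interval_probability(normal_density, -10.0, 10.0)

# Probability of landing within one unit of the origin.
one_unit = interval_probability(normal_density, -1.0, 1.0)
print(total, one_unit)
```

The second value should be close to $0.68$, the familiar probability of a normal random variable falling within one standard deviation of its mean.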

Consider a random variable $X$ which takes on values in a continuum. We define the cumulative distribution function of $X$ as follows: \begin{eqnarray*} F_X(x) = \mathbb{P}\left(X\leq x\right) \end{eqnarray*} If $F_X$ is differentiable, its derivative is the probability density of $X$, that is, $f_X(x) = \frac{dF_X(x)}{dx}$. Note that the distribution of every random variable can be defined by a cumulative distribution function, though there are some whose distribution can't be defined in terms of point mass functions and probability densities.
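The relationship $f_X(x) = \frac{dF_X(x)}{dx}$ can be illustrated numerically. The sketch below assumes the standard fact that the cumulative distribution function of the normal density $\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$ can be written with the error function as $\frac{1}{2}\left(1 + \mathrm{erf}\left(x/\sqrt{2}\right)\right)$, and compares a central finite difference of it against the density. The function names and the step size `h` are illustrative.

```python
import math

def normal_cdf(x):
    """CDF of the normal density exp(-x^2/2)/sqrt(2*pi), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_density(x):
    """The corresponding probability density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def numerical_derivative(F, x, h=1e-5):
    """Central finite-difference approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2.0 * h)

for x in (-2.0, -0.5, 0.0, 1.0, 2.5):
    print(x, numerical_derivative(normal_cdf, x), normal_density(x))
```

The two printed columns should agree to many decimal places at every point.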

Expected values

Let $\mathbb{P}$ be a probability distribution. Let $X$ be a random variable which takes values in a finite set $\left\{x_1, x_2, \ldots, x_n\right\}$. The average, or expected value of $X$, written $E\left[X\right]$, is defined as: \begin{eqnarray} E\left[X\right] = \sum_{i=1}^n \mathbb{P}\left(X=x_i\right) x_i\label{expectedvalue} \end{eqnarray}

We now consider random variables which take on an infinite number of values. Consider a random variable $X$ that takes on values $1, 2, 3, \ldots$ with probabilities given by: \begin{eqnarray*} \mathbb{P}\left(X = i\right) = \frac{6}{\pi^2} \frac{1}{i^2} \end{eqnarray*} It has been shown by the famous mathematician Euler that these probabilities sum to $1$. The expected value, as defined by (\ref{expectedvalue}) with the finite sum replaced by an infinite one, would be: \begin{eqnarray*} E\left[X\right] = \sum_i \mathbb{P}\left(X=i\right) i = \sum_i \frac{6}{\pi^2}\frac{i}{i^2} = \frac{6}{\pi^2} \sum_i \frac{1}{i} = \infty \end{eqnarray*} The expected value can be infinite. In fact, for a random variable that takes both positive and negative values, the contribution to the expected value from the left side, $X\leq 0$, could be $-\infty$, and the contribution from the right side, $X\geq 0$, could be $\infty$, in which case the expected value is $-\infty + \infty$, which is undefined.
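The divergence of this expected value can be observed numerically. The following sketch computes partial sums of both the probabilities and the expected value; the function name `partial_sums` and the chosen cutoffs are illustrative.

```python
import math

def partial_sums(n):
    """Return (sum of P(X = i), sum of i * P(X = i)) over i = 1, ..., n."""
    c = 6.0 / math.pi ** 2
    total_probability = sum(c / i ** 2 for i in range(1, n + 1))
    partial_expectation = sum(c / i for i in range(1, n + 1))
    return total_probability, partial_expectation

for n in (10, 1_000, 100_000):
    print(n, partial_sums(n))
```

The first component approaches $1$, while the second keeps growing, roughly like $\frac{6}{\pi^2}\ln n$, and never settles on a finite value.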

For a continuous random variable $X$, with density $f(x)$, we define its expected value as: \begin{eqnarray*} E\left[X\right] = \int_{-\infty}^\infty x f(x) dx \end{eqnarray*} Similarly to discrete random variables, continuous random variables can have infinite or undefined expectations.

Recall that a random variable is a function from a probability space to $\mathbb{R}$. For random variables $X$ and $Y$ and $\alpha\in\mathbb{R}$, we define addition and scalar multiplication as follows: \begin{eqnarray*} (X + Y)(s) = X(s) + Y(s)\\ \left(\alpha X\right)(s) = \alpha X(s) \end{eqnarray*} With these definitions, the set of random variables becomes a vector space. It can be shown that the expected value, for probability distributions defined by point mass functions or probability densities, is a linear function on the vector space of random variables whose expected values exist: \begin{eqnarray*} E\left[\alpha X + \beta Y\right] = \alpha E\left[X\right] + \beta E\left[Y\right] \end{eqnarray*}
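Linearity can be verified exactly on a small finite probability space. In the sketch below, the sample space (one roll of a fair die), the random variables `X` and `Y`, and the coefficients are all hypothetical choices made for illustration. Note that `X` and `Y` are not independent here, and linearity holds regardless.

```python
from fractions import Fraction

# Hypothetical probability space: one roll of a fair die.
sample_space = range(1, 7)
prob = {s: Fraction(1, 6) for s in sample_space}

def expectation(Z):
    """E[Z] = sum over outcomes s of P({s}) * Z(s)."""
    return sum(prob[s] * Z(s) for s in sample_space)

def X(s):
    return s          # the face value

def Y(s):
    return s % 2      # 1 if the face is odd, 0 otherwise

alpha, beta = 3, -2

# Expected value of the pointwise combination (alpha X + beta Y)(s).
lhs = expectation(lambda s: alpha * X(s) + beta * Y(s))
# The same combination of the individual expected values.
rhs = alpha * expectation(X) + beta * expectation(Y)
print(lhs, rhs)
```

Using `Fraction` keeps the arithmetic exact, so the two sides can be compared for equality rather than approximate agreement.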

Mean and location

The mean of a random variable $X$, typically written as $\mu$, is its expected value. If $X$ is a random variable with $0$ mean, then $X+\mu$ is a random variable with mean $\mu$: \begin{eqnarray*} E\left[X + \mu\right] = E\left[X\right] + \mu = \mu \end{eqnarray*} If $X$ has density $f$, then $X+\mu$ has density $f_\mu(x) = f(x - \mu)$, which shifts the mean by $\mu$: \begin{eqnarray*} \int_{-\infty}^\infty x f\left(x - \mu\right) dx = \int_{-\infty}^\infty (x' + \mu) f\left(x'\right) dx' = 0 + \mu = \mu \end{eqnarray*} where we have used the substitution $x' = x - \mu$ and the fact that $f$ integrates to $1$. Since the mean of a random variable $X$ need not exist, when we parameterize a probability density by a shift, that is, $f_\mu\left(x\right) = f\left(x - \mu\right)$, we sometimes refer to $\mu$ as the location parameter.

Expected value of functions of random variables

Let $g:\mathbb{R}\rightarrow\mathbb{R}$ be a function. Since a random variable $X$ is a function from a probability space to $\mathbb{R}$, the composition of $g$ with $X$, written $g(X)$, is also a random variable. We can take the expected value of $g(X)$, written $E\left[g(X)\right]$. An important case is the variance, described below.

Variance, standard deviation and scale

Since $E\left[X\right]$ is a real number, we can define a function $g(x) = \left(x - E\left[X\right]\right)^2$. The expected value of $g(X)$ is called the variance of $X$, typically written $\sigma^2$: \begin{eqnarray*} \sigma^2 = E\left[\left(X - E\left[X\right]\right)^2\right] \end{eqnarray*} Note that multiplying a random variable by a factor $\alpha$ causes its variance to scale quadratically: \begin{eqnarray*} E\left[\left(\alpha X - E\left[\alpha X\right]\right)^2\right] = E\left[\left(\alpha \left(X - E\left[X\right]\right)\right)^2\right] = \alpha^2 E\left[\left(X - E\left[X\right]\right)^2\right] \end{eqnarray*} The standard deviation, typically written as $\sigma$, is the square root of the variance. The standard deviation scales linearly for positive scales $\alpha>0$: \begin{eqnarray} \sqrt{E\left[\left(\alpha X - E\left[\alpha X\right]\right)^2\right]} = \sqrt{\alpha^2 E\left[\left(X - E\left[X\right]\right)^2\right]} = \alpha \sqrt{E\left[\left(X - E\left[X\right]\right)^2\right]}\label{sdscale} \end{eqnarray} Just as we can shift the mean by adding a constant, we can set the standard deviation by scaling a random variable. If $X$ is a random variable with standard deviation $1$ and $\alpha>0$, then $\alpha X$ has standard deviation $\alpha$ by (\ref{sdscale}). If $X$ has density $f$ with standard deviation $1$ and $\alpha>0$, then $\alpha X$ has density $f_\alpha(x) = \frac{1}{\alpha}f\left(\frac{x}{\alpha}\right)$.
This can be seen by using the cumulative distribution function: \begin{eqnarray*} F_\alpha(x) = \mathbb{P}\left( \alpha X \leq x \right) = \mathbb{P}\left( X \leq \frac{x}{\alpha} \right) = F\left(\frac{x}{\alpha}\right) \end{eqnarray*} and so: \begin{eqnarray*} f_\alpha(x) = \frac{d}{dx} F_\alpha(x) = \frac{d}{dx} F\left(\frac{x}{\alpha}\right) = F'\left(\frac{x}{\alpha}\right)\frac{1}{\alpha} = \frac{1}{\alpha} f\left(\frac{x}{\alpha}\right) \end{eqnarray*} Since the standard deviation could be infinite, we sometimes call the parameter $\alpha$ of $f_\alpha$ the scale parameter.
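The scaling formula $f_\alpha(x) = \frac{1}{\alpha}f\left(\frac{x}{\alpha}\right)$ can be checked numerically. The sketch below takes $f$ to be the normal density with standard deviation $1$ and verifies that $f_\alpha$ still integrates to $1$ and has standard deviation $\alpha$. The value $\alpha = 2.5$, the midpoint-rule helper `integrate`, and the truncation of the integrals to $[-12\alpha, 12\alpha]$ are assumptions chosen for illustration.

```python
import math

def f(x):
    """A density with mean 0 and standard deviation 1 (the normal density)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def scaled_density(x, alpha):
    """Density of alpha * X when X has density f: (1 / alpha) * f(x / alpha)."""
    return f(x / alpha) / alpha

def integrate(g, a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    width = (b - a) / steps
    return sum(g(a + (k + 0.5) * width) for k in range(steps)) * width

alpha = 2.5
lo, hi = -12.0 * alpha, 12.0 * alpha  # tails beyond this range are negligible

mass = integrate(lambda x: scaled_density(x, alpha), lo, hi)
# The mean is 0 by symmetry, so the variance is the integral of x^2 * density.
variance = integrate(lambda x: x * x * scaled_density(x, alpha), lo, hi)
std = math.sqrt(variance)
print(mass, std)
```

The factor $\frac{1}{\alpha}$ is what keeps the total mass equal to $1$ after the horizontal stretch; dropping it would give a mass of $\alpha$.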

Some Classical Limit Theorems

Some of the most remarkable results stemming from the prodigious development of probability theory by the Russian school in the first half of the twentieth century were what are sometimes referred to as the classical limit theorems. These include the law of large numbers, the central limit theorem and the law of the iterated logarithm. We present the first two here.

The Law of Large Numbers

If one makes independent tosses of a fair die or coin, one expects that the frequency of an outcome, say a $6$ coming up on a die toss, will converge to the probability of that outcome. Similarly, one would expect the average of the die rolls to converge to their expected value, in this case, $3\frac{1}{2}$. We define the indicator function for a set $S$ as: \begin{eqnarray*} \mathbb{1}_S(x) = \left\{\begin{array}{ll} 1 & \mbox{if $x\in S$}\\ 0 & \mbox{otherwise} \end{array}\right. \end{eqnarray*} Note that for a random variable $X$, the indicator $\mathbb{1}_S(X)$ is a random variable with values $0$ or $1$. Hence, its expected value is given by: \begin{eqnarray*} E\left[\mathbb{1}_S(X)\right] = \sum_{i=0}^1 \mathbb{P}\left(\mathbb{1}_S(X)=i\right) i = \mathbb{P}\left(\mathbb{1}_S(X)=0\right) 0 + \mathbb{P}\left(\mathbb{1}_S(X)=1\right) 1 = \mathbb{P}\left(X\in S\right) \end{eqnarray*} Hence, if we know that the averages converge to the expected values, we also know that the frequencies converge to the probabilities. This is exactly what the law of large numbers tells us:

If $X_1, X_2, \ldots$ are IID random variables such that $E\left[X_i\right]$ exists, then: \[ \lim_{n\rightarrow\infty} \frac{1}{n}\sum_{i=1}^n X_i = E{\left[X_i\right]} \] with probability $1$.

Note that the IID assumption breaks into two different assumptions: independence and identical distribution. There are versions of the law of large numbers in which the conclusion is the same but in which each of these assumptions can be relaxed somewhat.
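The law of large numbers is easy to observe in simulation. The following sketch rolls a simulated fair die and reports both the average of the rolls and the frequency of sixes. The function name, the seed, and the sample sizes are illustrative choices, and the pseudo-random generator is only a stand-in for true independent tosses.

```python
import random

def average_and_six_frequency(n, seed=0):
    """Roll a simulated fair die n times.

    Returns (average of the rolls, frequency of the outcome 6).
    The frequency is the average of the indicator of the set {6}.
    """
    rng = random.Random(seed)
    total = 0
    sixes = 0
    for _ in range(n):
        roll = rng.randint(1, 6)
        total += roll
        sixes += (roll == 6)
    return total / n, sixes / n

for n in (100, 10_000, 200_000):
    print(n, average_and_six_frequency(n))
```

As $n$ grows, the average should settle near $3\frac{1}{2}$ and the frequency near $\frac{1}{6}$.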

The Central Limit Theorem

The law of large numbers tells us that, for IID random variables $X_1, X_2, \ldots$: \begin{eqnarray*} \lim_{n\rightarrow\infty} \frac{1}{n}\sum_{i=1}^n \left( X_i - E\left[X_i\right] \right) = 0 \end{eqnarray*} What happens if we eliminate the $\frac{1}{n}$ factor? In this case, the process corresponding to the sum is called a random walk and is Markovian. Unless it is degenerate and $X_i$ is constant, the random walk will oscillate between arbitrarily large positive and arbitrarily large negative values. This Markov process has no stationary distribution. However, if we normalize by the factor $\frac{1}{\sqrt{n}}$, which shrinks more slowly than $\frac{1}{n}$, we get the central limit theorem, which tells us the distribution of this quantity:

If $X_1, X_2, \ldots$ are IID random variables with finite variance, then the normalized sum: \begin{eqnarray*} \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(X_i - E\left[X_i\right]\right) \end{eqnarray*} asymptotically has a normal probability density: \begin{eqnarray*} f_\sigma(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left( -\frac{x^2}{2\sigma^2} \right) \end{eqnarray*} where $\sigma$ is the common standard deviation of the $X_i$. In particular: \begin{eqnarray*} \lim_{n\rightarrow\infty} \mathbb{P}\left( a \leq \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(X_i - E\left[X_i\right]\right) \leq b\right) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}}\exp\left( -\frac{x^2}{2\sigma^2} \right) dx \end{eqnarray*}

The IID assumption can be relaxed in the central limit theorem in a similar manner as in the law of large numbers.
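The central limit theorem can also be illustrated by simulation. The sketch below assumes $X_i$ uniform on $[0,1)$, which has mean $\frac{1}{2}$ and standard deviation $\sigma = \sqrt{1/12}$, and compares the fraction of normalized sums falling in $[-\sigma, \sigma]$ with the corresponding normal integral, which equals $\mathrm{erf}\left(1/\sqrt{2}\right)$. The values of `n`, `trials`, and the seed are arbitrary illustrative choices.

```python
import math
import random

def normalized_sum(n, rng):
    """(1 / sqrt(n)) * sum of (X_i - E[X_i]) for X_i uniform on [0, 1)."""
    return sum(rng.random() - 0.5 for _ in range(n)) / math.sqrt(n)

rng = random.Random(0)
n, trials = 50, 20_000
samples = [normalized_sum(n, rng) for _ in range(trials)]

sigma = math.sqrt(1.0 / 12.0)  # standard deviation of a uniform [0, 1) variable

# Empirical standard deviation of the normalized sums (their mean is 0).
sample_std = math.sqrt(sum(s * s for s in samples) / trials)

# Empirical probability of [-sigma, sigma] versus the normal integral.
inside = sum(-sigma <= s <= sigma for s in samples) / trials
normal_limit = math.erf(1.0 / math.sqrt(2.0))
print(sample_std, sigma, inside, normal_limit)
```

Even for a moderate $n$, the empirical fraction should be close to the value predicted by the normal density $f_\sigma$.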

Jensen's Inequality

We now introduce Jensen's inequality, one of the most important results in applied probability theory:

For any convex function $f(x)$ and any random variable $X$, we have: \[ E{\left[f{\left(X\right)}\right]} \geq f{\left(E{\left[X\right]}\right)} \]
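Jensen's inequality can be verified exactly in a simple case. The sketch below assumes $X$ is the outcome of a fair die and takes the convex function $f(x) = x^2$; both choices, and the helper names, are illustrative.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

def expectation(g):
    """E[g(X)] for X the outcome of a fair die."""
    return sum(p * g(x) for x in faces)

def square(x):
    """A convex function."""
    return x * x

lhs = expectation(square)                  # E[f(X)]
rhs = square(expectation(lambda x: x))     # f(E[X])
print(lhs, rhs, lhs - rhs)
```

For this particular choice of $f$, the gap $E\left[X^2\right] - \left(E\left[X\right]\right)^2$ is exactly the variance of $X$, which gives another way to see that the inequality must hold for $f(x) = x^2$.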