These notes are based on Chapter 13 & 14 of *Jones, O., Maillardet, R., & Robinson, A. (2009). Introduction to scientific programming and simulation using R. CRC Press.*

Probability is an essential tool for developing and interpreting stochastic simulations, which are very useful for gaining insight about situations that are too complex to analyse theoretically.

Probability theory starts with axioms that provide the mathematical structure allowing us to model experiments with random outcomes and to quantify the likelihood of each result of those experiments.

In these notes, to minimise distractions, I try to avoid measure-theoretic notions, which are fundamental to rigorous probability theory. For our purposes, I believe the benefits of this omission far outweigh the costs. For example, technically speaking, not all subsets of the unit interval \([0,1]\) qualify as events, because the power set is too large to handle. The most common collection of events, which is somewhat smaller but still large enough to be useful, is called the Borel \(\sigma\)-algebra.

- Sample space : the set of all possible outcomes
- Sample point : an element of sample space
- Event : a subset of sample space

For example, imagine an experiment in which we roll a die. The sample space is \(\{1,2,3,4,5,6\}\). Any subset of these six numbers (including the empty set), such as \(\{2,4,6\}\) and \(\{1\}\), is an event. In this case, therefore, the collection of all events is the power set, which has size \(2^6=64\).

While this is a simple case, how we define sample space is often important. For example, if we roll a die twice (or roll two dice at once), we have two reasonable choices: ordered or unordered pairs. The choice probably depends on how we model problems of interest.

Notations:

- \(\Omega\) : sample space
- \(\omega\) : sample point
- \(A\), \(B\), … : events
- \(\emptyset\) : empty set
- \(|A|\) : number of elements in \(A\)

Basic operations on sets:

- \(A\setminus B = \{\omega\in\Omega : \omega \in A \;\text{ and }\; \omega \notin B\}\)
- \(\overline{A} = \Omega\setminus A\)
- \(A\cup B = \{\omega\in\Omega : \omega \in A \;\text{ or }\; \omega \in B\}\)
- \(A\cap B = \{\omega\in\Omega : \omega \in A \;\text{ and }\; \omega \in B\}\)
- \(\overline{A\cup B} = \overline{A} \cap \overline{B}\)
- \(\overline{A\cap B} = \overline{A} \cup \overline{B}\)

When it comes to simple set operations, a Venn diagram is our friend.

A probability function \(P\) assigns a value in \([0,1]\) to an event \(A\subseteq\Omega\). Intuitively, \(P(A)\) is the answer to the question of “How likely is event \(A\)?”

A probability function \(P\) must be constructed to satisfy the following three properties:

- \(0\leq P(A)\leq 1\) for \(A\subseteq\Omega\)
- \(P(\emptyset) = 0\) and \(P(\Omega) = 1\)
- For a collection of mutually exclusive events \(\{A_k\}\) for \(k\in\{1,2,\dots\}\), \[P\left(\bigcup_k A_k\right) = \sum_k P(A_k) .\]

Here, mutually exclusive means pairwise disjoint: \(A_i\cap A_j = \emptyset\) for all \(i\neq j\); in particular, \(\bigcap_k A_k = \emptyset\).

Suppose \(\Omega\) is a finite set of size \(|\Omega|=n\) and each element is *equally likely*. Then, \[ P(A) = \frac{|A|}{n} ,\] and we just need to count the elements of \(A\) to compute \(P(A)\).
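As a quick sketch in R, counting elements is all we need. The event here (rolling an even number on a fair die) is my own illustration, not from the text:

```
# Hypothetical example: P(even number) on a fair die, by counting
Omega <- 1:6                      # sample space
A <- c(2, 4, 6)                   # event "roll an even number"
p_A <- length(A) / length(Omega)  # P(A) = |A| / n
p_A                               # 0.5
```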

In contrast, in an experiment of rolling an unfair die with \(6\) twice as likely as the other faces, we have the following probability function:

| \(\omega\) | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) | \(6\) |
|---|---|---|---|---|---|---|
| \(P(\{\omega\})\) | 1/7 | 1/7 | 1/7 | 1/7 | 1/7 | 2/7 |

I roll a die. You predict and shout out “#1!”. Your chance of getting it right is \(1/6\). OK.

Next, I roll a die, and you shout out “#1!” again because you like #1. But this time, I tell you I have got an odd number. Now, you suddenly have a better chance, don’t you?

We can model this experiment as follows:

- \(\Omega = \{1,2,3,4,5,6\}\)
- \(A = \{1\}\)
- \(B = \{1,3,5\}\)

Your better chance is called a *conditional probability* and computed as follows: \[P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3} .\]

Note that we use \(A\cap B = A\) because \(A\subset B\).
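A small Monte Carlo sketch of this conditional probability (the simulation approach is my addition, not from the text): estimate \(P(A|B)\) as the proportion of odd rolls that are ones.

```
set.seed(1)                                   # seed chosen arbitrarily
rolls <- sample(1:6, 100000, replace = TRUE)  # many fair die rolls
B <- rolls %% 2 == 1                          # event B: odd number
A <- rolls == 1                               # event A: rolled a one
mean(A & B) / mean(B)                         # estimates P(A|B) = 1/3
```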

Perhaps, it makes more sense if we write \(P(A\cap B) = P(A|B)P(B)\). The probability that both \(A\) and \(B\) happen is equal to the probability that \(B\) happens first and, treating *\(B\) as new sample space*, \(A\) happens next.

In other words, provided \(P(B)>0\), we can restrict the sample space \(\Omega\) to \(B\) and use \(P(\cdot|B)\) as a legitimate probability function that takes a subset of \(B\) and returns a value in \([0,1]\).

Two events \(A\) and \(B\) are independent if \(P(A\cap B) = P(A)P(B)\). To interpret “independence”, just rearrange the condition as follows: \[\begin{alignat}{2} && P(A\cap B) &= P(A)P(B)\\ &\Leftrightarrow\quad& P(A|B)P(B)&=P(A)P(B)\\ &\Leftrightarrow\quad& P(A|B)&=P(A), \end{alignat}\] which we may read as “Restricting the sample space to \(B\) or knowing \(B\) happened first does not change the probability of \(A\).”

To get some intuition, think of rolling two dice, one in red and the other in blue. \(\Omega\) is the collection of all the ordered pairs: \[\Omega = \{\omega=(\omega_r,\omega_b) : \omega_r,\omega_b\in\{1,2,\dots,6\}\} .\] Now, suppose: \[A = \{\omega=(\omega_r,\omega_b) : \omega_r = 1\}\] and \[B = \{\omega=(\omega_r,\omega_b) : \omega_b = 6\} .\] If you roll the red die first and see \(1\) up (i.e., \(A\) happened), is the probability that the blue die turns with \(6\) up different from \(1/6\)? No, it should not be.

Concretely, the size of \(B\) with \(\Omega\) as sample space is \[|B| = |\{(1,6),(2,6),\dots,(6,6)\}| = 6 .\] In contrast, the size of \(B\) with \(A\) as sample space is \[|B \text{ given } A| = |\{(1,6)\}| = 1 .\] Hence, \[P(B) = \frac{|B|}{|\Omega|} = \frac{6}{36} = \frac{1}{6} = \frac{|B \text{ given } A|}{|A|} = P(B|A) .\]
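We can check these counts in R by enumerating the 36 ordered pairs:

```
# Enumerate all ordered outcomes (omega_r, omega_b)
grid <- expand.grid(r = 1:6, b = 1:6)
A <- grid$r == 1        # red die shows 1
B <- grid$b == 6        # blue die shows 6
mean(B)                 # P(B)   = 6/36 = 1/6
sum(A & B) / sum(A)     # P(B|A) = 1/6, so A and B are independent
```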

A partition of a set \(\Omega\) is a collection of non-empty subsets \(E_1,E_2,\dots\) such that every element \(\omega\in\Omega\) is in exactly one of these subsets. In other words, sample space is a union of mutually exclusive events: \[\Omega = \bigcup_k E_k .\] Partitioning helps us split a complex event \(A\) into a sequence of simpler events \(A\cap E_1\), \(A\cap E_2\), \(\dots\). This leads to the law of total probability.

Suppose events \(E_1,E_2,\dots,E_K\) partition the sample space \(\Omega\). Then,

\[\begin{align} P(A) &= P(A\cap\Omega)\\ &= P(A\cap(E_1\cup E_2\cup\dots\cup E_K))\\ &= P((A\cap E_1)\cup(A\cap E_2)\cup\dots\cup(A\cap E_K))\\ &= P(A\cap E_1)+P(A\cap E_2)+\dots+P(A\cap E_K)\\ &= P(A|E_1)P(E_1)+P(A|E_2)P(E_2)+\dots+P(A|E_K)P(E_K) \end{align}\]

Using the conditional probability formula, \(P(A\cap B)\) can be expressed in two ways:

- \(P(A\cap B) = P(B|A)P(A)\)
- \(P(A\cap B) = P(A|B)P(B)\)

Equating them, we have \(P(A|B)P(B) = P(B|A)P(A)\). Or, using a partition \(\Omega = E_1\cup\dots\cup E_K\) and setting \(B=E_k\) for some \(k\in\{1,2,\dots,K\}\), \[\begin{align} P(E_k|A) &= \frac{P(A|E_k)P(E_k)}{P(A)}\\ &= \frac{P(A|E_k)P(E_k)}{P(A|E_1)P(E_1)+\dots+P(A|E_K)P(E_K)} \end{align}\] which is given a special name — Bayes’ formula (aka Bayes’ rule or theorem).

Some comments:

- Bayes’ formula allows us to turn around conditional probabilities from \(P(A|E_k)\) to \(P(E_k|A)\). This can be useful. For instance, \(E_1,\dots,E_K\) are competing hypotheses, and we may have information \(P(A|E_k)\) for each hypothesis. Then, \(A\) can be used as evidence to assess their probabilities \(P(E_k|A)\).
- In Bayesian statistics, \(P(E_k)\) is called “prior” information, while \(P(E_k|A)\) is called “posterior” information. Bayes’ formula updates the prior to the posterior using collected data \(A\) and \(P(A|E_k)\) as the “likelihood” of \(A\) observed under \(E_k\).

Suppose we have a COVID-19 test that detects a coronavirus \(96\%\) of the time, but gives false positives \(2\%\) of the time. Assume that \(5\%\) of the population carries the virus. If a random person tests positive, what is the probability that they actually carry the virus?

First, define the following events.

- \(A\) : “Person tests positive”
- \(E_1\) : “Person has the virus”
- \(E_2\) : “Person does not have the virus”

(N.B. \(E_2 = \overline{E}_1\).) Then, we are given the following probabilities: \[ P(A|E_1) = \frac{96}{100},\quad P(A|E_2) = \frac{2}{100},\quad P(E_1) = \frac{5}{100} .\] Now, applying Bayes’ formula, \[P(E_1|A) = \frac{\frac{96}{100}\cdot\frac{5}{100}}{\frac{96}{100}\cdot\frac{5}{100}+\frac{2}{100}\cdot\frac{95}{100}} \approx 0.716 .\] Probably the most critical piece of information is what proportion of the population carries the virus. If our estimate changes from 5% to 1%, then we have \(P(E_1|A)\approx 0.327\). If a positive result implied only a 33% chance of infection, I expect much public policy and people’s behaviour would be different.
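The calculation is easy to reproduce in R. The helper function `posterior` below is my own, not from the text, with the example's sensitivity and false-positive rate as defaults:

```
# Bayes' formula for the two-event partition E1 (virus) and E2 (no virus)
posterior <- function(prior, sens = 0.96, fpr = 0.02) {
  sens * prior / (sens * prior + fpr * (1 - prior))
}
posterior(0.05)  # ~0.716
posterior(0.01)  # ~0.327
```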

When modelling practical problems, it is often the case that random outcomes are expressed in numerical values and/or we are interested in attaching numerical values to random outcomes. For example, when rolling a pair of dice, it may be the sum of the two dice that is of interest. A random variable is the mathematical device we use for this purpose.

Given the sample space \(\Omega\) and probability function \(P\), a random variable \(X\) is a function from the sample space to the real line, \(X : \Omega \to \mathbb{R}\).

I know it’s weird. As opposed to lower-case letters, a capital letter \(X\) is used for a function, yet it is called a “variable”. And despite the name “random” variable, it is just a deterministic function. Unfortunately, that is the convention.

Let \(A\) be a subset of the real line, \(A\subset\mathbb{R}\). Then, to express the probability that “\(X\) lies in \(A\)”, we write \(P(X\in A)\), which is a shorthand for \(P(\{\omega\in\Omega : X(\omega)\in A\})\).

We use a notation \(f(a-)\) and \(f(a+)\) to mean the limit of \(f(x)\) as \(x\) approaches \(a\), respectively, from the left and from the right.

Using the example of rolling two dice, we illustrate four random variables:

- \(X_1\) : the outcome of the red die
- \(X_2\) : the outcome of the blue die
- \(X_3\) : the sum of the two dice
- \(X_4\) : the number of rolls that give one or two

Formally, \[ X_1(\omega_r,\omega_b) = \omega_r,\; X_2(\omega_r,\omega_b) = \omega_b,\; X_3(\omega_r,\omega_b) = X_1+X_2 = \omega_r+\omega_b .\] To express \(X_4\), let me first introduce an *indicator function* of a set \(B\). It takes \(x\) as an argument, and returns \[
I_B(x) =
\begin{cases}
1, & \text{if } x\in B\\
0, & \text{if } x\notin B
\end{cases}.
\] With this, we may write \[X_4(\omega_r,\omega_b) = I_{\{1,2\}}(\omega_r) + I_{\{1,2\}}(\omega_b) .\] Now, suppose an outcome turns out \(\omega=(1,6)\). Then, we have the following realizations of the random variables: \[ X_1(1,6) = 1,\; X_2(1,6) = 6,\; X_3(1,6) = 7 ,\] and \[X_4(1,6) = I_{\{1,2\}}(1) + I_{\{1,2\}}(6) = 1 + 0 = 1 .\] How about the probability that the sum of two dice is equal to a specific value, say 8? First, realize that each element of the event \(\{(\omega_r,\omega_b):\omega_r+\omega_b=8\}\) is equally likely. So, this is a counting problem, and we have five possible outcomes: \((2,6)\), \((3,5)\), \((4,4)\), \((5,3)\), and \((6,2)\). Thus, \[\begin{align}
P(X_3=8) &= P(\{(\omega_r,\omega_b) : \omega_r+\omega_b = 8\})\\
&= \frac{5}{36}
\end{align}\]
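This count is easy to verify in R by enumerating the outcomes:

```
grid <- expand.grid(r = 1:6, b = 1:6)  # all 36 equally likely pairs
x3 <- grid$r + grid$b                  # realisations of X3
sum(x3 == 8)                           # 5 outcomes sum to 8
mean(x3 == 8)                          # P(X3 = 8) = 5/36
```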

When interested in an event where \(X\) lies in a singleton set, we often write \(P(X=x)\) to mean \(P(X\in\{x\})\).

We can extend the concept of independent events to independent random variables. Random variables \(X\) and \(Y\) are independent if any event defined using \(X\) is independent of any event defined using \(Y\). That is, for any sets \(A\) and \(B\), the events \(\{X \in A\}\) and \(\{Y \in B\}\) are independent.

Independence is usually an assumption that we make based on our understanding of the random process in question, rather than something we try to prove.

The cumulative distribution function (CDF) of a random variable \(X\) is defined by \[ F(x) = P(X\leq x)\; \text{ for all }x\in\mathbb{R}.\] An implication of the weak inequality \(\leq\) is that the CDF specifies a probability of a left-open right-closed interval \((a,b]\): \[ P(X\in(a,b]) = P(a<X\leq b) = F(b)-F(a) .\] Since \((a,b]\subset\mathbb{R}\) is arbitrary, using combinations of such intervals, we can describe a probability that \(X\) lies in any (legitimate) subset. An important implication is that a random variable can be described by a CDF without referring to the underlying sample space and probability function.

We may plot the CDF of \(X_3\) as follows.

```
# CDF of X3 (the sum of two dice); jumps occur at x = 2, ..., 12
x <- 2:12
y1 <- c(1,3,6,10,15,21,26,30,33,35,36)/36       # F(x) at each jump point
plot(x, y1, xlim=c(0,13), ylim=c(0,1), ylab="F(x)", pch=16)  # closed dots: F(x)
y2 <- c(0, y1[1:10])                            # left limits F(x-)
points(x, y2)                                   # open dots: F(x-)
segments(x[1:10], y1[1:10], x[2:11], y1[1:10])  # flat steps between jumps
segments(c(2,12), c(0,1), c(-1,14), c(0,1))     # tails: F = 0 and F = 1
segments(x, y2, x, y1, lty=3)                   # dotted verticals at jumps
title(expression("The CDF of X"[3]))
```

By definition, a CDF is always right-continuous, so continuity at \(x\) is equivalent to \(F(x) = F(x-)\).

A random variable \(X\) is **discrete** if there exists a finite or countably infinite set \(B\subset\mathbb{R}\) such that \(P(X \in B) = 1\).

The definition implies that the CDF of \(X\) is a step function that jumps at each element of \(B\). For example, as seen in the above figure, the CDF of \(X_3\) has 11 jumps and is flat for the rest.

The **probability mass function** (PMF) of a discrete random variable \(X\) is the function \(p\) defined by \[p(x) = P(X=x)\] for all \(x\in\mathbb{R}\). The PMF is related to the CDF as follows: \[p(x) = F(x)-F(x-).\] In other words, \(p(x)\) is equal to the height of the jump at \(x\), and \(0\) where there is no jump.
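In R we can recover the PMF of \(X_3\) from the jump heights of its CDF (the CDF values are those used in the plot above):

```
# F(x) at the jump points x = 2, ..., 12
F3 <- c(1, 3, 6, 10, 15, 21, 26, 30, 33, 35, 36) / 36
p3 <- diff(c(0, F3))  # jump heights F(x) - F(x-)
p3 * 36               # counts: 1 2 3 4 5 6 5 4 3 2 1
```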

If a non-negative function \(f\) on \(\mathbb{R}\) satisfies \[P(X\leq b) = F(b) = \int_{-\infty}^b f(x)dx\] for all \(b\in\mathbb{R}\), then \(f\) is the **probability density function** (PDF) of \(X\).

A random variable \(X\) is (absolutely) **continuous** if \(X\) has a PDF.

An implication is that the CDF of \(X\) is a continuous function and \(F(b) = F(b-)\) for all \(b\in\mathbb{R}\). Thus, \[ P(X=b) = F(b)-F(b-) = 0 .\]

As done above, we can define multiple random variables on the same sample space \(\Omega\) and probability function \(P\). We may put together \(X_1, X_2,\dots, X_d\) and call it a random vector \(\mathbf{X} = (X_1, X_2,\dots, X_d)\). Since each component \(X_k\) takes a value \(x\in\mathbb{R}\), \(\mathbf{X}\) takes a vector \(\mathbf{x}\in\mathbb{R}^d\).

The concepts for random variables are naturally generalized to random vectors.

- The probability that \(\mathbf{X}\) lies in \(A\subset\mathbb{R}^d\) is \[P(\mathbf{X}\in A) = P(\{\omega\in\Omega : \mathbf{X}(\omega)\in A\}) .\]
- The **joint** PMF for \(\mathbf{X}\) is \[p(\mathbf{x}) = P(\mathbf{X}=\mathbf{x}) = P(X_1=x_1,X_2=x_2,\dots,X_d=x_d).\]
- A non-negative function \(f\) on \(\mathbb{R}^d\) is the **joint** PDF for \(\mathbf{X}\) if \(f\) satisfies \[P(X_1\leq x_1,\dots,X_d\leq x_d) = \int_{-\infty}^{x_1}\dots\int_{-\infty}^{x_d}f(y_1,\dots,y_d)dy_1\dots dy_d.\]

Let \(p_\mathbf{X}\) be the joint PMF of \(\mathbf{X} = (X_1,\dots, X_d)\). Then, for \(k\leq d\), the marginal PMF of \(X_k\) is \[p_{X_k}(x) = \sum_{x_1,\dots,x_{k-1},x_{k+1},\dots,x_d} p_{\mathbf{X}}(x_1,\dots,x_{k-1},x,x_{k+1},\dots,x_d) .\] In words, it is obtained by summing the joint PMF over all possible values of the other random variables with fixed \(X_k=x\).

The marginal PDF of \(X_k\) is defined analogously: \[f_{X_k}(x) = \int_{-\infty}^{\infty}\dots\int_{-\infty}^{\infty}f(x_1,\dots,x_{k-1},x,x_{k+1},\dots,x_d)dx_1\dots dx_{k-1}dx_{k+1}\dots dx_d .\]

We can also marginalise to PMFs/PDFs of more than one dimension by summing/integrating over all possible values of the random variables except those of interest.
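For the two fair dice, marginalising the joint PMF is a one-line row sum in R:

```
joint <- matrix(1/36, nrow = 6, ncol = 6)  # joint PMF of (X1, X2)
p_x1 <- rowSums(joint)                     # marginal PMF of X1
p_x1                                       # each entry is 1/6
```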

Let’s consider a discrete random variable \(X\), which takes a value \(x\) with probability of \(p(x)\). It makes sense to think about what value to expect from this random event. One intuitive value to expect is the average of all possible values, isn’t it? Moreover, since each possible value has its own associated probability, it is more reasonable to use the **weighted average** with \(p(x)\) as a weight for \(x\).

Mathematical expectation formalises this intuition. For our purpose, it suffices to see expectation as weighted average.

In these notes, when discussing expectation, we implicitly assume it exists.

The expectation \(\mathbb{E}X\) of a discrete random variable \(X\) is defined by \[\mathbb{E}X = \sum_x xp(x) .\] This is precisely the weighted average — each value \(x\) is weighted by \(p(x)\) and summed up while each weight \(p(x)\) sums to \(1\).

In the case of continuous \(X\), summation over a countable set becomes integral over an uncountable set. In addition, each weight is provided by the PDF \(f(x)\), as it rightly “sums” (well, integrates) to \(1\). Hence, \[\mathbb{E}X = \int_{-\infty}^\infty xf(x)dx .\]
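As a sketch with a continuous example not in the text (a Uniform\((0,1)\) random variable, whose PDF is \(1\) on \([0,1]\)), we can check \(\mathbb{E}X = 1/2\) numerically:

```
f <- function(x) ifelse(x >= 0 & x <= 1, 1, 0)     # PDF of Uniform(0,1)
EX <- integrate(function(x) x * f(x), 0, 1)$value  # integral of x f(x)
EX  # 0.5
```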

When there is no transformation of \(X\), \(\mathbb{E}X\) is called the **mean** of \(X\).

We are often interested in \(h(X)\), a transformation of \(X\) by a deterministic function \(h\), and its expectation \(\mathbb{E}h(X)\).

Here is the formula. \[ \mathbb{E}h(X) = \begin{cases} \sum_x h(x)p(x), & \text{if } X \text{ is discrete}\\ \int_{-\infty}^\infty h(x)f(x)dx, & \text{if } X \text{ is continuous} \end{cases} \]

Remember the definition of a random variable — a function mapping \(\omega\in\Omega\) to \(x\in\mathbb{R}\). Since a function of a function is just another function, \(h(X)\) can be seen as another random variable (say, \(Y=h(X)\)). Finally, the probability of \(h(X)=y\) is obtained by summing the probabilities of \(X=x_i\) such that \(h(x_i)=y\) for all \(i\), because the two sets \(\{\omega\in\Omega : X(\omega)=x_i\}\) and \(\{\omega\in\Omega : X(\omega)=x_j\}\) are mutually exclusive for \(x_i\neq x_j\).

The last point is seen at the second equality in the following proof for a discrete random variable. \[\begin{align} \mathbb{E}Y &= \sum_y y P(Y=y)\\ &= \sum_y y \sum_{x:y=h(x)} p(x)\\ &= \sum_y \sum_{x:y=h(x)} h(x)p(x)\\ &= \sum_x h(x)p(x) \end{align}\]
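The proof can be checked numerically. Here is a sketch of my own with \(h(x)=x^2\) on a fair die, computing \(\mathbb{E}h(X)\) both directly and via the PMF of \(Y=h(X)\):

```
x <- 1:6; p <- rep(1/6, 6)  # fair die PMF
direct <- sum(x^2 * p)      # sum_x h(x) p(x)
y <- unique(x^2)            # support of Y = X^2
via_Y <- sum(y * sapply(y, function(v) sum(p[x^2 == v])))  # sum_y y P(Y=y)
c(direct, via_Y)            # both equal 91/6
```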

Let’s examine two cases of random vectors using \(\mathbf{X} = (X_1,X_2)\) defined in the example above. Since each die roll is an independent experiment, we know the joint PMF assigns \(1/36\) to each of the \(36\) ordered pairs.

```
# Compute E(X1 + X2) and E(X1 * X2) by summing over the 36 outcomes
sum1 <- 0
sum2 <- 0
for (x1 in 1:6) {
  for (x2 in 1:6) {
    sum1 <- sum1 + x1 + x2   # accumulates 252
    sum2 <- sum2 + x1 * x2   # accumulates 441
  }
}
sum1/36   # E(X1 + X2) = 7
sum2/36   # E(X1 * X2) = 12.25
```

First, \(h_1(x_1,x_2) = x_1+x_2\). \[\mathbb{E}h_1(\mathbf{X}) = \mathbb{E}h_1(X_1,X_2) = \mathbb{E}(X_1+X_2) = \sum_{x_1,x_2}(x_1+x_2)\frac{1}{36} = \frac{252}{36} = 7 .\]

Second, \(h_2(x_1,x_2) = x_1x_2\). \[\mathbb{E}h_2(\mathbf{X}) = \mathbb{E}h_2(X_1,X_2) = \mathbb{E}(X_1X_2) = \sum_{x_1,x_2}x_1x_2\frac{1}{36} = \frac{441}{36} = 12.25 \]

Due to the linearity of summation and integration, expectation is linear: \[\mathbb{E}(aX+bY) = a\mathbb{E}X + b\mathbb{E}Y .\]

Thus, we can also do the following to compute \(\mathbb{E}h_1(X_1,X_2)\): \[\mathbb{E}h_1(X_1,X_2) = \mathbb{E}(X_1+X_2) = \mathbb{E}X_1+\mathbb{E}X_2 = \sum_{x=1}^6 x\frac{1}{6} + \sum_{x=1}^6 x\frac{1}{6} = 7 .\]

When a transformation \(h\) takes the form \(h : x \mapsto (x-\mathbb{E}X)^2\), \(\mathbb{E}h(X)\) is called the **variance** of \(X\). We denote \(\mathbb{E}(X-\mathbb{E}X)^2\) by \(\operatorname{Var} X\).

N.B. \(\mathbb{E}X\) is a constant value and no longer random.

The square root of the variance is the **standard deviation**.

Both variance and standard deviation are measures of spread. Variance is easier to calculate, but standard deviation is easier to interpret, as it has the same unit of measurement as the random variable.
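For the fair die, a quick R sketch of the mean, variance, and standard deviation:

```
x <- 1:6; p <- rep(1/6, 6)   # fair die PMF
mu <- sum(x * p)             # E X = 3.5
v  <- sum((x - mu)^2 * p)    # Var X = 35/12, about 2.917
c(mu, v, sqrt(v))            # mean, variance, standard deviation
```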

The covariance \(\operatorname{Cov}(X,Y)\) of two random variables \(X\) and \(Y\) is defined by \[\operatorname{Cov}(X,Y) = \mathbb{E}(X-\mathbb{E}X)(Y-\mathbb{E}Y) .\] The covariance is a measure of how \(X\) and \(Y\) “co-vary” together.

Although most formulas associated with variance and covariance are straightforward to derive from their definitions, here is an important one worthy of our attention — the independence between \(X\) and \(Y\) implies \(\operatorname{Cov}(X,Y)=0\). \[\begin{align} \operatorname{Cov}(X,Y) &= \mathbb{E}(X-\mathbb{E}X)(Y-\mathbb{E}Y)\\ &= \mathbb{E}(XY - X\mathbb{E}Y - \mathbb{E}(X)Y + \mathbb{E}X\mathbb{E}Y )\\ &= \mathbb{E}(XY) - \mathbb{E}X\mathbb{E}Y - \mathbb{E}X\mathbb{E}Y + \mathbb{E}X\mathbb{E}Y\\ &= \mathbb{E}X\mathbb{E}Y - \mathbb{E}X\mathbb{E}Y\\ &= 0, \end{align}\] where we use the independence for \(\mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y)\).
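For the two independent dice, we can confirm \(\operatorname{Cov}(X_1,X_2)=0\) exactly in R, using the intermediate expression \(\mathbb{E}(XY)-\mathbb{E}X\,\mathbb{E}Y\) from the derivation above:

```
grid <- expand.grid(x = 1:6, y = 1:6)  # joint support, each pair has prob 1/36
EX  <- sum(grid$x) / 36                # E X1 = 3.5
EY  <- sum(grid$y) / 36                # E X2 = 3.5
EXY <- sum(grid$x * grid$y) / 36       # E(X1 X2) = 12.25
EXY - EX * EY                          # covariance = 0
```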

When \(0<\operatorname{Var} X<\infty\), \(0<\operatorname{Var} Y<\infty\), and \(\operatorname{Cov}(X,Y)\) is finite, similar to standard deviation, we can normalize the covariance and obtain the correlation coefficient: \[\operatorname{Corr}(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var} X}\sqrt{\operatorname{Var} Y}} .\] The correlation coefficient has the following properties:

- \(-1\leq \operatorname{Corr}(X,Y)\leq 1\)
- \(\operatorname{Corr}(X,Y)=1\) if and only if there exist \(a>0\) and \(b\in\mathbb{R}\) such that \(Y=aX+b\).
- \(\operatorname{Corr}(X,Y)=-1\) if and only if there exist \(a<0\) and \(b\in\mathbb{R}\) such that \(Y=aX+b\).

Let \(\{X_1,X_2,\dots,X_n\}\) be IID random variables with finite mean \(\mu = \mathbb{E}X_k\) and finite variance \(\operatorname{Var} X_k\). Also, let the sample average be denoted by \(\overline{X} = (X_1+\dots+X_n)/n\). Then, for any fixed \(\epsilon>0\), we have a theorem called the weak law of large numbers: \[ \lim_{n\to\infty} P\left( \left| \overline{X} - \mu \right| > \epsilon \right) = 0 .\]

An interpretation is that the sample average, in some sense, converges to the population mean.

It is “weak” because of the type of convergence, namely “convergence in probability”, which is silent on whether \(\overline{X}\) itself converges to \(\mu\).

The theorem provides the basis for justifying arguments built on stochastic simulations. Specifically, we run many stochastic simulations, each of which produces realisations of the random variables of interest. We then average the realisations to make arguments about their population means.
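A small simulation of my own illustrates the weak law for fair die rolls (\(\mu = 3.5\)); the seed and sample sizes are arbitrary choices:

```
set.seed(42)
for (n in c(10, 1000, 100000)) {
  xbar <- mean(sample(1:6, n, replace = TRUE))    # sample average of n rolls
  cat("n =", n, " sample average =", xbar, "\n")  # approaches 3.5 as n grows
}
```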