
Probability theory in a nutshell

This is not meant to teach anyone probability theory. It is meant as a quick and dirty reference. I have followed the convention that a subscript of \( n \) means the index set is \( ℕ \).

Basic Theory

Disclaimer: Many of the ideas and examples have been taken from Manjunath Krishnapur's probability theory notes, which I would highly recommend for a thorough understanding of the subject.

Discrete probability spaces (countable outcomes)

Let \( Ω \) be a countable set and \( 𝕡: Ω → [0, 1] \) such that \( ∑_{ω ∈ Ω} 𝕡(ω) = 1 \). Then we call \( (Ω, 𝕡) \) a discrete probability space. For any \( E ⊆ Ω \), define the probability of \( E \) as \( ℙ(E) = ∑_{ω ∈ E} 𝕡(ω) \).

Probability of selecting a prime

Draw a random integer from 1 to 100. What is the chance that it is a prime number? Here \( Ω = \{1, 2, \dots, 100\} \) and \( E = \{2, 3, \dots, 97\} \), the set of the \( 25 \) primes up to \( 100 \), so \( ℙ(E) = \frac{25}{100} = \frac14 \).
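A quick sanity check of this count in Python (trial division is plenty for \( n ≤ 100 \)):

```python
def is_prime(n):
    """Trial-division primality test (fine for small n)."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

# Ω = {1, ..., 100}, E = the primes in Ω
omega = range(1, 101)
E = [w for w in omega if is_prime(w)]
prob = len(E) / 100
print(len(E), prob)  # 25 0.25
```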

Moral:

  • Simple to set up the theory.
  • Actual computations may still be difficult.

Continuous probability spaces (uncountable outcomes)

Problems

  • Example. Breaking a stick at random: Here \( Ω = [0, 1] \). But clearly "\( ℙ[0.25, 0.5] = ∑_{ω ∈ [0.25, 0.5]} 𝕡(ω) \)" makes no sense! (Singletons have zero probability, so adding uncountable zeros to get a positive number sounds weird.)
  • Example. Toss a fair coin infinitely many times: Here \( Ω = \{ 0, 1 \}^ℕ \) (uncountable). Let \( E \) be the event that the first two tosses are heads. Clearly, \( ℙ(E) = 2^{-2} \). But how do we sum up (uncountably many) zeros to get this number?
  • Example. Throw a dart at a dart-board: Same as the “breaking a stick” example, but in two dimensions.

Solution: use measure theory

  1. Abandon the idea that every subset of the sample space can be assigned a probability (there exist non-measurable sets).
  2. Assume probabilities of certain (simple) events, and compute probabilities of more complicated events using them (start with a probability measure on an algebra, and use the Carathéodory extension theorem to extend it to a σ-algebra containing the algebra).

Measure-theoretic probability

Probability space: A triple \( (Ω, ℱ, ℙ) \), where

  1. \( Ω \) is a set containing the elementary outcomes.
  2. \( ℱ ⊆ 2^Ω \) is a σ-algebra on \( Ω \), i.e. i. \( ∅ ∈ ℱ \), ii. \( E ∈ ℱ ⟹ E^∁ ∈ ℱ \), and iii. \( (E_n)_{n ∈ ℕ} ⊂ ℱ ⟹ ⋃ E_n ∈ ℱ \).

  3. \( ℙ: ℱ → [0, 1] \) is the probability measure on the measurable space \( (Ω, ℱ) \), i.e.

    i. \( ℙ(∅) = 0 \), ii. (σ-additivity) if \( (E_n)_{n ∈ ℕ} ⊂ ℱ \) is a sequence of pairwise disjoint sets in \( ℱ \), then \( ℙ(⨆ E_n) = ∑ ℙ(E_n) \), and iii. (normalization) \( ℙ(Ω) = 1 \).

σ-algebras

  • Elements of \( ℱ \) are called events.
  • The σ-algebra specifies which subsets of \( Ω \) are measurable. Essentially, these are the events to which we can assign a probability in a meaningful way. Thus, in probability theory, we understand the σ-algebra to contain "information" about the system. The finer the σ-algebra, the more information we have.
  • (Embedding discrete probability spaces within the framework) Since all events are measurable in a discrete probability space, we model it as \( (Ω, 2^Ω, ℙ) \), where we define \( ℙ(E) = ∑_{ω ∈ E} 𝕡(ω) \) for any \( E ⊆ Ω \).
  • An increasing sequence of σ-algebras (\( (ℱ_n) \) such that \( ℱ_n ⊆ ℱ_{n+1} ∀n \)) is called a filtration, and the quadruple \( (Ω, ℱ, (ℱ_n), ℙ) \) is called a filtered probability space. Filtered probability spaces are used to model systems that evolve in time.
  • The σ-algebra generated by a class of events \( ℰ \), denoted \( σ(ℰ) \), is the smallest σ-algebra containing \( ℰ \). It is easy to see that \( σ(ℰ) = ⋂ \{ ℱ : ℱ \text{ is a σ-algebra}, ℱ ⊇ ℰ \} \).
  • An event \( E \) is said to happen almost surely (denoted "\( E \) a.s.") if \( ℙ(E^∁) = 0 \).
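The embedding of discrete probability spaces can be made concrete; a minimal sketch, assuming a fair die as the running example (all names hypothetical):

```python
from fractions import Fraction

# A discrete probability space as a dict ω ↦ 𝕡(ω); here a fair die.
p = {w: Fraction(1, 6) for w in range(1, 7)}
assert sum(p.values()) == 1  # 𝕡 sums to 1 over Ω

def P(E):
    """P(E) = Σ_{ω ∈ E} 𝕡(ω) for any E ⊆ Ω (here ℱ = 2^Ω)."""
    return sum(p[w] for w in E)

evens = {2, 4, 6}
print(P(evens))        # 1/2
print(P(set(p)) == 1)  # True: P(Ω) = 1
```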

Random variables

  • A random variable ("RV") is a measurable function \( X: (Ω, ℱ, ℙ) → (\bar{Ω}, \bar{ℱ}) \), i.e., \( ∀ \bar{E} ∈ \bar{ℱ}, X^{-1}(\bar{E}) ∈ ℱ \). The most common example of \( (\bar{Ω}, \bar{ℱ}) \) is \( (ℝ, ℬ) \) (and its higher finite-dimensional equivalents), where \( ℬ \), called the Borel σ-algebra on \( ℝ \), is the σ-algebra generated by the open sets (equivalently, by the closed sets or the half-open intervals). From now on, we will assume \( (\bar{Ω}, \bar{ℱ}) = (ℝ, ℬ) \).
  • A RV \( X \) is called integrable, or \( X ∈ L^1(Ω, ℱ, ℙ) \), if \( ∫|X| dℙ < ∞ \). We denote \( 𝔼(X) = ∫X dℙ = ∫X(ω) ℙ(dω) \) and call it the expectation of \( X \). If \( X ∈ L^2(Ω, ℱ, ℙ) \), then we denote \( 𝕍(X) = 𝔼((X - 𝔼(X))^2) \) and call it the variance of \( X \).
  • The σ-algebra generated by a RV \( X \), denoted \( σ(X) \), is the smallest σ-algebra that makes \( X \) measurable. Again, it can be shown that \( σ(X) = X^{-1}(ℬ) = \{ X^{-1}(B) : B ∈ ℬ \} \).
  • (Pushforward of a measure under a RV) Let \( X: (Ω, ℱ, ℙ) → (\bar{Ω}, \bar{ℱ}) \) be a RV. Then \( ℙ_X := ℙ ∘ X^{-1} \) is a probability measure on \( (\bar{Ω}, \bar{ℱ}) \), called the law (or distribution) of \( X \).
  • The distribution function of a RV \( X \) is defined by \( F_X(x) = ℙ\{ X ≤ x \} = ℙ\{ X ∈ (-∞, x] \} = ℙ_X(-∞, x] \).
  • If \( ℙ_X ≪ λ \) (\( ℙ_X \) is absolutely continuous w.r.t. the Lebesgue measure \( λ \)), then the density of a RV \( X \) is defined by the Radon-Nikodym derivative \( f_X = \frac{d ℙ_X}{d λ} = \frac{d ℙ_X}{d x} \). In layman's terms, \( f_X = \frac{d F_X}{d x} \) (when the derivative exists). From now on, whenever we write \( f_X \), we assume that it exists.
  • Theorem (change of variables): \( 𝔼(ϕ(X)) = ∫_ℝ ϕ(x) \, ℙ_X(dx) \).
  • Example of events in terms of RVs: \( ℙ\{ X ∈ (a, b] \} = ℙ\{ ω ∈ Ω : X(ω) ∈ (a, b] \} = ℙ_X(a, b] = F_X(b) - F_X(a) = ∫_a^b f_X(x) \, dx \).
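The identity \( 𝔼(ϕ(X)) = ∫ ϕ \, dℙ_X \) can be sanity-checked numerically; a sketch assuming \( X \sim \text{Exponential}(λ) \) and \( ϕ(x) = x^2 \), for which \( 𝔼(X^2) = 2/λ^2 \) is known in closed form:

```python
import math

# Check 𝔼(ϕ(X)) = ∫ ϕ(x) f_X(x) dx numerically for X ~ Exponential(λ).
lam = 2.0
f_X = lambda x: lam * math.exp(-lam * x)   # density of Exponential(λ)
phi = lambda x: x * x

# midpoint rule on [0, 50] (the tail beyond is negligible)
n, a, b = 200_000, 0.0, 50.0
h = (b - a) / n
integral = sum(phi(a + (k + 0.5) * h) * f_X(a + (k + 0.5) * h) for k in range(n)) * h
print(abs(integral - 2 / lam**2) < 1e-5)  # True
```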

Independence

  • A sequence of σ-algebras \( (ℱ_n) \) on \( Ω \) is called (mutually) independent if \( ∀ n ∈ ℕ \) and \( ∀ E_i ∈ ℱ_i, i ∈ \{1, 2, \dots, n \} \), we have \( ℙ(⋂_{i = 1}^n E_i) = ∏_{i = 1}^n ℙ(E_i) \).
  • A sequence of events is called independent if the σ-algebras generated by the events are independent.
  • A sequence of RVs is called independent if the σ-algebras generated by them are independent.
  • A sequence of RVs is called independent and identically distributed ("IID") if the RVs are independent and have the same law.
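Independence can be checked exactly on a finite space; a sketch with two fair dice, where the product formula holds by direct enumeration (events and names chosen for illustration):

```python
from fractions import Fraction
from itertools import product

# Ω = {1,…,6}² with uniform 𝕡; X = first die, Y = second die.
omega = list(product(range(1, 7), repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] % 2 == 0}   # {X even} ∈ σ(X)
B = {w for w in omega if w[1] >= 5}       # {Y ≥ 5} ∈ σ(Y)

# independence: P(A ∩ B) = P(A)·P(B), exactly
print(P(A & B) == P(A) * P(B))  # True
```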

Some common probability measures on ℝ

Discrete

  • \( \text{Binomial}(n, p) \): \( 𝕡(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \ k ∈ \{0, 1, \dots, n\} \).
  • \( \text{Poisson}(λ) \), \( λ ≥ 0 \): \( 𝕡(k) = e^{-λ} \frac{λ^k}{k!}, \ k ∈ ℕ ∪ \{0\} \).

Continuous

  • \( \text{Uniform}(a, b) \): the normalized Lebesgue measure on \( [a, b] \), i.e. \( \frac{d u}{d x}(x) = \frac{1}{b - a} 𝟙_{[a, b]}(x) \).
  • \( \text{Gaussian} \ 𝒩(μ, σ^2) \): \( \frac{d γ}{d x}(x) = \frac{1}{\sqrt{2 π σ^2}} \exp \left( -\frac{(x - μ)^2}{2 σ^2} \right) \).
  • \( \text{Exponential}(λ) \): \( \frac{d η}{d x}(x) = λ e^{-λx} 𝟙_{x ≥ 0}(x) \).
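As a sketch, each of these densities can be checked to integrate to \( 1 \) with a simple midpoint rule (parameter values below are arbitrary; the integration windows are chosen so the truncated tails are negligible):

```python
import math

# Numerically verify each density integrates to 1 (midpoint rule).
def integrate(f, a, b, n=100_000):
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

a_, b_ = 2.0, 5.0
uniform = lambda x: 1 / (b_ - a_)                     # Uniform(2, 5) on [2, 5]
mu, sigma = 0.0, 1.0
gauss = lambda x: math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)
lam = 1.5
expo = lambda x: lam * math.exp(-lam * x)             # support x ≥ 0

print(round(integrate(uniform, a_, b_), 6))           # 1.0
print(round(integrate(gauss, -10, 10), 6))            # 1.0 (tails beyond ±10σ negligible)
print(round(integrate(expo, 0, 40), 6))               # 1.0
```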

The Borel-Cantelli Lemmas

Let \( (E_n) ⊂ ℱ \) be a sequence of events. We define the following tail events

  • \( \liminf_{n → ∞} E_n = ⋃_{n ∈ ℕ} ⋂_{m ≥ n} E_m = \{ E_n \text{ ev} \} ∈ ℱ \), where ev = eventually, and
  • \( \limsup_{n → ∞} E_n = ⋂_{n ∈ ℕ} ⋃_{m ≥ n} E_m = \{ E_n \text{ i.o.} \} ∈ ℱ \), where i.o. = infinitely often.
  • By De Morgan's laws, \( \{ E_n \text{ i.o.} \}^∁ = \{ E_n^∁ \text{ ev} \} \).

Borel-Cantelli Lemmas

  • (BC1) If \( ∑ ℙ(E_n) < ∞ \), then \( ℙ\{ E_n \text{ i.o.} \} = 0 \).
  • (BC2) If \( (E_n) \) are independent and \( ∑ ℙ(E_n) = ∞ \), then \( ℙ\{ E_n \text{ i.o.} \} = 1 \).
  • Example. (Infinite Monkey Theorem) Shakespeare's complete works consist of a total of [\( 884,421 \)](https://www.opensourceshakespeare.org/stats/) words. Assume that the average English word length is \( 5.1 \) characters, so the total number of characters is \( 4510547 \). Let a monkey be typing on a keyboard randomly (independent keystrokes). The keyboard has \( 30 \) characters, and \( E_n \) denotes the event that the \( n \)th block of \( 4510547 \) keystrokes replicates Shakespeare's works. Clearly, \( ℙ(E_n) = 30^{-4510547} \), a (tiny) positive constant. Then \( ∑ ℙ(E_n) = ∞ \), and since the \( (E_n) \) are independent, by BC2, \( ℙ\{ E_n \text{ i.o.} \} = 1 \). That is, the monkey will replicate Shakespeare's works infinitely often!
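BC1 is easy to see in a simulation; a sketch with independent events \( E_n = \{ U_n < 1/n^2 \} \), so that \( ∑ ℙ(E_n) = ∑ 1/n^2 < ∞ \) (seed and cutoff are arbitrary choices):

```python
import random

# BC1 sketch: independent events E_n with P(E_n) = 1/n², so Σ P(E_n) < ∞.
# BC1 predicts only finitely many E_n occur (a.s.), so occurrences
# should dry up very quickly in a simulation.
random.seed(0)
occurred = [n for n in range(1, 1_000_001) if random.random() < 1 / n**2]
print(occurred)  # a handful of small indices; E_1 always occurs since P(E_1) = 1
```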

Modes of convergence

  • Sure convergence or pointwise convergence (pointless!)
  • Complete convergence
  • A.s. convergence
  • \( L^p \) convergence
  • Convergence in probability
  • Weak* convergence, or convergence in distribution
  • Vague convergence

Important theorems

  • Markov inequality: \( ℙ\{ |X| ≥ t \} ≤ \frac{𝔼|X|}{t} \) for all \( t > 0 \).
  • Borel-Cantelli Lemmas (see above).
  • (Laws of Large Numbers) Let \( (X_n) \) be a sequence of IID RVs with \( X_1 ∈ L^1 \), and \( S_n = ∑_{i = 1}^n X_i \). Then

    • (WLLN) \( \frac{S_n}{n} → 𝔼(X_1) \) in probability as \( n → ∞ \).
    • (SLLN) \( \frac{S_n}{n} → 𝔼(X_1) \) a.s. as \( n → ∞ \).
  • Example. (WLLN) Bernstein polynomials uniformly approximate continuous functions on \( [0, 1] \) (a probabilistic proof of the Weierstrass Approximation Theorem).
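The Bernstein-polynomial proof is fully constructive, so it can be demonstrated directly: \( B_n f(x) = ∑_k f(k/n) \binom{n}{k} x^k (1-x)^{n-k} = 𝔼 f(S_n/n) \) with \( S_n \sim \text{Binomial}(n, x) \), and the WLLN drives \( B_n f → f \). A sketch (the test function and grid are arbitrary choices):

```python
from math import comb

# Bernstein polynomial: B_n f(x) = Σ_k f(k/n) C(n,k) x^k (1-x)^{n-k}
def bernstein(f, n, x):
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k) for k in range(n + 1))

f = lambda x: abs(x - 0.5)  # continuous but not smooth
grid = [i / 200 for i in range(201)]
err = lambda n: max(abs(bernstein(f, n, x) - f(x)) for x in grid)
print(err(10) > err(40) > err(160))  # True: the sup-norm error shrinks as n grows
```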

  • (Kolmogorov's 0-1 Law) Let \( (X_n) \) be a sequence of independent RVs. If \( E_T \) is a tail event \( (E_T ∈ ℱ_T = ⋂_{n ∈ ℕ} σ(X_n, X_{n+1}, \dots)) \), then \( ℙ(E_T) = 0 \) or \( ℙ(E_T) = 1 \).

  • (Central Limit Theorem) Let \( (X_n) \) be a sequence of IID RVs with \( 𝔼(X_1) = μ, 𝕍(X_1) = σ^2 < ∞ \), and let \( S_n = ∑_{i = 1}^n X_i \). Then \( \frac{S_n - nμ}{σ \sqrt{n}} → 𝒩(0, 1) \) in distribution as \( n → ∞ \).
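A quick illustration of the CLT scaling: standardize \( S_n \) for IID \( \text{Uniform}(0, 1) \) summands (\( μ = 1/2 \), \( σ^2 = 1/12 \)) and check that the standardized sums have roughly the mean and variance of \( 𝒩(0, 1) \). The seed and sample sizes are arbitrary:

```python
import random, math

# Standardize S_n = Σ X_i for X_i ~ Uniform(0,1): (S_n - nμ)/(σ√n) ≈ 𝒩(0,1).
random.seed(42)
n, reps = 1_000, 2_000
mu, sigma = 0.5, math.sqrt(1 / 12)
z = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * math.sqrt(n))
     for _ in range(reps)]
zbar = sum(z) / reps
zvar = sum(v * v for v in z) / reps
# both should be close to the 𝒩(0,1) values 0 and 1
print(abs(zbar) < 0.15, abs(zvar - 1) < 0.15)
```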

Conditioning

  • Motivation: \( ℙ(B | A) = \frac{ℙ(B ∩ A)}{ℙ(A)} \). But \( ℙ(A) \) may be zero!
  • (Conditional expectation) Let \( (Ω, ℱ, ℙ) \) be a complete probability space (complete means that all subsets of sets of measure \( 0 \) are in \( ℱ \)), let \( X ∈ L_+^1(Ω, ℱ, ℙ) \) be a non-negative integrable random variable, and let \( 𝒢 ⊆ ℱ \) be a sub σ-algebra. On \( 𝒢 \), we define the measure induced by \( X \) as \( ℚ(A) = 𝔼(X 𝟙_A) \ ∀ A ∈ 𝒢 \). Then \( ℚ ≪ ℙ \), and so by the Radon-Nikodym theorem, there exists an (a.s. unique) \( 𝒢 \)-measurable function \( Y \) such that \( 𝔼(Y 𝟙_A) = 𝔼(X 𝟙_A) \ ∀ A ∈ 𝒢 \). We denote \( 𝔼(X | 𝒢) = Y \). The general case \( X ∈ L^1(Ω, ℱ, ℙ) \) can be handled by writing \( X = X_+ - X_- \).
  • (Conditional probability) Define \( ℙ(E | 𝒢) = 𝔼(𝟙_E | 𝒢) \).
  • Note: the conditional expectation (and hence the conditional probability) is a RV. Heuristically, it is the \( 𝒢 \)-measurable RV "closest" to the original RV. In this sense, the conditional expectation is like a projection. This can be seen by the property \( 𝔼(𝔼(X | 𝒢) | 𝒢) = 𝔼(X | 𝒢) \). In fact, if \( X ∈ L^2(Ω, ℱ, ℙ) \), then \( 𝔼(X | 𝒢) \) is indeed the orthogonal projection onto the subspace \( L^2(Ω, 𝒢, ℙ) \).
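On a finite \( Ω \) with \( 𝒢 \) generated by a partition, the abstract definition collapses to block-wise averaging; a small sketch with a fair die and \( 𝒢 = σ(\text{parity}) \) (names hypothetical):

```python
from fractions import Fraction

# 𝔼(X | 𝒢) on a finite Ω: when 𝒢 is generated by a partition, the
# conditional expectation is the block-wise average of X (w.r.t. ℙ).
omega = [1, 2, 3, 4, 5, 6]                      # a fair die
P = {w: Fraction(1, 6) for w in omega}
X = {w: w for w in omega}                       # X(ω) = ω
partition = [{1, 3, 5}, {2, 4, 6}]              # 𝒢 = σ(parity)

def cond_exp(X, block):
    pb = sum(P[w] for w in block)
    return sum(X[w] * P[w] for w in block) / pb

Y = {w: cond_exp(X, B) for B in partition for w in B}
print(Y[1], Y[2])  # 3 4  (average of odds, average of evens)

# defining property: 𝔼(Y 1_A) = 𝔼(X 1_A) for every A ∈ 𝒢
A = {2, 4, 6}
print(sum(Y[w] * P[w] for w in A) == sum(X[w] * P[w] for w in A))  # True
```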

Stochastic processes

  • A set of RVs, indexed by an ordered set (e.g. \( ℕ \) or \( ℝ \)), is called a stochastic process.

  • Martingale: Let \( (Ω, ℱ, (ℱ_n), ℙ) \) be a filtered probability space, and let \( (X_n) \) be a stochastic process such that \( X_n \) is \( ℱ_n \)-measurable \( ∀n \) (i.e. \( (X_n) \) is adapted). Then \( (X_n) \) is called a martingale if \( ∀n, X_n ∈ L^1 \) and \( 𝔼(X_{n+1} | ℱ_n) = X_n \) a.s.
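The standard example is the simple symmetric random walk \( S_n = ξ_1 + \dots + ξ_n \) with IID \( ξ_i = ±1 \) fair, for which \( 𝔼(S_{n+1} | ℱ_n) = S_n + 𝔼(ξ_{n+1}) = S_n \). A sketch verifying this by exact enumeration of all length-4 paths (conditioning on the first 3 steps, a generating event of \( ℱ_3 \)):

```python
from fractions import Fraction
from itertools import product

# Simple symmetric random walk: check 𝔼(S_4 | first 3 steps) = S_3
# by averaging S_4 over all paths sharing a given prefix.
paths = list(product([-1, 1], repeat=4))
for prefix in product([-1, 1], repeat=3):
    matching = [p for p in paths if p[:3] == prefix]
    S4_avg = Fraction(sum(sum(p) for p in matching), len(matching))
    assert S4_avg == sum(prefix)   # martingale property at this prefix
print("martingale property verified on all 8 prefixes")
```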


Last update: 2020-07-21 17:41:22
