Markov Chain Monte Carlo


1. Motivation

Statistical inference consists in learning about what we do not observe based on what we observe.

1.1. Monte Carlo Method

In its general form, Monte Carlo estimates the following expectation:

\[\begin{aligned}\int_{x} P(x) f(x) d x & \approx \frac{1}{S} \sum_{1 \leq s \leq S} f\left(x^{(s)}\right) \quad \text{(Monte Carlo)}\\x^{(s)} & \sim P(x)\end{aligned}\]

The only tricky issue is that the randomness involved is the pseudo-randomness of computer simulation, rather than randomness of real-world phenomena. Thus it is a good idea to use terminology that emphasizes the difference.
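For instance, here is a minimal sketch of the estimator above (my own toy illustration, not tied to any particular model): estimating $\mathbb{E}[f(x)]$ for $f(x) = x^2$ under a standard normal $P(x)$, where the exact answer is $1$.

```python
import numpy as np

# Monte Carlo estimate of E[f(x)] with f(x) = x**2 and P(x) = N(0, 1);
# the exact value is Var(x) = 1, so the estimate should be close to 1.
rng = np.random.default_rng(0)
S = 100_000
x = rng.normal(size=S)       # x^(s) ~ P(x)
estimate = (x ** 2).mean()   # (1/S) * sum_s f(x^(s))
print(estimate)              # ~1.0, up to Monte Carlo error
```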

1.2. Bayesian Inference

Bayes rule:

\[\begin{aligned}P(\theta|D) &= \frac{P(D|\theta)P(\theta)}{P(D)}\\ P(\theta|D) &= \frac{P(D|\theta)P(\theta)}{\int_{\theta} P(D|\theta)P(\theta)\,d\theta} \end{aligned}\quad \text{(Posterior Distribution)}\]
The problem here is the integral in the denominator. We can choose the prior and likelihood distributions so that they are conjugate; with conjugate distributions, the marginal likelihood has a closed form and, consequently, the posterior can be computed. However, in most cases the model is too complicated ⇒ it is intractable to compute the marginal likelihood in high dimensions ⇒ we approximate $\int_{\theta} P(D\mid\theta)P(\theta)\,d\theta$ with Monte Carlo, using $\theta^{(i)} \sim P(\theta)$:
\[\int_{\theta} P(D|\theta)P(\theta) \simeq \frac{1}{n}\sum_{1\leq i \leq n}P(D|\theta^{(i)}) \quad \text{(Marginal Likelihood Distribution)}\]
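As a concrete sketch (my own toy example, not part of the original derivation), consider a conjugate Beta-Bernoulli model where the exact marginal likelihood is known, so the prior-sample Monte Carlo estimate can be checked against it; the names `k`, `m`, and the specific numbers below are purely illustrative.

```python
import math
import numpy as np

# Toy conjugate example: theta ~ Beta(1,1) prior and D = one fixed sequence of
# m coin flips with k heads, so P(D | theta) = theta^k * (1 - theta)^(m - k).
rng = np.random.default_rng(0)
k, m = 7, 10

theta_s = rng.uniform(0, 1, size=200_000)                    # theta^(i) ~ P(theta)
mc_estimate = (theta_s**k * (1 - theta_s)**(m - k)).mean()   # (1/n) sum_i P(D | theta^(i))
exact = math.factorial(k) * math.factorial(m - k) / math.factorial(m + 1)  # Beta(k+1, m-k+1)
print(mc_estimate, exact)                                    # both ~7.6e-4
```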

After approximating the posterior distribution of the parameters, we can predict the probability of observing a new sample $x'$ by marginalizing over $\theta$:

\[P\left(x^{\prime} \mid D\right)=\int_{\theta} P(\theta \mid D) P\left(x^{\prime} \mid \theta\right) d \theta \quad \text{(Posterior Predictive Distribution)}\]

This distribution is obtained by marginalizing the likelihood of $x'$ over the posterior, hence the name posterior predictive distribution.

When the model is simple and conjugate priors are used, the posterior and the above integral can be solved analytically.

Otherwise, we can use the Monte Carlo method to approximate the above probability by taking samples $\theta^{(i)}$ from the posterior distribution, i.e., $\theta^{(i)} \sim P\left(\theta \mid D \right)$:

\[\int_{\theta} P(\theta \mid D) P\left(x^{\prime} \mid \theta\right) d \theta \simeq \frac{1}{n} \sum_{1 \leq i \leq n} P\left(x^{\prime} \mid \theta^{(i)}\right)\]
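Continuing the same toy Beta-Bernoulli example (again only a sketch: in the conjugate case the posterior is $\text{Beta}(k+1, m-k+1)$ and can be sampled directly with NumPy, which is exactly the step MCMC replaces when no closed form exists), the exact predictive probability of heads is $(k+1)/(m+2)$.

```python
import numpy as np

# Posterior predictive by Monte Carlo: theta^(i) ~ P(theta | D) = Beta(k+1, m-k+1),
# and P(x' = heads | theta) = theta, so the estimate is the mean of the draws.
rng = np.random.default_rng(0)
k, m = 7, 10

theta_s = rng.beta(k + 1, m - k + 1, size=100_000)   # theta^(i) ~ P(theta | D)
predictive = theta_s.mean()                          # (1/n) sum_i P(x'=1 | theta^(i))
print(predictive, (k + 1) / (m + 2))                 # ~0.667 vs exact 2/3
```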

1.3. Markov Chains

A sequence $x_1, x_2, \ldots$ of random elements of some set is a Markov chain if the conditional distribution of $x_{n+1}$ given $x_1,\ldots, x_n$ depends on $x_n$ only. The set in which the $x_i$ take values is called the state space of the Markov chain. A Markov chain has stationary transition probabilities if the conditional distribution of $x_{n+1}$ given $x_n$ does not depend on $n$. This is the main kind of Markov chain of interest in MCMC. Some kinds of adaptive MCMC have non-stationary transition probabilities.

The joint distribution of a Markov chain is determined by:

  • The marginal distribution of $x_1$, called the initial distribution $\pi^{(1)}(x)$
  • The conditional distribution $P(x_{n+1}\mid x_n)$, called the transition probability distribution
\[P(x_{n+1}|x_n, x_{n-1}, \ldots,x_1) = P(x_{n+1}|x_n)\]
If the $x_i$ take values in a discrete state space $\Omega$ with cardinality $\lvert\Omega\rvert=N$, we can fully describe the transition probability distribution by a stochastic transition matrix as follows:
\[A^{(n)} = \begin{pmatrix} P(x_{n+1}=\Omega_1|x_n=\Omega_1) & \ldots & P(x_{n+1}=\Omega_N|x_n=\Omega_1)\\ \vdots & \ddots & \vdots \\ P(x_{n+1}=\Omega_1|x_n=\Omega_N) & \ldots & P(x_{n+1}=\Omega_N|x_n=\Omega_N) \end{pmatrix}\]
Notice that $A^{(n)}_{ij} = P(x_{n+1}=\Omega_j\mid x_n=\Omega_i)$ and $\sum_{j=1}^{N} A^{(n)}_{ij}=1$ for every row $i$.

The Markov chain might have a limiting (stationary) distribution $\pi^{*}(x)$:

\[\pi^{*}_{j}=\lim _{n \rightarrow \infty} P\left(x_{n}=\Omega_j \mid x_{0}=\Omega_i\right)\]

This is thanks to the Perron-Frobenius theorem, which is stated as follows:

Let $A$ be a real matrix with all its elements positive, $A_{ij} > 0$. Then the top eigenvalue $\lambda_{\max}$ is unique and real (all other eigenvalues have a smaller real part), and the corresponding top eigenvector $\mathbf{v}^{*}$ has all its elements positive:

$A \mathbf{v}^{*}=\lambda_{\max } \mathbf{v}^{*}$, with $\mathbf{v}_{k}^{*}>0$ for all $k$. Moreover, the top eigenvalue satisfies the following inequalities:

\[\min _{i} \sum_{j} A_{i j} \leq \lambda_{\max } \leq \max _{i} \sum_{j} A_{i j} .\]

The transition matrix has an eigenvalue of $1$, which helps us find the stationary probability distribution of the Markov states $\pi^{*}(x)$. Specifically, if $A^{(n)}$ is a stochastic matrix of the Markov chain, we must have $\sum_{j=1}^{N} A^{(n)}_{ij}=1$, which makes $\lambda_{\max}=1$. Then let $P^{*}_j = P^{*}(x=\Omega_j) \propto \mathbf{v}^{*}_j$, where $\mathbf{v}^{*}$ is the corresponding left eigenvector (an eigenvector of $A^{\mathsf{T}}$, since the stationary distribution multiplies $A$ from the left); this is our stationary distribution:

\[P_{j}^{*}=\sum_{i} A_{i j} P_{i}^{*} \longrightarrow P_{j}^{*}=\frac{\mathbf{v}_{j}^{*}}{\sum_{k} \mathbf{v}_{k}^{*}}\]

or, equivalently, by iterating the chain: $\pi^{(k)}(x) = \pi^{(0)}(x)A^{k}$ for a homogeneous chain with transition matrix $A$.
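As a small numerical sketch (the 3-state matrix below is made up for illustration), the stationary distribution can be recovered either from the left eigenvector of $A$ for eigenvalue $1$ or by iterating $\pi^{(k)} = \pi^{(0)}A^{k}$.

```python
import numpy as np

# A made-up 3-state stochastic transition matrix (rows sum to 1).
A = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution as the left eigenvector for eigenvalue 1:
# solve v A = v, i.e. A^T v = v, then normalize the entries to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi_star = v / v.sum()

# Cross-check by iterating pi^(k) = pi^(0) A^k from an arbitrary start.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    pi = pi @ A
print(pi_star, pi)   # the two vectors agree
```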

To generalize this definition, we denote the transition operator as:

\[T_n(x_n\rightarrow x_{n+1}) = P(x_{n+1}|x_n)\]

In MCMC, the Markov chain is usually a homogeneous Markov chain, whose transition function is the same for all $n$:

\[T(x' \rightarrow x) = T_n(x_n\rightarrow x_{n+1}) = T_{n+1}(x_{n+1}\rightarrow x_{n+2})\]

If we use the transition operator, the definition of stationary (invariant) distribution becomes:

\[\begin{equation}\pi(x)=\sum_{x^{\prime}} \pi\left(x^{\prime}\right) T\left(x^{\prime} \rightarrow x\right) \end{equation}\]

1.4. Properties of Markov Chains

To engineer a Markov chain, we need some useful asymptotic properties of Markov chains.

Definition of irreducible Markov chain:

We say that a Markov chain is irreducible if, for each $i$ and $j$, there exists a $k$ (possibly depending on $i$ and $j$) such that:

\[P(x_{n+k}=\Omega_i\mid x_{n}=\Omega_j) > 0 \quad \text{for finite k}\]

In words, a chain is irreducible if it is possible to eventually get from any state $i$ to any other state $j$ in a finite number of steps.

Definition of aperiodic Markov chain:

An irreducible Markov chain is called aperiodic if its period is equal to $1$, or equivalently, if for every state $\Omega_i$ the greatest common divisor of $\{k \geq 1 : P(x_{n+k}=\Omega_i \mid x_{n}=\Omega_i) > 0\}$ is $1$.

Theorem: An irreducible Markov chain has at least one non-null persistent state $\Leftrightarrow$ the Markov chain has a (unique) stationary distribution $\pi(x)$ [7, 8, 9]

Theorem: A finite (hence at least one state is non-null persistent [9]), irreducible Markov chain has a unique stationary distribution $\pi(x)$ [4]

Theorem: A finite, irreducible, aperiodic Markov chain has a limiting distribution equal to its unique stationary distribution, $\pi^{*}(x)=\pi(x)$ [6]

Theorem: A non-null persistent, irreducible, aperiodic Markov chain has a limiting distribution equal to its unique stationary distribution, $\pi^{*}(x) = \pi(x)$
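A tiny numerical illustration of why aperiodicity matters (both matrices below are my own toy examples): the two-state chain that deterministically flips state is irreducible with stationary distribution $(0.5, 0.5)$, but its marginal distribution keeps oscillating and never converges; adding a small self-transition probability makes it aperiodic, and the limiting distribution then equals the stationary one.

```python
import numpy as np

# Periodic chain: always flips state. Irreducible, stationary dist = (0.5, 0.5),
# but pi^(n) oscillates between (1, 0) and (0, 1), so there is no limit.
A_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])

# Aperiodic version: a small probability of staying put breaks the periodicity.
A_aperiodic = np.array([[0.1, 0.9],
                        [0.9, 0.1]])

pi0 = np.array([1.0, 0.0])
for name, A in [("periodic", A_periodic), ("aperiodic", A_aperiodic)]:
    pi = pi0.copy()
    for _ in range(1001):
        pi = pi @ A
    print(name, pi)   # periodic: still a point mass; aperiodic: ~(0.5, 0.5)
```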

2. How To Sample?

The only remaining problem is how to sample $\theta^{(i)} \sim P(\theta)$ or $\theta^{(i)} \sim P\left(\theta \mid D \right)$. There are some standard methods as well as more intelligent, efficient methods.

2.1. Inverse Transform Sampling

Inverse transform sampling is a straightforward method of generating random samples from any distribution by using its inverse cumulative distribution function (ICDF) $F^{-1}(u)$. It is easy to note that

\[F_{X}(x) = P(X\leq x) \in [0,1] \quad\]

Source: [https://en.wikipedia.org/wiki/Inverse_transform_sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling)


By sampling a uniformly distributed random variable $U$ as a value of the cumulative distribution function and inverting that value, we obtain the desired sample $x$. Here is the proof that the inverse-transformed uniform random variable has the same CDF as the desired distribution. Let $U \sim \text{Uniform}(0,1)$, $F^{-1}(u) = \inf\{x : F(x) \geq u\}$ and $P(U\leq u) = u$:
\[\begin{aligned}&P\left(F^{-1}(U) \leq x\right) \\&=P(U \leq F(x)) \\&=F(x)\end{aligned}\]

The intuition behind inverse transform sampling is simple: re-scale a uniform random variable so that it has the distribution we want.
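A minimal sketch of this idea (the exponential target below is my own example): for an $\text{Exponential}(\lambda)$ distribution the ICDF is available in closed form, $F^{-1}(u) = -\log(1-u)/\lambda$, so a uniform draw can be transformed directly into an exponential draw.

```python
import numpy as np

# Inverse transform sampling for Exponential(lam): F(x) = 1 - exp(-lam * x),
# so F^{-1}(u) = -log(1 - u) / lam.
rng = np.random.default_rng(0)
lam = 2.0

u = rng.uniform(0, 1, size=100_000)   # U ~ Uniform(0, 1)
x = -np.log(1 - u) / lam              # x = F^{-1}(U) ~ Exponential(lam)
print(x.mean(), 1 / lam)              # sample mean should be close to 1/lam
```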

2.2. Acceptance-Rejection Sampling

Rejection sampling uses a proxy proposal distribution $G$ with density $g(x)$ to generate samples $x$, together with an acceptance-rejection criterion that filters out unwanted samples, so that the accepted samples follow our desired distribution $F$ with density $f(x)$. We need a constant $c$ such that $f(x) \leq c\,g(x)$ for all $x$. The algorithm is described as follows:

  1. Sample an observation $x$ from $G$ and a uniformly distributed sample $u \sim \text{Uniform}\left(0,1\right)$
  2. Check whether $u \leq \frac{f(x)}{cg(x)}$. If it is true, accept this sample as a sample from $F$; otherwise, reject it and return to step 1.

On average, the algorithm takes $c$ iterations to produce one sample from $F$.

Proof

We need to show that the conditional probability $P\left(X\leq x \mid U \leq \frac{f(X)}{cg(X)}\right) = F(x)$. First, we compute the reverse conditional probability:

\[\begin{aligned} P\left(U \leq \frac{f(X)}{c g(X)} \mid X \leq x\right) &=\frac{P\left(U \leq \frac{f(X)}{c g(X)}, X \leq x\right)}{G(x)} \\ &=\int_{-\infty}^{x} \frac{P\left(U \leq \frac{f(w)}{c g(w)} \mid X=w\right)}{G(x)} g(w) d w \\ &=\frac{1}{G(x)} \int_{-\infty}^{x} \frac{f(w)}{c g(w)} g(w) d w \\ &=\frac{1}{c G(x)} \int_{-\infty}^{x} f(w) d w \\ &=\frac{F(x)}{c G(x)} \end{aligned}\]

Using Bayes' theorem with the fact that $P\left(U \leq \frac{f(X)}{c g(X)}\right) = \int_{x} P\left(U\leq \frac{f(x)}{cg(x)}\mid X=x\right) g(x)\, dx = \int_{x} \frac{f(x)}{cg(x)} \times g(x)\, dx = \frac{1}{c} \int_{x} f(x)\,dx = \frac{1}{c}$, we get

\[P\left(X\leq x \mid U \leq \frac{f(X)}{cg(X)}\right) = \frac{P\left(U \leq \frac{f(X)}{c g(X)} \mid X \leq x\right)P(X\leq x)}{P\left(U \leq \frac{f(X)}{c g(X)}\right)} = \frac{\frac{F(x)}{c G(x)} \times G(x)}{\frac{1}{c}} = F(x) \quad \blacksquare\]
import matplotlib.pyplot as plt
import numpy as np

def f(x):
    # Unnormalized target density on [0, 1]
    return 1.2 - x**4

def batch_sample(function, num_samples, xmin=0, xmax=1, ymax=1.2, batch=1000):
    # Vectorized rejection sampling: propose `batch` points at a time
    samples = []
    while len(samples) < num_samples:
        x = np.random.uniform(low=xmin, high=xmax, size=batch)
        y = np.random.uniform(low=0, high=ymax, size=batch)
        samples += x[y < function(x)].tolist()
    return samples[:num_samples]

def sample(function, xmin=0, xmax=1, ymax=1.2):
    # Rejection sampling, one sample at a time
    while True:
        x = np.random.uniform(low=xmin, high=xmax)
        y = np.random.uniform(low=0, high=ymax)
        if y < function(x):
            return x

def gauss(x):
    # A second target: a Gaussian bump with unit area
    return np.exp(-np.pi * x**2)

def batch_sample_2(function, num_samples, xmin=-10, xmax=10, ymax=1):
    # Keep every proposal and record whether it was accepted (for plotting)
    x = np.random.uniform(low=xmin, high=xmax, size=num_samples)
    y = np.random.uniform(low=0, high=ymax, size=num_samples)
    passed = (y < function(x)).astype(int)
    return x, y, passed

# Plot the (unnormalized) target f on [0, 1]
xs = np.linspace(0, 1, 1000)
ys = f(xs)

plt.plot(xs, ys, label="Function")
plt.fill_between(xs, ys, 0, alpha=0.2)
plt.xlim(0, 1), plt.ylim(0, 1.25), plt.xlabel("x"), plt.ylabel("y"), plt.legend();

# Draw 10,000 samples one at a time and compare the histogram with f
samps = [sample(f) for i in range(10000)]

plt.plot(xs, ys, label="Function")
plt.hist(samps, density=True, alpha=0.2, label="Sample distribution")
plt.xlim(0, 1), plt.ylim(0, 1.4), plt.xlabel("x"), plt.ylabel("y"), plt.legend();

# The same comparison with the vectorized batch sampler
samps = batch_sample(f, 10000)

plt.plot(xs, ys, label="Function")
plt.hist(samps, density=True, alpha=0.2, label="Sample distribution")
plt.xlim(0, 1), plt.ylim(0, 1.4), plt.xlabel("x"), plt.ylabel("y"), plt.legend();

# Compare the speed of the two samplers
%timeit [sample(f) for i in range(10000)]

%timeit batch_sample(f, 10000)

# Plot the second target, the Gaussian bump, on [-10, 10]
xs = np.linspace(-10, 10, 1000)
ys = gauss(xs)

plt.plot(xs, ys)
plt.fill_between(xs, ys, 0, alpha=0.2)
plt.xlabel("x"), plt.ylabel("f(x)")

# Rejection sampling for the Gaussian with a wide uniform proposal:
# most proposals miss the narrow bump, so the efficiency is low
x, y, passed = batch_sample_2(gauss, 10000)

plt.plot(xs, ys)
plt.fill_between(xs, ys, 0, alpha=0.2)
plt.scatter(x, y, c=passed, cmap="RdYlGn", vmin=-0.1, vmax=1.1, lw=0, s=2)
plt.xlabel("x"), plt.ylabel("y"), plt.xlim(-10, 10), plt.ylim(0, 1);

print(f"Efficiency is only {passed.mean() * 100:0.1f}%")

2.3. Markov Chain Monte Carlo

We need a Markov chain to have the following conditions:

  1. Having the desired sampling distribution as the Markov chain's stationary distribution, $P(x) = \pi(x)$. This condition can be satisfied by constraining the transition operator with a condition called the detailed balance condition.
  2. From any starting point, we can reach the stationary distribution and visit all possible states. This condition is known as ergodicity.

These conditions guarantee that realizations of a single Markov chain can approximate the integrals without sampling from the actual distribution $P(x)$ directly. They are justified by the following fundamental theorem.

Fundamental Theorem for Designing a Chain with a Known Invariant Distribution $\pi(x)$ [1]

If a homogeneous Markov chain on a finite state space with transition operator $T$ has $\pi$ as an invariant distribution and

\[\begin{equation}\nu=\min _{x} \min _{x^{\prime}: \pi\left(x^{\prime}\right)>0} T\left(x \rightarrow x^{\prime}\right) / \pi\left(x^{\prime}\right)>0\end{equation}\]

then the Markov chain is ergodic, that is:

  1. If $f(x)$ is a real-valued function on the state space of the Markov chain, then its expectation with respect to the distribution $P^{(n)}(x)$, written $\mathbb{E}_{n}[f]:=\sum_{x} f(x) P^{(n)}(x)$, converges to its expectation with respect to $\pi$, written $\mathbb{E}[f]:=\sum_{x} f(x) \pi(x)$:

    \[\lim _{n \rightarrow \infty} \mathbb{E}_{n}[f]=\mathbb{E}[f] \quad \text{(Irreducible)}\]
  2. For all $x$, regardless of the initial distribution $P^{(1)}(x)$, the limiting distribution is our unique invariant distribution:

    \[\lim _{n \rightarrow \infty} P^{(n)}(x)=\pi(x) \quad \text{(Aperiodic)}\]

The detailed balance condition is stated as follows: a transition operator $T$ satisfies detailed balance if there exists a distribution $\pi(x)$ such that the equation below holds; such a $\pi(x)$ is then a stationary distribution $\pi^{*}(x)$ of the chain.

\[\pi(x) T\left(x \rightarrow x^{\prime}\right)=\pi\left(x^{\prime}\right) T\left(x^{\prime} \rightarrow x\right)\]

Therefore, we only need to find a transition operator $T$ that satisfies

\[\begin{equation}P(x) T\left(x \rightarrow x^{\prime}\right)=P\left(x^{\prime}\right) T\left(x^{\prime} \rightarrow x\right)\end{equation}\]

Remember, this is a sufficient condition for making the desired distribution the stationary distribution; if the stationary distribution of the chain is not unique, or the stationary distribution is not the limiting distribution, then the detailed balance condition alone is not useful. Fortunately, ergodicity implies that the homogeneous Markov chain has a unique stationary distribution (irreducibility) and that it is also the limiting distribution (aperiodicity).
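To make this concrete, here is a small numerical check (the 3-state target $\pi$ below is made up; the transition matrix is built with a uniform proposal plus a Metropolis-style acceptance, so it satisfies detailed balance by construction) that detailed balance indeed implies $\pi$ is stationary.

```python
import numpy as np

# A made-up 3-state target distribution pi and a transition matrix T built to
# satisfy detailed balance (uniform proposal to the other two states,
# Metropolis acceptance min(1, pi_j / pi_i)).
pi = np.array([0.2, 0.3, 0.5])

T = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            T[i, j] = 0.5 * min(1.0, pi[j] / pi[i])   # q(i -> j) * A(i -> j)
    T[i, i] = 1.0 - T[i].sum()                        # probability of staying put

# Detailed balance: pi_i T_ij == pi_j T_ji for all i, j.
print(np.allclose(pi[:, None] * T, (pi[:, None] * T).T))   # True

# Consequently pi is stationary: pi T = pi.
print(np.allclose(pi @ T, pi))                              # True
```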

2.4. Metropolis-Hastings Algorithm

Let $P(x)$ be the desired distribution to sample from; we wish to have $P(x)$ as the stationary distribution of our MCMC chain. A proposal distribution randomly walks from a point $x$ to another point $x'$ with probability $q(x'\mid x)$, and the Metropolis-Hastings algorithm accepts the new point preferentially when it yields a high unnormalized value $\propto P(x')$. Metropolis-Hastings:

  1. Sample $x^{(0)}$ from an arbitrary probability distribution
  2. for $n = 1,2, \ldots, N$ do
    1. Propose a random walk $x' \sim q(x'\mid x^{(n)}) = q(x^{(n)}\rightarrow x')$
    2. The acceptance rate of the new point $x'$ is $a = A(x^{(n)}\rightarrow x') = \min \left( 1, \frac{P(x')q(x'\rightarrow x^{(n)})}{P(x^{(n)})q(x^{(n)}\rightarrow x')}\right)$
    3. Sample $u \sim \text{Uniform}(0,1)$
    4. If $u \leq a$ then accept the random-walk point as the new sample, $x^{(n+1)} = x'$; otherwise keep the old point, $x^{(n+1)} = x^{(n)}$
  3. Collect the $N$ points $x^{(n)}$ to form an $N$-sample from the distribution $P(x)$

We now write $x$ in place of $x^{(n)}$ to denote a generic current point. Note that in the term $\min \left( 1, \frac{P(x')q(x'\rightarrow x)}{P(x)q(x\rightarrow x')}\right)$ we do not need to compute the exact normalized density, which is computationally intractable in the first place; we can replace it with the unnormalized term $\tilde{P}$, where $P(x) = Z_p \tilde{P}(x)$, without any difference, because the constant $Z_p$ cancels between numerator and denominator:

\[A (x\rightarrow x') = \min \left( 1, \frac{P(x')q(x'\rightarrow x)}{P(x)q(x\rightarrow x')}\right) = \min \left(1, \frac{Z_p\tilde{P}(x')q(x'\rightarrow x)}{Z_p\tilde{P}(x)q(x\rightarrow x')} \right) = \min \left( 1, \frac{\tilde{P}(x')q(x'\rightarrow x)}{\tilde{P}(x)q(x\rightarrow x')}\right)\]
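Here is a minimal random-walk Metropolis sketch (the bimodal target `p_tilde`, the step size, and all other names are my own illustrative choices, not from the original derivation); because the Gaussian random-walk proposal is symmetric, the acceptance ratio reduces to the ratio of unnormalized densities.

```python
import numpy as np

def p_tilde(x):
    # Unnormalized bimodal target: two Gaussian bumps at -2 and +2.
    # The normalizing constant is never needed.
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(p_tilde, n_samples=50_000, step=2.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for n in range(n_samples):
        x_prop = x + rng.normal(scale=step)          # symmetric random-walk proposal
        a = min(1.0, p_tilde(x_prop) / p_tilde(x))   # q cancels; only the p_tilde ratio remains
        if rng.uniform() <= a:
            x = x_prop                               # accept the proposed point
        samples[n] = x                               # otherwise keep the old point
    return samples

samples = metropolis_hastings(p_tilde)
print(samples.mean(), samples.std())   # roughly 0 and sqrt(1 + 4) ≈ 2.24
```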

We will now show that Metropolis-Hastings satisfies the ergodicity conditions and the detailed balance condition, with the desired distribution $P(x)$ as its stationary distribution. First, the transition function of Metropolis-Hastings is

\[T\left(x \rightarrow x^{\prime}\right)= \begin{cases}q\left(x \rightarrow x^{\prime}\right) A\left(x \rightarrow x^{\prime}\right) & \text { if } x \neq x^{\prime} \\ q(x \rightarrow x)+\sum_{x^{\prime}, x^{\prime} \neq x} q\left(x \rightarrow x^{\prime}\right)\left(1-A\left(x \rightarrow x^{\prime}\right)\right) & \text { if } x=x^{\prime}\end{cases}\]

In the case $x = x'$, there are two possibilities: either the proposal yields exactly $x$, with probability $q(x\rightarrow x)$ and $a = 1$, or the proposal yields a different point $x'$ with probability $q(x\rightarrow x')$ but is rejected, with probability $1-A\left(x \rightarrow x^{\prime}\right)$. For $x' \neq x$ we have

\[\begin{aligned} P(x) T\left(x \rightarrow x^{\prime}\right) &=P(x) q\left(x \rightarrow x^{\prime}\right) \min \left(1, \frac{P\left(x^{\prime}\right) q\left(x^{\prime} \rightarrow x\right)}{P(x) q\left(x \rightarrow x^{\prime}\right)}\right) \\ &=\min \left(P(x) q\left(x \rightarrow x^{\prime}\right), P\left(x^{\prime}\right) q\left(x^{\prime} \rightarrow x\right)\right) \\ &=P\left(x^{\prime}\right) q\left(x^{\prime} \rightarrow x\right) \min \left(1, \frac{P(x) q\left(x \rightarrow x^{\prime}\right)}{P\left(x^{\prime}\right) q\left(x^{\prime} \rightarrow x\right)}\right) \\ &=P\left(x^{\prime}\right) T\left(x^{\prime} \rightarrow x\right) \end{aligned}\]

For $x=x’$, we only need to prove $T(x\rightarrow x) = T(x\rightarrow x)$ which is obvious.

This property of Metropolis-Hastings is exactly (3), the detailed balance condition. Using the fundamental theorem with condition (2) and stationary distribution $\pi(x) = P(x)$, we have

\[\begin{aligned}\nu&=\min _{x} \min _{x^{\prime}: \pi\left(x^{\prime}\right)>0} T\left(x \rightarrow x^{\prime}\right) / \pi\left(x^{\prime}\right) \\ &=\min _{x} \min _{x^{\prime}: \pi\left(x^{\prime}\right)>0} \frac{q(x\rightarrow x')\min \left( 1, \frac{P(x')q(x'\rightarrow x)}{P(x)q(x\rightarrow x')}\right)}{P(x')} \\ &= \min _{x} \min _{x^{\prime}: \pi\left(x^{\prime}\right)>0} \min \left(\frac{q(x\rightarrow x')}{P(x')}, \frac{q(x'\rightarrow x)}{P(x)} \right) >0 \end{aligned}\]

since the proposal density is always greater than $0$ and $P(x) \neq 0$ for every point $x$ the chain actually visits. Indeed, suppose $x^{(i)}$ were the first visited point with $P(x^{(i)})=0$; it could only be reached as a proposal $x'=x^{(i)}$ from some previous point $x$ with $P(x) > 0$, but such a proposal is accepted with probability $a= \min \left(1, \frac{P\left(x^{\prime}\right) q\left(x^{\prime} \rightarrow x\right)}{P(x) q\left(x \rightarrow x^{\prime}\right)} \right) = 0$, meaning $x'$ can never be accepted.

3. References

[1]

[2] Does a MCMC fulfilling detailed balance yields a stationary distribution?

[3] (ML 18.2) Ergodic theorem for Markov chains

[4]

[5]

[6] Stationary and Limiting Distributions

[7]

[8]

[9] [Chapter 9 Stationary Distribution of Markov Chain (Lecture on 02/02/2021), STAT 243: Stochastic Process](https://bookdown.org/jkang37/stochastic-process-lecture-notes/lecture09.html)