Diffusion Models
An exploration of the mathematical foundations of diffusion models, covering forward and reverse processes, ELBO, and training objectives.
Diffusion models approximate a data distribution $q(x)$ by gradually adding noise to the data through a forward process and learning to reverse this corruption via a generative reverse process. The goal is to produce samples $x^{(0)} \sim p_\theta(x^{(0)})$ that resemble data $x \sim q(x)$.
The forward process progressively transforms data into noise by defining a Markov chain:
\[q(x^{(1:T)} | x^{(0)}) = \prod_{t=1}^T q(x^{(t)} | x^{(t-1)})\]where each step adds Gaussian noise:
\[q(x^{(t)} | x^{(t-1)}) = \mathcal{N}(x^{(t)}; \sqrt{\alpha_t} x^{(t-1)}, \beta_t I)\]Here, $\beta_t \in (0, 1)$ sets the noise variance at step $t$, and $\alpha_t = 1 - \beta_t$ is the corresponding signal-retention factor. Over $T$ steps, the data distribution $q(x^{(0)})$ transitions to an approximate standard Gaussian: $q(x^{(T)}) \approx \mathcal{N}(0, I)$.
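For illustration, a single forward step can be written in a few lines of PyTorch; this is a minimal sketch, and the function name and treating $\beta_t$ as a plain float are conveniences of the sketch rather than anything prescribed above:

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t I), with alpha_t = 1 - beta_t."""
    alpha_t = 1.0 - beta_t
    noise = torch.randn_like(x_prev)              # fresh Gaussian noise at each step
    return (alpha_t ** 0.5) * x_prev + (beta_t ** 0.5) * noise
```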
The reverse process aims to denoise step-by-step, reconstructing data from noise:
\[p_\theta(x^{(0:T)}) = p(x^{(T)}) \prod_{t=1}^T p_\theta(x^{(t-1)} | x^{(t)})\]The reverse conditional $p_\theta(x^{(t-1)} | x^{(t)})$ is parameterised as a Gaussian whose mean and covariance are predicted by a neural network:
\[p_\theta(x^{(t-1)} | x^{(t)}) = \mathcal{N}\big(x^{(t-1)}; \mu_\theta(x^{(t)}, t), \Sigma_\theta(x^{(t)}, t)\big)\]
In practice, $\Sigma_\theta$ is often fixed to simplify training, leaving $\mu_\theta$ as the primary learnable component.
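To make the fixed-variance case concrete, here is a minimal sketch of a single reverse step in PyTorch; `mu_net` stands in for a network predicting $\mu_\theta$ and `sigma_t` for the fixed standard deviation at step $t$, both illustrative names rather than anything defined above:

```python
import torch

@torch.no_grad()
def reverse_step(mu_net, x_t: torch.Tensor, t: int, sigma_t: float) -> torch.Tensor:
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I),
    with the covariance fixed rather than learned."""
    mean = mu_net(x_t, t)                         # mu_theta(x_t, t), predicted by a neural network
    if t == 1:                                    # last step: return the mean as x^(0)
        return mean
    return mean + sigma_t * torch.randn_like(x_t)
```

A common choice is to fix $\sigma_t^2$ to $\beta_t$, the forward-process variance at the same step.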
To train the model, we maximise a variational lower bound (ELBO) on the data log-likelihood. Using the forward process as the variational posterior, the ELBO is:
\[\log p_\theta(x^{(0)}) \geq \mathbb{E}_{q(x^{(1:T)} | x^{(0)})}\left[\log \frac{p_\theta(x^{(0:T)})}{q(x^{(1:T)} | x^{(0)})}\right]\]Substituting the Markov chain formulations of $q$ and $p_\theta$, the ELBO expands to:
\[\mathbb{E}_{q}\left[\log p_\theta(x^{(0)} | x^{(1)}) - \sum_{t=2}^T \text{KL}\big(q(x^{(t-1)} | x^{(t)}, x^{(0)}) \,\|\, p_\theta(x^{(t-1)} | x^{(t)})\big) - \text{KL}\big(q(x^{(T)} | x^{(0)}) \,\|\, p(x^{(T)})\big)\right]\]The final KL term compares the fully noised data with the Gaussian prior and contains no learnable parameters, while the per-step KL terms push each learned reverse transition towards the true posterior of the forward process.
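A useful fact behind these KL terms: because the forward process is Gaussian, its posterior conditioned on $x^{(0)}$ is itself Gaussian with a closed form (a standard result, written here using $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, which is introduced below):
\[q(x^{(t-1)} | x^{(t)}, x^{(0)}) = \mathcal{N}\big(x^{(t-1)}; \tilde{\mu}_t(x^{(t)}, x^{(0)}), \tilde{\beta}_t I\big), \quad \tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x^{(0)} + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x^{(t)}, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t\]Since both $q(x^{(t-1)} | x^{(t)}, x^{(0)})$ and $p_\theta(x^{(t-1)} | x^{(t)})$ are Gaussians, every KL term has a closed-form expression.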
The forward process allows direct sampling of any step $t$ given the clean data $x^{(0)}$. This property avoids the need to sample every intermediate step:
\[q(x^{(t)} | x^{(0)}) = \mathcal{N}\big(x^{(t)}; \sqrt{\bar{\alpha}_t} x^{(0)}, (1 - \bar{\alpha}_t) I\big)\]where $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Rewriting this, $x^{(t)}$ is expressed as:
\[x^{(t)} = \sqrt{\bar{\alpha}_t} x^{(0)} + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]This reparameterisation simplifies the reverse process, as $\mu_\theta$ can be redefined in terms of the original data $x^{(0)}$ and added noise $\epsilon$.
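As a concrete PyTorch sketch (assuming a linear $\beta_t$ schedule and zero-based indexing of the steps, both illustrative choices), sampling $x^{(t)}$ directly from $x^{(0)}$ looks like this:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # illustrative linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def sample_xt(x0: torch.Tensor, t: int):
    """Draw x_t ~ q(x_t | x_0) in one shot: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    eps = torch.randn_like(x0)                    # eps ~ N(0, I)
    x_t = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps
    return x_t, eps                               # eps doubles as the regression target later
```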
For training, the reverse process is reformulated as predicting the noise $\epsilon$ used in the forward process. The simplified objective becomes:
\[L(\theta) = \mathbb{E}_{x^{(0)}, \epsilon, t}\left[\| \epsilon - \epsilon_\theta(x^{(t)}, t) \|^2\right]\]where $t$ is sampled uniformly from $\{1, \dots, T\}$, $x^{(t)}$ is built from $x^{(0)}$ and $\epsilon$ via the closed form above, and $\epsilon_\theta$ is a neural network that predicts the noise. Intuitively, the network learns to identify and remove the noise present in $x^{(t)}$.
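Combined with the closed-form sampling of $x^{(t)}$, a training step reduces to a few lines. The sketch below reuses `T` and `alphas_bar` from the schedule above; `eps_model` is a placeholder for any network that takes $(x^{(t)}, t)$:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0: torch.Tensor) -> torch.Tensor:
    """Simplified objective: regress the noise added at a uniformly sampled step t."""
    t = torch.randint(0, T, (x0.shape[0],))                   # one random step per example
    eps = torch.randn_like(x0)
    abar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))    # broadcast over data dims
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return F.mse_loss(eps_model(x_t, t), eps)                 # || eps - eps_theta(x_t, t) ||^2
```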
At test time, sampling involves starting from $x^{(T)} \sim \mathcal{N}(0, I)$ and iteratively applying the learned reverse process:
\[x^{(t-1)} \sim p_\theta(x^{(t-1)} | x^{(t)})\]for $t = T, \dots, 1$, yielding a final sample $x^{(0)}$ that is approximately distributed according to $q(x)$.
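Putting the pieces together, a minimal ancestral sampling loop might look like the sketch below; it reuses the schedule from the earlier sketch, writes the reverse mean in terms of the predicted noise, and fixes the reverse variance to $\beta_t$ (one common choice):

```python
import torch

@torch.no_grad()
def sample(eps_model, shape):
    """Start from x_T ~ N(0, I) and apply the learned reverse transitions down to x_0."""
    alphas = 1.0 - betas
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, torch.full((shape[0],), t))    # epsilon_theta(x_t, t)
        mean = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        x = mean if t == 0 else mean + betas[t].sqrt() * torch.randn_like(x)
    return x                                                  # approximate sample from q(x)
```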
Much of the practical appeal of diffusion models comes from this structure: every transition is a simple Gaussian, and training reduces to a stable, reparameterised noise-prediction regression.