2023-10-11
Data: \mathcal{D}=\lbrace x_1,x_2,\cdots,x_n \rbrace (the x_i may repeat).
Sample from \mathcal{D} \iff pick one element of \mathcal{D} uniformly at random.
Example: \mathcal{D} = \lbrace 1,2,2,3 \rbrace.
Sample 15 elements from \mathcal{D} with replacement.
Example: Sample from \mathcal{N}(\mathbf{0},(0.01) \mathbf{I}) in \mathbb R^{32\times 32}.
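Both kinds of sampling are one-liners in practice; a minimal NumPy sketch (the seed and the printed checks are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling from the finite dataset D = {1, 2, 2, 3} with replacement:
# each draw picks one element uniformly at random, so the duplicated
# value 2 is twice as likely as 1 or 3.
D = np.array([1, 2, 2, 3])
print(rng.choice(D, size=15, replace=True))

# Sampling from N(0, (0.01) I) in R^{32x32}: i.i.d. Gaussian entries
# with standard deviation sqrt(0.01) = 0.1.
x = 0.1 * rng.standard_normal((32, 32))
print(x.shape, x.std())  # (32, 32), roughly 0.1
```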
Let X be a random vector in \mathbb R^n. The \color{red}{\textbf{density}} function of X is defined by \begin{aligned} \mathbf{P}\bigl[ X\in A \bigr] = \int_{x\in A} {\color{red}{f_X}}(x) \mathrm{d}x, \quad \forall A\in \mathcal{B}(\mathbb R^n). \end{aligned}
The conditional density of X given Y is defined by f_{X\vert Y}(x\vert y) = \dfrac{f_{X,Y}(x,y)}{f_Y(y)}.
Since \begin{aligned} f_{X,Y}(x,y) = f_{X}(x) \cdot f_{Y\vert X}(y\vert x), \end{aligned} we can sample (x,y)\sim f_{X,Y}(x,y) as follows:
1. Sample x\sim f_X(x).
2. Sample y\sim f_{Y\vert X}(y\vert x).
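For instance, with a toy joint law of our own choosing, X\sim \mathcal{N}(0,1) and Y\vert X=x \sim \mathcal{N}(x,0.25), the two-step recipe reads:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint():
    x = rng.standard_normal()            # step 1: x ~ f_X = N(0, 1)
    y = x + 0.5 * rng.standard_normal()  # step 2: y ~ f_{Y|X}(.|x) = N(x, 0.25)
    return x, y

pairs = np.array([sample_joint() for _ in range(10_000)])
print(pairs.mean(axis=0))  # both coordinates have mean ~ 0
```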
Notation: X_{t:0}:=(X_t,X_{t-1},\cdots,X_1,X_0); the analogous notation applies to x_{t:0}, X_{1:T}, etc.
Abbreviation: f_{X_t\vert X_{t-1:0}}(x_t\vert x_{t-1:0}) = q(x_t\vert x_{t-1:0}); the analogous abbreviations apply to the other conditional densities.
Table of Contents
- Denoising Diffusion Probabilistic Models (DDPM)
- To determine \mu_{\theta},\Sigma_{\theta}
- Training and Sampling
- Appendix
The name “diffusion” comes from the diffusion process in physics.
Let q(x_0) be the distribution of our data.
Recall that our goal is to construct p_{\theta}(x_0) such that \begin{aligned} p_{\theta}(x_0)\approx q(x_0). \end{aligned}
The simplest way to construct p_{\theta}(x_0) is as follows. Fix a variance schedule \beta_1,\cdots,\beta_T \in (0,1) and set \alpha_t := 1-\beta_t, \overline{\alpha}_t := \prod_{s=1}^t \alpha_s. Define the forward (noising) process by \begin{aligned} q(x_t\vert x_{t-1}) = \mathcal{N}\bigl( x_t; \sqrt{\alpha_t}\, x_{t-1}, \beta_t \mathbf{I} \bigr), \quad t=1,\cdots,T, \end{aligned} and the reverse (denoising) process by \begin{aligned} p_{\theta}(x_{0:T}) = p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}\vert x_t), \quad p(x_T)=\mathcal{N}(x_T;\mathbf{0},\mathbf{I}), \end{aligned} so that p_{\theta}(x_0) = \int p_{\theta}(x_{0:T})\, \mathrm{d}x_{1:T}.
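A minimal NumPy sketch of this forward chain; the linear schedule \beta_1=10^{-4},\cdots,\beta_T=0.02 with T=1000 follows the DDPM paper, while the 1-D toy data is our own choice. It also shows the equivalent one-shot form derived in (3) below:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule from the DDPM paper
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_chain(x0):
    """Iterate x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) eps_t for t = 1..T."""
    x = x0
    for t in range(T):
        x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

def forward_one_shot(x0, t):
    """Equivalent single-step sample: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = 5.0 + 3.0 * rng.standard_normal(10_000)   # toy 1-D data q(x_0)
print(forward_chain(x0).std(), forward_one_shot(x0, T - 1).std())
# Both are ~1: after T steps the chain is essentially N(0, I), whatever q(x_0) was.
```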
Although we don’t know what kind of distribution q(x_{t-1}\vert x_{t}) is, we can approximate it with a normal distribution when each step size \beta_t is small.
Why?
The process X_0,X_1,\cdots,X_T, under q, is a discretization of some process (\widetilde{X}_t)_{t\in [0,1]} which satisfies the SDE \begin{aligned} \mathrm{d}\widetilde{X}_t = \mu(\widetilde{X}_t,t) \mathrm{d}t + \sigma(t) \mathrm{d}B_t. \end{aligned}
Consider the reverse-time process (\overline{X}_t)_{t\in [0,1]} of (\widetilde{X}_t)_{t\in [0,1]}: \begin{aligned} \overline{X}_t := \widetilde{X}_{1-t}, \quad t \in [0,1]. \end{aligned} By Anderson’s theorem, (\overline{X}_t)_{t\in [0,1]} again satisfies an SDE of the same form, so each sufficiently small backward step has an approximately Gaussian transition.
By Bayes’ theorem and Taylor’s theorem (below, c collects all factors that do not depend on x_{t-1}), \begin{aligned} q(x_{t-1}\vert x_t) &= \frac{q(x_t\vert x_{t-1})q(x_{t-1})}{q(x_{t})} \cr &= \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1},\beta_t \mathbf{I}) \cdot \exp\bigl( \log q(x_{t-1}) - \log q(x_t) \bigr) \cr &\approx c \cdot \exp\Biggl\lbrace \underbrace{-\frac{\alpha_t}{2\beta_t} \Bigl( x_{t-1} - \frac{1}{\sqrt{\alpha_t}} x_t \Bigr)^2 + (x_{t-1}-x_t) \nabla_{x_t}\log q(x_t) }_{\text{A quadratic polynomial of }x_{t-1}\text{ with negative leading coefficient}} \Biggr\rbrace. \end{aligned}
Hence, for small \beta_t, q(x_{t-1} \vert x_t) can be approximated by a Gaussian distribution.
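A quick numerical sanity check of this claim, in a toy 1-D setup of our own: take a clearly non-Gaussian marginal q(x_{t-1}), one small diffusion step \beta_t, and compare the exact posterior q(x_{t-1}\vert x_t) with its moment-matched Gaussian:

```python
import numpy as np

# A clearly non-Gaussian marginal q(x_{t-1}): a two-component Gaussian mixture.
def q_marginal(u):
    npdf = lambda u, m, s: np.exp(-0.5 * ((u - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return 0.5 * npdf(u, -2.0, 0.7) + 0.5 * npdf(u, 2.0, 0.7)

beta = 1e-3               # one small diffusion step
alpha = 1.0 - beta
x_t = 1.5                 # the value we condition on

# Exact posterior q(x_{t-1} | x_t) on a grid, via Bayes' rule.
u = np.linspace(-6.0, 6.0, 20_001)
du = u[1] - u[0]
lik = np.exp(-0.5 * (x_t - np.sqrt(alpha) * u) ** 2 / beta)  # q(x_t | x_{t-1})
post = lik * q_marginal(u)
post /= post.sum() * du

# Moment-matched Gaussian fit of the posterior.
m = (u * post).sum() * du
s = np.sqrt(((u - m) ** 2 * post).sum() * du)
gauss = np.exp(-0.5 * ((u - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

print(0.5 * np.abs(post - gauss).sum() * du)  # total-variation gap, close to 0
```

The printed total-variation gap shrinks as \beta_t does, matching the Taylor argument above.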
To determine \mu_{\theta},\Sigma_{\theta}
Remark. \begin{aligned} p_{\theta}(x_{t-1}\vert x_t) = {\color{red}{\mathcal{N}\bigl(x_{t-1}; \mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t)\bigr)}}. \end{aligned}
The main purpose of the diffusion model is to learn a distribution p_{\theta}(x_0) such that p_{\theta}(x_0)\approx q(x_0).
One way is to minimize D_{\mathtt{KL}}(q(x_0) \,\Vert\, p_{\theta}(x_0)).
Our goal becomes to find \begin{aligned} \mu_{\theta}^*, \Sigma_{\theta}^* &= \arg \min_{\mu_{\theta},\Sigma_{\theta}} D_{\mathtt{KL}} \bigl( q(x_0) \big\Vert p_{\theta}(x_0) \bigr) \cr &= \arg \min_{\mu_\theta,\Sigma_\theta} \biggl( -\int q(x_0) \log \Bigl( \frac{p_{\theta}(x_0)}{q(x_0)} \Bigr) \mathrm{d}x_0 \biggr) \cr &= \arg \min_{\mu_{\theta},\Sigma_{\theta}} \biggl( \underbrace{-\int q(x_0) \log p_{\theta}(x_0) \mathrm{d}x_0}_{\color{blue}{\mathbb E_{X_0\sim q(x_0)}[-\log p_{\theta}(X_0)]}} \biggr). \end{aligned}
By the evidence lower bound (ELBO), \begin{aligned} -\log p_{\theta}(x_0) \leq \mathbb E_{X_{1:T}\sim q(x_{1:T} \vert x_0)} \Bigl[ -\log \frac{p_{\theta}(x_0,X_{1:T})}{q(X_{1:T}\vert x_0)} \Bigr]. \end{aligned} Hence, \begin{aligned} {\color{blue}{\mathbb E_{X_0\sim q(x_0)}[-\log p_{\theta}(X_0)]}} \leq \mathbb E_{X_{0:T}\sim q(x_{0:T})} \Bigl[ -\log \frac{p_{\theta}(X_{0:T})}{q(X_{1:T}\vert X_0)} \Bigr] := L.
Our goal becomes to minimize L.
Note that \begin{aligned} L &= \mathbb E_{X_0\sim q(x_0)} \biggl[ D_{\mathtt{KL}} \Bigl( \underline{q(x_T \vert x_0)} \big\Vert \underline{p(x_T)} \Bigr) \Big\vert_{x_0=X_0} \biggr] \cr & \qquad + \sum_{t=2}^T \underbrace{\mathbb E_{X_0,X_t\sim q(x_0,x_{t})} \biggl[ D_{\mathtt{KL}} \Bigl( {\underline{\color{red}{q(x_{t-1} \vert x_t,x_0)}}} \big\Vert \underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} } \Bigr)\Big \vert_{x_0,x_t=X_0,X_t} \biggr]}_{L_{t-1}} \cr & \qquad \qquad + \underbrace{\mathbb E_{X_0,X_1\sim q(x_0,x_1)} \biggl[ -\log {\color{blue}{p_{\theta}(x_0 \vert x_1)}} \Big\vert_{x_0,x_1=X_0,X_1} \biggr]}_{L_0}. \end{aligned}
Minimizing L over \theta \iff minimizing each L_{t-1}, t=1,\cdots,T, since the first term does not depend on \theta.
We focus on t\geq 2.
By Bayes’ rule and after a long calculation, \begin{aligned} q(x_{t-1} \vert x_t,x_0) = \mathcal{N}\bigl( x_{t-1}; \mu_{t}(x_t,x_0),\Sigma_t \bigr), \quad t = 2,\cdots,T, \end{aligned} where \begin{aligned} \mu_{t}(x_t,x_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1-\overline{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}x_t , \quad \Sigma_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t \mathbf{I}. \end{aligned}
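The “long calculation” is ordinary Gaussian conjugacy: by the Markov property, q(x_{t-1}\vert x_t,x_0) \propto q(x_t\vert x_{t-1})\, q(x_{t-1}\vert x_0). It is easy to verify numerically; a 1-D NumPy check (the schedule and the conditioning values x_0, x_t are arbitrary choices of ours):

```python
import numpy as np

# One-dimensional grid check of mu_t and Sigma_t.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 40                              # an arbitrary interior step
a_t, b_t = alphas[t], betas[t]
ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
x0, xt = 0.8, -0.3                  # arbitrary conditioning values

# By the Markov property, q(x_{t-1}|x_t,x_0) ∝ q(x_t|x_{t-1}) q(x_{t-1}|x_0).
u = np.linspace(-5.0, 5.0, 200_001)
du = u[1] - u[0]
log_post = (-0.5 * (xt - np.sqrt(a_t) * u) ** 2 / b_t
            - 0.5 * (u - np.sqrt(ab_prev) * x0) ** 2 / (1.0 - ab_prev))
post = np.exp(log_post - log_post.max())
post /= post.sum() * du

mean_grid = (u * post).sum() * du
var_grid = ((u - mean_grid) ** 2 * post).sum() * du
mean_formula = (np.sqrt(ab_prev) * b_t * x0 + np.sqrt(a_t) * (1 - ab_prev) * xt) / (1 - ab_t)
var_formula = (1 - ab_prev) / (1 - ab_t) * b_t

print(mean_grid, mean_formula)  # agree to grid accuracy
print(var_grid, var_formula)
```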
Recall that L_{t-1}=\mathbb E_{X_0,X_t\sim q(x_0,x_{t})} \biggl[ D_{\mathtt{KL}} \Bigl( {\underline{\color{red}{q(x_{t-1} \vert x_t,x_0)}}}\big\Vert \underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} }\Bigr)\Big \vert_{x_0,x_t=X_0,X_t} \biggr].
For each t=2,\cdots, T, our goal is to minimize \begin{aligned} D_{\mathtt{KL}} \Bigl( {\underline{\color{red}{q(x_{t-1} \vert x_t,x_0)}}} \big\Vert \underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} } \Bigr), \end{aligned} where \begin{aligned} {\color{red}{q(x_{t-1} \vert x_t,x_0)}} &= \mathcal{N}\bigl( x_{t-1}; \mu_{t}(x_t,x_0),\Sigma_t \bigr) ,\cr {\color{blue}{p_{\theta}(x_{t-1}\vert x_t)}} &= \mathcal{N}(x_{t-1};{\color{green}{\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t)}}) \end{aligned}
with \begin{aligned} \mu_{t}(x_t,x_0) &= \frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1-\overline{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}x_t , \cr \Sigma_t &= \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t \mathbf{I}. \end{aligned}
Want: \mu_{\theta}(x_t,t)\approx \mu_t(x_t,x_0).
Try to set \mu_{\theta}(x_t,t)=\mu_t(x_t,\widehat{x}_0), where \begin{aligned} \widehat{x}_0=\widehat{x}_0(x_t,t). \end{aligned}
Write \begin{aligned} X_t = \sqrt{\overline{\alpha}_t} X_0 + \sqrt{1-\overline{\alpha}_t}\, \overline{\varepsilon}_t. \end{aligned} \tag{3} Then under q, \overline{\varepsilon}_t\perp X_0 and \overline{\varepsilon}_t\sim \mathcal{N}(\mathbf{0},\mathbf{I}); that is, (3) is the closed form of the forward process, q(x_t\vert x_0) = \mathcal{N}\bigl( x_t; \sqrt{\overline{\alpha}_t}\, x_0, (1-\overline{\alpha}_t)\mathbf{I} \bigr).
For each t, we let \color{blue}{\widehat{x}_0=\widehat{x}_0(x_t,t)} s.t. \begin{aligned} x_t = \sqrt{\overline{\alpha}_t} {\color{blue}{\widehat{x}_0}} + \sqrt{1-\overline{\alpha}_t} {\color{red}{\varepsilon_{\theta}(x_t,t)}}, \end{aligned} where {\color{red}{\varepsilon_{\theta}}} is our model that predicts the {\color{red}{\textbf{real noise }\overline{\varepsilon}_t}} given x_t and t.
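Explicitly, solving the display above for \widehat{x}_0 gives \begin{aligned} {\color{blue}{\widehat{x}_0(x_t,t)}} = \frac{1}{\sqrt{\overline{\alpha}_t}} \bigl( x_t - \sqrt{1-\overline{\alpha}_t}\, {\color{red}{\varepsilon_{\theta}(x_t,t)}} \bigr). \end{aligned}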
Solving (3) for x_0 and substituting, we get \begin{aligned} \mu_t(x_t,x_0) &= \mu_t \Bigl( x_t, \frac{1}{\sqrt{\overline{\alpha}_t}}\bigl( x_t-\sqrt{1-\overline{\alpha}_t} \overline{\varepsilon}_t \bigr) \Bigr) \cr &=\frac{1}{\sqrt{\alpha_t}} \Bigl( x_t - \frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}} {\color{red}{\overline{\varepsilon}_t}} \Bigr). \end{aligned}
\color{red}{\textbf{We reparametrise}} \mu_{\theta} by \begin{aligned} {\color{green}{\mu_{\theta} (x,t)}} = \mu_t(x,\widehat{x}_0(x,t)) = \frac{1}{\sqrt{\alpha_t}} \Bigl( x - \frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}} {\color{red}{\varepsilon_{\theta} (x,t) }} \Bigr). \end{aligned}
Given t,x_t, \begin{aligned} \text{predict }\overline{\varepsilon}_t \iff {\color{blue}{\text{predict } X_0}}. \end{aligned}
Hence, with the common choice \Sigma_{\theta}(x_t,t) = \sigma_t^2 \mathbf{I}, \begin{aligned} D_{\mathtt{KL}} \Bigl( {\underline{\color{red}{q(x_{t-1} \vert x_t,x_0)}}} \big\Vert \underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} } \Bigr) = \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\overline{\alpha}_t)} \left\lVert \overline{\varepsilon}_t - \varepsilon_{\theta}(x_t,t) \right\rVert^2 + C, \end{aligned} where C does not depend on \theta (and C=0 if \sigma_t^2 = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t).
Training and Sampling
Remark. We can minimize \mathbb E_{X\sim q(x)}[f_{\theta}(X)] by stochastic gradient descent:
repeat until convergence:
1. Sample X \sim q(x).
2. Update \theta \leftarrow \theta - \eta \nabla_{\theta} f_{\theta}(X), where \eta>0 is the learning rate.
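A toy instance in PyTorch; f_{\theta}(x)=(\theta-x)^2 and q=\mathcal{N}(3,1) are our own illustrative choices, so the minimizer is \theta=\mathbb E[X]=3:

```python
import torch

# f_theta(x) = (theta - x)^2 with q = N(3, 1), so E[f_theta(X)] is minimized
# at theta = E[X] = 3.
theta = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.05)

for step in range(2000):
    x = 3.0 + torch.randn(1)   # sample X ~ q
    loss = (theta - x) ** 2    # f_theta(X)
    opt.zero_grad()
    loss.backward()
    opt.step()                 # theta <- theta - lr * grad

print(theta.item())  # ~ 3.0
```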
Remark. We may sample (x,y)\sim q(x,y) by first sampling x\sim q(x) and then sampling y\sim q(y\vert x).
Minimize each L_{t-1}=\mathbb E_{X_0,X_t\sim q(x_0,x_{t})} \biggl[ D_{\mathtt{KL}} \Bigl( {\underline{\color{red}{q(x_{t-1} \vert x_t,x_0)}}} \big\Vert \underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} } \Bigr)\Big \vert_{x_0,x_t=X_0,X_t} \biggr] as follows:
1. Sample x_0 \sim q(x_0), t \sim \mathrm{Uniform}\lbrace 1,\cdots,T \rbrace, and \overline{\varepsilon}_t \sim \mathcal{N}(\mathbf{0},\mathbf{I}).
2. Set x_t = \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1-\overline{\alpha}_t}\, \overline{\varepsilon}_t, as in (3).
3. Take a gradient descent step on \nabla_{\theta} \bigl\lVert \overline{\varepsilon}_t - \varepsilon_{\theta}(x_t,t) \bigr\rVert^2 (optionally weighted by \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\overline{\alpha}_t)}; DDPM drops this weight).
Algorithm 2 can be written more compactly as Algorithm 3.
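Since Algorithms 2 and 3 are not reproduced here, the following is only a minimal PyTorch sketch of the whole pipeline under the simplified, unweighted objective \lVert \overline{\varepsilon}_t - \varepsilon_{\theta}(x_t,t) \rVert^2; the tiny MLP for \varepsilon_{\theta}, the 1-D toy data, T=200, and the choice \sigma_t^2=\beta_t are placeholder assumptions of ours:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T = 200                                  # toy schedule (arrays are 0-indexed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abars = torch.cumprod(alphas, dim=0)

# eps_theta(x_t, t): a small MLP on (x_t, t/T); a placeholder architecture.
model = nn.Sequential(nn.Linear(2, 64), nn.SiLU(),
                      nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def sample_data(n):
    # Toy q(x_0): two clusters around -2 and +2.
    signs = torch.randint(0, 2, (n, 1)).float() * 4.0 - 2.0
    return signs + 0.1 * torch.randn(n, 1)

for step in range(5000):                 # training: minimize ||eps - eps_theta||^2
    x0 = sample_data(128)
    t = torch.randint(0, T, (128,))
    eps = torch.randn_like(x0)
    ab = abars[t].unsqueeze(1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # closed-form forward (3)
    pred = model(torch.cat([xt, t.unsqueeze(1).float() / T], dim=1))
    loss = ((eps - pred) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

@torch.no_grad()
def sample(n):                           # ancestral sampling through p_theta
    x = torch.randn(n, 1)                # x_T ~ N(0, I)
    for t in reversed(range(T)):
        tt = torch.full((n, 1), t / T)
        eps_hat = model(torch.cat([x, tt], dim=1))
        mu = (x - betas[t] / (1 - abars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        sigma = betas[t].sqrt()          # the common choice sigma_t^2 = beta_t
        x = mu + sigma * torch.randn_like(x) if t > 0 else mu
    return x

print(sample(1000).squeeze().quantile(torch.tensor([0.25, 0.75])))
# Quartiles land roughly near the two modes -2 and +2.
```

Sampling runs the learned reverse chain from x_T\sim \mathcal{N}(\mathbf{0},\mathbf{I}) down to x_0, using \mu_{\theta} from the reparametrisation above.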
Figures: samples generated by the trained model on mnist and fashion_mnist.
- The main purpose of all generative models is to learn a distribution p_{\theta}(x_0) such that \begin{aligned} p_{\theta}(x_0)\approx q(x_0). \end{aligned}
Our goal: Find \begin{aligned} \mu_{\theta}^*, \Sigma_{\theta}^* = \arg \min_{\mu_{\theta},\Sigma_{\theta}} D_{\mathtt{KL}} \bigl( q(x_0) \big\Vert p_{\theta}(x_0) \bigr) = \arg \min_{\mu_{\theta},\Sigma_{\theta}} \underbrace{\color{blue}{\mathbb E_{X_0\sim q(x_0)}[-\log p_{\theta}(X_0)]}}_{:=L}. \end{aligned}
By some calculation, \begin{aligned} L &= \mathbb E_{X_0\sim q(x_0)} \biggl[ D_{\mathtt{KL}} \Bigl( \underline{q(x_T \vert x_0)} \big\Vert \underline{p(x_T)} \Bigr) \Big\vert_{x_0=X_0} \biggr] + \sum_{t=2}^T \underbrace{\mathbb E_{X_0,X_t\sim q(x_0,x_{t})} \biggl[ D_{\mathtt{KL}} \Bigl( {\underline{\color{red}{q(x_{t-1} \vert x_t,x_0)}}} \big\Vert \underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} } \Bigr)\Big \vert_{x_0,x_t=X_0,X_t} \biggr]}_{L_{t-1}} \cr & \qquad \qquad + \underbrace{\mathbb E_{X_0,X_1\sim q(x_0,x_1)} \bigl[ -\log {\color{blue}{p_{\theta}(X_0 \vert X_1)}} \bigr]}_{L_0}. \end{aligned}
In DDPM, we choose the special \mu_{\theta},\Sigma_{\theta} so that \begin{aligned} D_{\mathtt{KL}} \Bigl( {\underline{\color{red}{q(x_{t-1} \vert x_t,x_0)}}} \big\Vert \underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} } \Bigr) = \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\overline{\alpha}_t)} \left\lVert \overline{\varepsilon}_t - \varepsilon_{\theta}(x_t,t) \right\rVert^2 + C, \end{aligned} where C does not depend on \theta and {\color{red}{\varepsilon_{\theta}}} is our model that predicts the real noise \overline{\varepsilon}_t given t and x_t.
Figure: an image generated from the prompt “An expressive oil painting of a basketball player dunking, depicted as an explosion of a nebula.”
Appendix
Let X=[X_1,\cdots,X_n]^T\sim \mathcal{N}(\mu,\Sigma) in \mathbb R^n, where \mu \in \mathbb R^n and \Sigma \in \mathbb R^{n\times n} is a positive semi-definite matrix.
For n=1, we write X\sim \mathcal{N}(\mu,\sigma^2) in \mathbb R, where \mu\in \mathbb R and \sigma>0. Then \begin{aligned} f_X(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \Big\lbrace-\frac{1}{2}\cdot \frac{(x-\mu)^2}{\sigma^2} \Big\rbrace, \quad x \in \mathbb R. \end{aligned} Or, equivalently, there exists Z\sim \mathcal{N}(0,1) in \mathbb R such that \begin{aligned} X = \sigma Z + \mu. \end{aligned}
For general n\in \mathbb N, assuming \Sigma is positive definite (so that \Sigma^{-1} exists), \begin{aligned} f_{X}(x) = \frac{1}{\sqrt{(2\pi)^{n}\det(\Sigma)}}\exp\Bigl\lbrace -\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \Bigr\rbrace, \quad x \in \mathbb R^n. \end{aligned}
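The reparametrisation X=\sigma Z+\mu also generalises: if Z\sim \mathcal{N}(\mathbf{0},\mathbf{I}) and AA^T=\Sigma (e.g. A the Cholesky factor of \Sigma), then \mu+AZ\sim \mathcal{N}(\mu,\Sigma). A minimal NumPy sketch, with an arbitrary \mu,\Sigma of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # positive definite by construction

L = np.linalg.cholesky(Sigma)           # L @ L.T == Sigma
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ L.T                        # rows are i.i.d. N(mu, Sigma) samples

print(X.mean(axis=0))                   # ~ mu
print(np.cov(X.T))                      # ~ Sigma
```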
Suppose a random element X on \mathcal{X} has two densities q,p under consideration, where
- q is the true density that our data actually follows,
- p=p_{\theta} is the density with which our model approximates q.
One way to measure the discrepancy between q and p is the \mathtt{KL}-divergence: \begin{aligned} D_{\mathtt{KL}}(q \Vert p) = \mathbb E_{X\sim q}\Bigl[ \log\frac{q(X)}{p(X)} \Bigr] = \int_{x\in \mathcal{X}} q(x) \log \frac{q(x)}{p(x)} \mathrm{d}x. \end{aligned}
Minimize KL divergence \iff Maximize likelihood.
For any p,q, D_{\mathtt{KL}}(q\Vert p) \geq 0, with equality if and only if q=p almost everywhere.
The \mathtt{KL}-divergence has a closed form when both q and p are normal.
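For reference, for q=\mathcal{N}(\mu_1,\Sigma_1) and p=\mathcal{N}(\mu_2,\Sigma_2) on \mathbb R^n (a standard fact), \begin{aligned} D_{\mathtt{KL}}(q \Vert p) = \frac{1}{2} \Bigl( \operatorname{tr}\bigl(\Sigma_2^{-1}\Sigma_1\bigr) + (\mu_2-\mu_1)^T \Sigma_2^{-1} (\mu_2-\mu_1) - n + \log \frac{\det \Sigma_2}{\det \Sigma_1} \Bigr). \end{aligned}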
Let \lbrace x^{(1)},\cdots, x^{(n)} \rbrace be i.i.d. samples from q. Then \begin{aligned} \theta^{\star} &= \arg \max_{\theta} \log \prod_{i=1}^n p_{\theta}(x^{(i)}) = \arg \max_{\theta} \sum_{i=1}^n \log p_{\theta}(x^{(i)}) \cr &= \arg \max_{\theta} \frac{1}{n}\sum_{i=1}^n \log p_{\theta}(x^{(i)}) \stackrel{\text{SLLN}}{\approx} \arg \max_{\theta} \mathbb E_{X\sim q}\bigl[ \log p_{\theta}(X) \bigr] \cr &= \arg \max_{\theta} \int_{x} q(x) \log p_{\theta}(x) \mathrm{d}x - \int_{x} q(x) \log q(x) \mathrm{d}x \cr &= \arg \max_{\theta} \int_{x} q(x) \log\frac{p_{\theta}(x)}{q(x)} \mathrm{d}x = \arg\min_{\theta} D_{\mathtt{KL}} (q \Vert p_{\theta}). \end{aligned}