A

Miscellaneous

Synthetic Experiments For the Capacity Landscape

This appendix describes the synthetic experiments used to produce Figure 1.3.

Classical Bias–Variance Curve

In each Monte Carlo replicate, draw training inputs \(x_i \overset{\text{iid}}{\sim} \mathcal{U}(0,1)\) and corresponding targets \(t_i = \sin(2\pi x_i) + \varepsilon_i\), where \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\) with \(\sigma=0.1\). Use \(N_{\text{train}}=40\) samples for training and evaluate performance on \(N_{\text{test}}=1\ 000\) fresh points drawn from the same distribution.

For each replicate, fit ordinary least-squares polynomials of degree \(m=0,\dots,10\). The test mean squared error (MSE) of degree \(m\) is given by \begin{equation} \frac{1}{N_{\text{test}}} \sum_{j=1}^{N_{\text{test}}} \bigl(y_m(x_j) - t_j^{\text{test}}\bigr)^2, \end{equation} where \(y_m\) denotes the fitted polynomial of degree \(m\). The curve shown in the figure is obtained by averaging this quantity over \(1\ 000\) independent replicates, yielding a smooth estimate of the expected risk.
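The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not necessarily the code used for the figure: the function name `bias_variance_curve`, its arguments, and the seed are hypothetical, and the fit uses a monomial (Vandermonde) basis solved with `numpy.linalg.lstsq`.

```python
import numpy as np


def bias_variance_curve(n_rep=1000, n_train=40, n_test=1000,
                        sigma=0.1, max_degree=10, seed=0):
    """Average test MSE of least-squares polynomial fits of degree 0..max_degree.

    Hypothetical helper: names, defaults, and the seed are illustrative.
    """
    rng = np.random.default_rng(seed)
    degrees = np.arange(max_degree + 1)
    mse = np.zeros((n_rep, degrees.size))

    for r in range(n_rep):
        # Fresh training and test sets for every replicate.
        x_tr = rng.uniform(0.0, 1.0, n_train)
        t_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0.0, sigma, n_train)
        x_te = rng.uniform(0.0, 1.0, n_test)
        t_te = np.sin(2 * np.pi * x_te) + rng.normal(0.0, sigma, n_test)

        for m in degrees:
            # Design matrices with columns 1, x, ..., x^m.
            phi_tr = np.vander(x_tr, m + 1, increasing=True)
            phi_te = np.vander(x_te, m + 1, increasing=True)
            # Ordinary least-squares coefficients of the degree-m polynomial.
            w, *_ = np.linalg.lstsq(phi_tr, t_tr, rcond=None)
            mse[r, m] = np.mean((phi_te @ w - t_te) ** 2)

    # Averaging over replicates estimates the expected risk per degree.
    return degrees, mse.mean(axis=0)
```

Calling `bias_variance_curve()` returns the degrees and the replicate-averaged test MSE, which can be plotted directly. A smaller `n_rep` gives a noisier but faster estimate.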

Double‑descent Experiment

The purpose of this experiment is to illustrate the double-descent phenomenon in linear regression, where the test error first decreases, then increases sharply around the interpolation threshold, and finally decreases again as the model becomes increasingly overparameterized. Fix the number of training samples to \begin{equation} N_{\text{train}} = 50. \end{equation} The data is generated from a low-dimensional linear model \begin{equation} y = X_{1:d_0} w_\star + \varepsilon, \end{equation} where

  • \(d_0 = 10\) is the true intrinsic dimensionality,

  • \(w_\star \sim \mathcal{N}(0, I)\) are the ground-truth regression weights,

  • \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\) with \(\sigma^2 = 2\) is additive Gaussian noise.

The design matrix \begin{equation} X \in \mathbb{R}^{N_{\text{train}} \times p} \end{equation} has entries \(X_{ij} \sim \mathcal{N}(0,1)\). The feature dimension \(p\) is varied from \(p=1\) to \(p=100\). Within a single run, the same training set \((X, y)\) is reused across all values of \(p\), with the model given access to progressively more feature columns. For each \(p\), compute the minimum-norm least-squares estimator \begin{equation} \hat{w} = X^{+} y, \end{equation} where \(X^{+}\) denotes the Moore–Penrose pseudoinverse of \(X\). When \(p < N_{\text{train}}\), this coincides with the standard least-squares solution \(\hat{w} = (X^\top X)^{-1} X^\top y\). When \(p \geq N_{\text{train}}\), it equals \(\hat{w} = X^\top (XX^\top)^{-1} y\) and interpolates the training data exactly. Both inverses exist almost surely for Gaussian \(X\).

Generalization performance is assessed using the mean squared error (MSE) on \(10\ 000\) fresh test points, sampled independently from the same distribution as the training data. To reduce variance and highlight the characteristic shape of the error curve, results are averaged over 200 independent repetitions of the experiment, each with newly drawn \((X, w_\star, \varepsilon)\).
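The experiment can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the exact code behind the figure: the function name `double_descent_curve` and its arguments are hypothetical. It also assumes that a full feature matrix with \(p_{\max}=100\) columns is drawn once per run, that the targets always depend on its first \(d_0\) columns, and that the model for a given \(p\) uses the first \(p\) columns. The minimum-norm solution is computed with `numpy.linalg.pinv`.

```python
import numpy as np


def double_descent_curve(n_rep=200, n_train=50, n_test=10_000,
                         p_max=100, d0=10, noise_var=2.0, seed=0):
    """Average test MSE of the minimum-norm least-squares fit for p = 1..p_max.

    Hypothetical helper: names, defaults, and the seed are illustrative.
    """
    rng = np.random.default_rng(seed)
    noise_std = np.sqrt(noise_var)
    ps = np.arange(1, p_max + 1)
    mse = np.zeros((n_rep, ps.size))

    for r in range(n_rep):
        # Newly drawn weights, features, and noise for every repetition.
        w_star = rng.standard_normal(d0)
        x_tr = rng.standard_normal((n_train, p_max))
        x_te = rng.standard_normal((n_test, p_max))
        # Targets depend only on the first d0 feature columns.
        y_tr = x_tr[:, :d0] @ w_star + rng.normal(0.0, noise_std, n_train)
        y_te = x_te[:, :d0] @ w_star + rng.normal(0.0, noise_std, n_test)

        for k, p in enumerate(ps):
            # Pseudoinverse solution: ordinary least squares for p < n_train,
            # minimum-norm interpolator for p >= n_train.
            w_hat = np.linalg.pinv(x_tr[:, :p]) @ y_tr
            mse[r, k] = np.mean((x_te[:, :p] @ w_hat - y_te) ** 2)

    return ps, mse.mean(axis=0)
```

Plotting the returned curve on a logarithmic vertical axis makes the peak near \(p = N_{\text{train}}\) easier to see, since the error there is heavy-tailed across repetitions.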

Kullback–Leibler Divergence Between Multivariate Gaussians

In this section, the Kullback–Leibler (KL) divergence between two multivariate Gaussian distributions is derived. This quantity admits a closed-form expression, which will prove useful in subsequent sections. Let \(P\) and \(Q\) denote two \(k\)-dimensional Gaussian distributions over the same random variable \(\bm{a}\), defined as \begin{equation} P(\bm a) = \mathcal{N}(\bm a\,|\, \bm \mu_1, \bm \Sigma_1), \quad Q(\bm a) = \mathcal{N}(\bm a\,|\, \bm \mu_2, \bm \Sigma_2). \end{equation} The KL divergence from \(P\) to \(Q\) is defined as \begin{equation} \mathrm{KL}(P\,\|\,Q) = \mathbb{E}_{P(\bm a)}\left[\log P(\bm a) - \log Q(\bm a)\right]. \end{equation} The log-density of a multivariate Gaussian distribution can be expressed as \begin{equation} \begin{aligned} \log P(\bm a) &= -\frac{k}{2}\log(2\pi) - \frac{1}{2} \log \det(\bm \Sigma_1) - \frac{1}{2} (\bm a - \bm \mu_1)^\top \bm \Sigma_1^{-1} (\bm a - \bm \mu_1) \\ &= -\frac{k}{2}\log(2\pi) - \frac{1}{2} \log \det(\bm \Sigma_1) - \frac{1}{2} \operatorname{tr}\left[\bm \Sigma_1^{-1} (\bm a - \bm \mu_1)(\bm a - \bm \mu_1)^\top \right]. \end{aligned} \end{equation} Since the trace operator is linear, it commutes with the expectation, and the expected quadratic term under \(P\) simplifies as follows: \begin{equation} \begin{aligned} \mathbb{E}_{P(\bm a)}\left[\operatorname{tr}\left(\bm \Sigma_1^{-1} (\bm a - \bm \mu_1)(\bm a - \bm \mu_1)^\top \right)\right] &= \operatorname{tr}\left(\bm \Sigma_1^{-1} \mathbb{E}_{P(\bm a)}\left[(\bm a - \bm \mu_1)(\bm a - \bm \mu_1)^\top \right]\right) \\ &= \operatorname{tr}\left(\bm \Sigma_1^{-1} \bm \Sigma_1\right) = k. \end{aligned} \end{equation} The corresponding quadratic term in the expression for \(\log Q(\bm a)\) is centered at \(\bm \mu_2\) rather than \(\bm \mu_1\), so its expectation under \(P\) requires the second moment of \(\bm a\) about \(\bm \mu_2\). Writing \(\bm a - \bm \mu_2 = (\bm a - \bm \mu_1) + (\bm \mu_1 - \bm \mu_2)\) and noting that the cross terms vanish in expectation under \(P\) gives \(\mathbb{E}_{P(\bm a)}\left[(\bm a - \bm \mu_2)(\bm a - \bm \mu_2)^\top\right] = \bm \Sigma_1 + (\bm \mu_1 - \bm \mu_2)(\bm \mu_1 - \bm \mu_2)^\top\).
Then, \begin{equation} \begin{aligned} \mathbb{E}_{P(\bm a)}\left[\operatorname{tr}\left(\bm \Sigma_2^{-1} (\bm a - \bm \mu_2)(\bm a - \bm \mu_2)^\top \right)\right] &= \operatorname{tr}\left(\bm \Sigma_2^{-1} \mathbb{E}_{P(\bm a)}\left[(\bm a - \bm \mu_2)(\bm a - \bm \mu_2)^\top \right]\right) \\ &= \operatorname{tr}\left(\bm \Sigma_2^{-1} \left( \bm \Sigma_1 + (\bm \mu_1 - \bm \mu_2)(\bm \mu_1 - \bm \mu_2)^\top \right) \right) \\ &= \operatorname{tr}(\bm \Sigma_2^{-1} \bm \Sigma_1) + (\bm \mu_1 - \bm \mu_2)^\top \bm \Sigma_2^{-1} (\bm \mu_1 - \bm \mu_2). \end{aligned} \end{equation} Combining the above components, the \(\frac{k}{2}\log(2\pi)\) terms cancel and the KL divergence becomes \begin{equation} \mathrm{KL}(P\,\|\,Q) = \frac{1}{2} \left[ \log \frac{\det(\bm \Sigma_2)}{\det(\bm \Sigma_1)} - k + \operatorname{tr}(\bm \Sigma_2^{-1} \bm \Sigma_1) + (\bm \mu_1 - \bm \mu_2)^\top \bm \Sigma_2^{-1} (\bm \mu_1 - \bm \mu_2) \right]. \end{equation} In the special case where \(\bm \mu_2 = \bm 0\) and \(\bm \Sigma_2 = \bm I\), the expression simplifies further to \begin{equation} \mathrm{KL}(P\,\|\,\mathcal{N}(\bm 0, \bm I)) = \frac{1}{2} \left[ -\log \det(\bm \Sigma_1) - k + \operatorname{tr}(\bm \Sigma_1) + \bm \mu_1^\top \bm \mu_1 \right]. \end{equation}
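The closed form can be checked numerically against a Monte Carlo estimate of the defining expectation. The sketch below is illustrative only: the function names `gaussian_kl`, `gaussian_logpdf`, and `monte_carlo_kl`, the sample size, and the seed are assumptions introduced here and do not appear elsewhere in the text.

```python
import numpy as np


def gaussian_kl(mu1, cov1, mu2, cov2):
    """Closed-form KL divergence between N(mu1, cov1) and N(mu2, cov2)."""
    k = mu1.shape[0]
    diff = mu1 - mu2
    # slogdet returns (sign, log|det|); the sign is +1 for valid covariances.
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    # solve(cov2, .) applies cov2^{-1} without forming the inverse explicitly.
    trace_term = np.trace(np.linalg.solve(cov2, cov1))
    quad_term = diff @ np.linalg.solve(cov2, diff)
    return 0.5 * (logdet2 - logdet1 - k + trace_term + quad_term)


def gaussian_logpdf(a, mu, cov):
    """Log-density of N(mu, cov) evaluated at each row of a (shape (n, k))."""
    k = mu.shape[0]
    diff = a - mu
    _, logdet = np.linalg.slogdet(cov)
    # Row-wise quadratic form (a - mu)^T cov^{-1} (a - mu).
    quad = np.einsum("ni,ni->n", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + quad)


def monte_carlo_kl(mu1, cov1, mu2, cov2, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_P[log P(a) - log Q(a)] with samples from P."""
    rng = np.random.default_rng(seed)
    a = rng.multivariate_normal(mu1, cov1, size=n_samples)
    return np.mean(gaussian_logpdf(a, mu1, cov1) - gaussian_logpdf(a, mu2, cov2))
```

For any valid pair of covariances the two estimates should agree up to Monte Carlo error, and `gaussian_kl` should return zero when both distributions coincide.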

Gaussian Expectation of Powered Likelihoods

In this section, the closed-form expression for the expectation of a powered Gaussian likelihood under a Gaussian distribution is derived. This expectation is relevant for evaluating generalized divergences such as the \(\alpha\)-divergence. Specifically, let \(Q(f) = \mathcal{N}(f | \mu, \Sigma)\) be a Gaussian distribution over the latent variable \(f\), with mean \(\mu\) and variance \(\Sigma\), and let the likelihood be Gaussian as well, \(P(y | f) = \mathcal{N}(y | f, \sigma^2)\). Consider the case where the likelihood is raised to a power \(\alpha \in (0,1)\).

Begin by observing that raising a Gaussian density to the power \(\alpha\) yields an unnormalized Gaussian: \begin{equation} P(y | f)^\alpha = \left(\frac{1}{\sqrt{2\pi} \sigma} \exp\left\{ -\frac{1}{2} \frac{(y - f)^2}{\sigma^2} \right\} \right)^\alpha = \frac{1}{(2\pi \sigma^2)^{\alpha/2}} \exp\left\{ -\frac{1}{2} \frac{(y - f)^2}{\sigma^2 / \alpha} \right\}. \end{equation} This expression corresponds to the density of a Gaussian distribution with variance \(\sigma^2 / \alpha\), up to a normalization constant. That is, \begin{equation} P(y | f)^\alpha = \left( \frac{2\pi \sigma^2}{\alpha} \right)^{1/2} \cdot \frac{1}{(2\pi \sigma^2)^{\alpha/2}} \cdot \mathcal{N}(y | f, \sigma^2/\alpha). \end{equation} Compute the expectation of this unnormalized expression under the distribution \(Q(f)\): \begin{equation} \begin{aligned} \mathbb{E}_{Q(f)}\left[ P(y | f)^\alpha \right] &= \left( \frac{2\pi \sigma^2}{\alpha} \right)^{1/2} \cdot \frac{1}{(2\pi \sigma^2)^{\alpha/2}} \cdot \mathbb{E}_{Q(f)}\left[ \mathcal{N}(y | f, \sigma^2/\alpha) \right]. \end{aligned} \end{equation} The remaining expectation involves the convolution of two Gaussian densities. Since the convolution of Gaussians is Gaussian: \begin{equation} \mathbb{E}_{Q(f)}\left[ \mathcal{N}(y | f, \sigma^2/\alpha) \right] = \mathcal{N}(y | \mu, \Sigma + \sigma^2 / \alpha). \end{equation} Hence, the final expression becomes \begin{equation} \mathbb{E}_{Q(f)}\left[ P(y | f)^\alpha \right] = \left( \frac{2\pi \sigma^2}{\alpha} \right)^{1/2} \cdot \frac{1}{(2\pi \sigma^2)^{\alpha/2}} \cdot \mathcal{N}(y | \mu, \Sigma + \sigma^2/\alpha). \end{equation}
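The final expression can be verified by comparing it with a direct numerical integration over \(f\). The sketch below is a sanity check under stated assumptions, not part of the derivation: the function names, the grid size, and the integration width are hypothetical choices, and the integral is approximated by a plain Riemann sum on a uniform grid.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)


def powered_likelihood_expectation(y, mu, var_q, sigma2, alpha):
    """Closed form of E_Q[P(y | f)^alpha] for Q = N(mu, var_q)."""
    # Constant relating P(y | f)^alpha to the density N(y | f, sigma2 / alpha).
    const = np.sqrt(2 * np.pi * sigma2 / alpha) / (2 * np.pi * sigma2) ** (alpha / 2)
    return const * normal_pdf(y, mu, var_q + sigma2 / alpha)


def numerical_expectation(y, mu, var_q, sigma2, alpha,
                          n_grid=200_001, width=12.0):
    """Riemann-sum approximation of the same expectation over f."""
    half = width * np.sqrt(var_q)
    f = np.linspace(mu - half, mu + half, n_grid)
    df = f[1] - f[0]
    # Integrand: Q(f) times the likelihood raised to the power alpha.
    integrand = normal_pdf(f, mu, var_q) * normal_pdf(y, f, sigma2) ** alpha
    return np.sum(integrand) * df
```

The two functions should agree to high precision for any \(\alpha \in (0,1)\). Setting \(\alpha = 1\) in the closed form recovers the usual Gaussian marginal \(\mathcal{N}(y | \mu, \Sigma + \sigma^2)\), which serves as an additional check.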