1

Introduction

Thesis Outline and Contributions

Machine learning is a field of computer science that studies how to build algorithms able to automatically discover regularities in data (Bishop, 2006; Hastie et al., 2001). Throughout this thesis, such a regularity, or pattern, is understood as any recurrent structure that can be captured by a predictive rule with minimal human guidance. The focus of this thesis narrows to the supervised setting. A training dataset is denoted by \begin{equation} \mathcal{D} =\bigl\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_N,y_N)\bigr\} \subset \mathbb{R}^D \times \mathbb{R}^C\,, \end{equation} where each \(\mathbf{x}_i\) represents an input (attribute) vector and \(y_i\) its associated target value. Supervised learning aims to learn or approximate an unknown function \(f:\mathbb{R}^D \to \mathbb{R}^C\) that describes the process underlying the observed data, that is, \(y \approx f(\mathbf{x})\). When \(y\) ranges over a continuous domain (e.g. \(\mathbb{R}\)) the problem is termed regression; when \(y\) takes values in a finite set such as \(\{-1,1\}\) it is called classification, and \(y\) is then the class label.

In practice, a hypothesis space \(\mathcal{F}\) is chosen, and the objective is to determine the element of that space that is closest to the unknown function \(f\). The usual approach is to define a parametric family of functions, that is, a hypothesis space that depends solely on a parameter vector \(\bm \theta \in \bm \Theta \subset \mathbb{R}^P\). Neural networks (NNs) are an example: once an architecture is fixed, each configuration of the parameters defines one function of the family.

It is common to use subscript notation, \(f_{\bm{\theta}}\), to denote a predictor that is fully determined by its parameters. The aim is then to use the given samples \(\mathcal{D}\) to find the parameters \(\hat{\bm{\theta}}\) such that \(f_{\hat{\bm{\theta}}}\) approximates \(f\) as closely as possible (Bishop, 1995; Hastie et al., 2001). The resulting predictor should not only reproduce the training targets but also generalize, that is, maintain its accuracy on new, unseen observations.

Two practical difficulties complicate this endeavor. Firstly, supervised datasets are often limited in size and may be corrupted by random noise (Bishop, 2006), which inflates the variance of any parameter estimate. Secondly, distinguishing intrinsic regularities from spurious fluctuations produced by chance is difficult. Consequently, selecting \(\mathcal{F}\) is critical: if the class of candidate functions is too restrictive, the learned rule under-fits, failing to capture genuine structure; if it is overly flexible, the rule over-fits, adapting to noise rather than to the underlying regularities. Both extremes deteriorate predictive performance on future data.

Illustrative toy example

To make the notions of model complexity, under-fitting, and over-fitting concrete, a simple regression experiment is presented; see Figure 1.1. The task is to recover an unknown input–output relationship from examples. In supervised learning, a collection of input–output pairs \begin{equation} \mathcal{D} \;=\; \{(x_i,y_i)\}_{i=1}^{N} \end{equation} is observed, where \(x_i\in[0,1]\) is an input (here, a scalar) and \(y_i\in\mathbb{R}\) is a noisy observation of an underlying target function. In this example, the data are generated by \begin{equation} y \;=\; \sin(2\pi x)\;+\;\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\sigma^2),\quad \sigma=0.3, \qquad x \sim \mathcal{U}(0,1). \end{equation} The goal of learning is to produce a predictive rule \(x\mapsto f(x)\) that maps new, unseen inputs to accurate outputs.


Figure 1.1: Illustrative polynomial curve–fitting experiment showing the perils of model complexity and the benefits of regularization. Orange curves are fitted polynomials, blue dots noisy training samples, and dashed blue line the true function \(f(x)=\sin(2\pi x)\). Top left: an unregularized ninth‑degree polynomial interpolates the data but oscillates wildly, a classic instance of over‑fitting. Top right: a cubic model (\(M=3\)) captures the main trend yet misses some curvature, exhibiting mild under‑fitting. Bottom: the same ninth‑degree model with \(\ell_{2}\) regularization (\(\lambda>0\)) suppresses extreme oscillations, striking a better bias–variance balance.

A widely used approach is to restrict attention to a family (hypothesis class) of candidate functions and to select within that family the function that best matches the observed data. Here the family of univariate polynomials of maximum degree \(M\) is considered: \begin{equation} \mathcal{F}_M \;=\; \Bigl\{\, f_{\mathbf{w}}(x) \;=\; \sum_{m=0}^{M} w_m\,x^{m} \ \Bigm|\ \mathbf{w}\in\mathbb{R}^{M+1}\Bigr\}, \end{equation} where the parameters \(\bm{\theta}\) correspond to the coefficients \(\mathbf{w}\). Define the basis vector \(\bm{\phi}(x)=(1,x,\dots,x^{M})^{\top}\) and the design matrix, \begin{equation} \bm{\Phi}\;=\;\bigl[\bm{\phi}(x_1)\;\cdots\;\bm{\phi}(x_N)\bigr]^{\top}\in\mathbb{R}^{N\times(M+1)}, \qquad \mathbf{y}=(y_1,\dots,y_N)^{\top}. \end{equation} The coefficients that minimize the squared difference between predictions and observed targets form the ordinary least-squares (OLS) estimator: \begin{equation} \hat{\mathbf{w}} \;=\; \operatorname*{arg\,min}_{\mathbf{w}}\, \bigl\lVert \bm{\Phi}\mathbf{w}-\mathbf{y}\bigr\rVert_2^{2} \;=\; \bigl(\bm{\Phi}^{\top}\bm{\Phi}\bigr)^{-1}\bm{\Phi}^{\top}\mathbf{y}, \end{equation} where the closed form assumes that \(\bm{\Phi}^{\top}\bm{\Phi}\) is invertible. To mitigate sensitivity to noise and numerical instabilities when \(M\) is large, a quadratic penalty (ridge or Tikhonov regularization) is commonly added: \begin{equation} \hat{\mathbf{w}} \;=\; \operatorname*{arg\,min}_{\mathbf{w}} \bigl\lVert \bm{\Phi}\mathbf{w}-\mathbf{y} \bigr\rVert_2^{2} \;+\; \lambda\,\lVert \mathbf{w}\rVert_2^{2} \;=\; \bigl(\bm{\Phi}^{\top}\bm{\Phi}+\lambda\mathbf{I}\bigr)^{-1}\bm{\Phi}^{\top}\mathbf{y}, \qquad \lambda>0. \end{equation} Intuitively, the penalty discourages overly large coefficients, which in turn discourages rapid oscillations of the fitted polynomial.
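
The two estimators above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used to produce Figure 1.1: the sample size, the random seed, the value of \(\lambda\), and the helper names `design_matrix`, `fit_ols`, and `fit_ridge` are assumptions introduced here.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix with columns 1, x, ..., x^M (shape N x (M+1))."""
    return np.vander(x, M + 1, increasing=True)

def fit_ols(x, y, M):
    """Least-squares coefficients; lstsq avoids forming the explicit inverse."""
    w, *_ = np.linalg.lstsq(design_matrix(x, M), y, rcond=None)
    return w

def fit_ridge(x, y, M, lam):
    """Ridge coefficients (Phi^T Phi + lam I)^{-1} Phi^T y, for lam > 0."""
    Phi = design_matrix(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ y)

# Toy data following the generative model above; N and the seed are assumptions.
rng = np.random.default_rng(0)
N, sigma = 10, 0.3
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=N)

w_ols = fit_ols(x, y, M=9)
w_ridge = fit_ridge(x, y, M=9, lam=1e-3)
print("OLS   coefficient norm:", np.linalg.norm(w_ols))
print("ridge coefficient norm:", np.linalg.norm(w_ridge))
```

Comparing the two printed norms gives a quick check of the shrinkage effect: the penalized coefficients cannot have a larger norm than the least-squares ones.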

Model complexity, under-fitting, and over-fitting.

The degree \(M\) controls the expressive power (capacity) of the model class. Small \(M\) yields simple functions that may be unable to capture the true sinusoidal shape; this is under-fitting. Large \(M\) admits highly flexible polynomials that can interpolate the training data, including the noise; this is over-fitting. In the experiment (Figure 1.1), \(M=3\) produces a smooth curve that misses some curvature (mild under-fitting), whereas \(M=9\) fits the \(N\) points almost exactly and exhibits pronounced oscillations between them (over-fitting). Starting from the over-fitted model, introducing the \(\ell_2\) penalty (\(\lambda>0\)) or reducing \(M\) produces a smoother curve that tracks the underlying \(\sin(2\pi x)\) more faithfully and performs better on unseen inputs.

Training vs. generalization.

Performance on the observed sample (the training error) does not necessarily reflect performance on new data (the generalization error). Over-fitted models often achieve near-zero training error while performing poorly on new inputs because they have modeled the noise. Regularization (\(\lambda>0\)) or reduced complexity (\(M\) smaller) improves generalization by trading a small increase in training error for a larger decrease in prediction error on new data.
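
The gap between training and generalization error can be observed numerically by fitting polynomials of increasing degree and evaluating them on a large held-out sample. The sketch below is illustrative only; the training-set size of 15, the held-out size of 2000, and the seed are assumptions, and the held-out error is merely a Monte Carlo proxy for the generalization error.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3

def sample(n):
    """Draw n pairs from y = sin(2*pi*x) + noise, with x uniform on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w on (x, y)."""
    pred = np.vander(x, len(w), increasing=True) @ w
    return float(np.mean((pred - y) ** 2))

x_train, y_train = sample(15)     # small training set (size is an assumption)
x_test, y_test = sample(2000)     # large held-out set approximating the population

train_err, test_err = [], []
for M in range(10):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train_err.append(mse(w, x_train, y_train))
    test_err.append(mse(w, x_test, y_test))
    print(f"M={M}: train MSE={train_err[-1]:.4f}  test MSE={test_err[-1]:.4f}")
```

Because the polynomial families are nested, the training error can only decrease as \(M\) grows, whereas the held-out error carries no such guarantee.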

Bias–variance trade-off.

This example illustrates the classical bias–variance trade-off. Low-degree polynomials have high bias (they cannot represent the sinusoid well) but low variance (predictions are stable across datasets), whereas high-degree polynomials have low bias (they can represent complex shapes) but high variance (predictions change markedly with the sampled noise). Regularization shifts the solution toward lower variance without excessively increasing bias, often yielding the lowest total prediction error.
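
The decomposition can be estimated by Monte Carlo: many training sets are drawn, a polynomial is fitted to each, and the squared bias and variance of the predictions are computed pointwise. The sketch below makes simplifying assumptions that differ from the experiment above: the inputs are held fixed on an equispaced grid of 30 points so that only the noise is resampled, and 200 repetitions are used.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, N, repeats = 0.3, 30, 200
x_train = np.linspace(0.0, 1.0, N)       # fixed design; only the noise is resampled
x_grid = np.linspace(0.0, 1.0, 50)       # evaluation points
f_grid = np.sin(2 * np.pi * x_grid)      # noise-free target on the grid

def bias_variance(M):
    """Monte Carlo estimate of squared bias and variance for degree-M OLS fits."""
    Phi = np.vander(x_train, M + 1, increasing=True)
    Phi_grid = np.vander(x_grid, M + 1, increasing=True)
    preds = np.empty((repeats, x_grid.size))
    for r in range(repeats):
        y = np.sin(2 * np.pi * x_train) + rng.normal(0.0, sigma, size=N)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        preds[r] = Phi_grid @ w
    mean_pred = preds.mean(axis=0)
    bias2 = float(np.mean((mean_pred - f_grid) ** 2))
    var = float(np.mean(preds.var(axis=0)))
    return bias2, var

results = {M: bias_variance(M) for M in (1, 3, 9)}
for M, (b2, v) in results.items():
    print(f"M={M}: bias^2={b2:.4f}  variance={v:.4f}")
```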

Interpretation.

Although polynomial regression in one dimension is deliberately simple, the same phenomena occur broadly in machine learning: richer models can interpolate, but careful control of complexity (via architecture choices, regularization, or data) is required to attain strong generalization.

Bayesian Machine Learning

In the Bayesian paradigm, probability is employed to encode a prior degree of confidence in each candidate hypothesis (Bishop, 2006; MacKay, 2003). Once a likelihood function has been posited—i.e. a mathematical description of how the observed targets could be generated—Bayes’ rule transforms these priors into posterior probabilities in light of the training evidence. Prediction then amounts to averaging the individual hypotheses’ outputs, each weighted by its posterior mass.

A convenient parameterization introduces a parameter vector \(\mathbf{w}\) indexing the hypotheses. Let \(\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_N]^{\top}\in\mathbb{R}^{N\times D}\) denote the design matrix of inputs and \(\mathbf{y}=(y_1,\dots,y_N)^{\top}\in\mathbb{R}^{N}\) the vector of targets. Given a prior \(P(\mathbf{w})\) and a likelihood \(P(\mathbf{y}|\mathbf{X},\mathbf{w})=\prod_{i=1}^{N}P(y_i| \mathbf{x}_i,\mathbf{w})\), Bayes’ rule yields the posterior \begin{equation} P(\mathbf{w}| \mathcal{D}) = \frac{P(\mathbf{y}| \mathbf{X},\mathbf{w})\,P(\mathbf{w})}{P(\mathbf{y}| \mathbf{X})} \propto P(\mathbf{y}| \mathbf{X},\mathbf{w})\,P(\mathbf{w}), \qquad \mathcal{D}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}. \end{equation} The normalizing constant \(P(\mathbf{y}| \mathbf{X}) =\int P(\mathbf{y}| \mathbf{X},\mathbf{w})\,P(\mathbf{w})\,\mathrm{d}\mathbf{w}\) is the marginal likelihood (model evidence), central to Bayesian model comparison. A point estimate such as the maximum a posteriori (MAP) solution maximizes \(\log P(\mathbf{w}| \mathcal{D}) = \log P(\mathbf{y}| \mathbf{X},\mathbf{w}) + \log P(\mathbf{w}) + \text{const}\). Prediction averages over parameter uncertainty via the posterior predictive: \begin{equation} P(y^{\star}| \mathbf{x}^{\star},\mathcal{D}) = \int P(y^{\star}| \mathbf{x}^{\star},\mathbf{w})\, P(\mathbf{w}| \mathcal{D})\, \mathrm{d}\mathbf{w}, \end{equation} which embodies the Bayesian “weighted averaging” described above.
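
As a worked illustration, the MAP estimate can be written out under two assumptions that are not part of the general formulation above: a Gaussian likelihood \(P(y_i| \mathbf{x}_i,\mathbf{w})=\mathcal{N}(y_i| \bm{\phi}(\mathbf{x}_i)^{\top}\mathbf{w},\sigma^{2})\) with feature map \(\bm{\phi}\) and design matrix \(\bm{\Phi}\) whose rows are \(\bm{\phi}(\mathbf{x}_i)^{\top}\), and a zero-mean isotropic Gaussian prior \(P(\mathbf{w})=\mathcal{N}(\mathbf{w}| \mathbf{0},\alpha^{-1}\mathbf{I})\). Taking the negative logarithm of the posterior gives \begin{equation} -\log P(\mathbf{w}| \mathcal{D}) = \frac{1}{2\sigma^{2}}\bigl\lVert \bm{\Phi}\mathbf{w}-\mathbf{y}\bigr\rVert_2^{2} + \frac{\alpha}{2}\lVert \mathbf{w}\rVert_2^{2} + \text{const}, \end{equation} so that, after multiplying by \(2\sigma^{2}\), \begin{equation} \hat{\mathbf{w}}_{\text{MAP}} = \operatorname*{arg\,min}_{\mathbf{w}} \bigl\lVert \bm{\Phi}\mathbf{w}-\mathbf{y}\bigr\rVert_2^{2} + \alpha\sigma^{2}\lVert \mathbf{w}\rVert_2^{2}. \end{equation} Under these assumptions the MAP solution coincides with the ridge estimator of the polynomial toy example with \(\lambda=\alpha\sigma^{2}\), which gives the quadratic penalty a probabilistic reading.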

Bayesian procedures offer a number of well‐known benefits over many alternative learning strategies. First, because it penalizes unnecessarily elaborate models, the framework implements an intrinsic form of Occam’s razor, thereby facilitating principled model selection (Bishop, 2006; MacKay, 2003). Second, assuming that an appropriate model class has been chosen, Bayesian inference retains the full distribution of plausible parameter values instead of collapsing uncertainty into a single point estimate; this built-in accounting for parameter variability often translates into more reliable out-of-sample predictions. Third, expert domain knowledge can be incorporated in a transparent way via the choice of prior, which is particularly advantageous when only limited training data are available.

These theoretical virtues come at a computational price. Exact posteriors typically require evaluating high-dimensional integrals or combinatorial sums, which are rarely feasible for realistic problems (Bishop, 2006; MacKay, 2003). Markov Chain Monte Carlo (MCMC) methods (Neal, 1993) sidestep the analytic intractability by drawing samples from a carefully constructed chain whose stationary distribution equals the desired posterior. However, long chains and careful convergence diagnostics render MCMC expensive in practice. A variety of faster, deterministic approximations—most notably variational techniques (Jaakkola, 2001; Minka, 2001)—replace the true posterior with a simpler, tractable family of distributions, though their applicability may be limited and certain parameter subsets can remain difficult to approximate. In such cases one often resorts to Type-II maximum likelihood (empirical Bayes), which optimizes those hard-to-integrate parameters by repeatedly invoking the approximate inference algorithm, thereby driving up the total training cost (Bishop, 2006).
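
To make the sampling idea concrete, the following sketch runs a random-walk Metropolis sampler, one of the simplest MCMC schemes, on a one-dimensional posterior. The model (Gaussian observations with unknown mean and a broad Gaussian prior), the step size, the chain length, and the burn-in are hypothetical choices made for illustration; the conjugate closed form is used only to check the samples.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: y_i ~ N(mu, 1) with prior mu ~ N(0, 10^2).
y = rng.normal(1.0, 1.0, size=20)
prior_var, noise_var = 10.0 ** 2, 1.0

def log_posterior(mu):
    """Unnormalized log posterior: log-likelihood plus log-prior, constants dropped."""
    log_lik = -0.5 * np.sum((y - mu) ** 2) / noise_var
    log_prior = -0.5 * mu ** 2 / prior_var
    return log_lik + log_prior

def metropolis(n_samples, step=0.5, init=0.0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    samples = np.empty(n_samples)
    current, current_lp = init, log_posterior(init)
    for t in range(n_samples):
        proposal = current + step * rng.normal()
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[t] = current
    return samples

chain = metropolis(20000)[2000:]          # discard burn-in (length is an assumption)

# Closed-form conjugate posterior, available here only because the model is Gaussian.
post_var = 1.0 / (1.0 / prior_var + y.size / noise_var)
post_mean = post_var * y.sum() / noise_var
print(f"MCMC mean {chain.mean():.3f} vs exact {post_mean:.3f}")
print(f"MCMC std  {chain.std():.3f} vs exact {np.sqrt(post_var):.3f}")
```

Only the unnormalized posterior is evaluated, which is why the intractable marginal likelihood never has to be computed; the cost is paid instead in chain length and convergence checks.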

Bayesian Toy Study: Polynomial Regression


Figure 1.2: Interplay of model capacity, prior strength, and sample size in Bayesian polynomial regression. Each panel shows the predictive mean (orange), a 95% credible band (shaded), ten posterior sample curves (thin grey), the noisy observations (blue dots), and the ground‑truth function \(y=\sin(2\pi x)\) (dashed blue). Top left: a twelfth‑degree polynomial under an ultra‑weak weight prior (\(\alpha=0.001\)) over‑fits dramatically, with oscillations between the sparse data points. Top right: a ninth‑degree model with the prior tightened to \(\alpha=100\) shrinks the coefficients toward zero; the predictive collapses toward a near‑flat line and the credible band narrows markedly, illustrating prior dominance. Bottom: keeping the ninth‑degree model and the strong prior (\(\alpha=100\)) but increasing the sample size to \(N=50\) lets the likelihood override the prior; oscillations diminish and uncertainty contracts wherever data are abundant, demonstrating how evidence tempers the prior and reduces epistemic uncertainty.

To visualize how Bayesian learning balances model complexity, prior beliefs, and data volume, the previous sinusoidal regression experiment is revisited under a Bayesian linear model; see Figure 1.2. Details on statistical properties, distributions and derivations can be found in Chapter 2. A zero-mean isotropic Gaussian prior encodes beliefs about coefficient magnitudes, \begin{equation} P(\mathbf{w}|\alpha)=\mathcal{N}(\mathbf{w}| \mathbf{0},\alpha^{-1}\mathbf{I}), \end{equation} where \(\alpha>0\) (the prior precision) controls the degree of shrinkage toward \(\mathbf{0}\). With Gaussian observation noise of variance \(\sigma^{2}\), Bayes’ theorem yields a Gaussian posterior: \begin{equation} P(\mathbf{w}|\mathbf{y},\alpha,\sigma) =\mathcal{N}\bigl(\mathbf{w}| \mathbf{m},\mathbf{S}\bigr),\quad \mathbf{S}^{-1}=\alpha \mathbf{I}+\sigma^{-2} \bm{\Phi}^{\top}\bm{\Phi},\quad \mathbf{m}=\sigma^{-2} \mathbf{S} \bm{\Phi}^{\top}\mathbf{y}. \end{equation} For a new input \(x^{\star}\), the predictive distribution integrates parameter uncertainty: \begin{equation} P(y^{\star}| x^{\star},\mathcal{D},\alpha,\sigma) =\mathcal{N}\bigl(y^{\star}|\boldsymbol{\phi}(x^{\star})^{\top}\mathbf{m},\; \sigma^{2}+\boldsymbol{\phi}(x^{\star})^{\top}\mathbf{S}\boldsymbol{\phi}(x^{\star})\bigr)\,. \end{equation} The variance decomposes into irreducible (aleatoric) noise \(\sigma^{2}\) and an epistemic term \(\boldsymbol{\phi}(x^{\star})^{\top}\mathbf{S}\boldsymbol{\phi}(x^{\star})\) that shrinks as informative data accumulate or as the prior becomes more concentrated. The following settings highlight canonical Bayesian behaviors:

  1. High capacity with a weak prior: a degree-\(12\) polynomial with \(\alpha=10^{-3}\) (weak shrinkage). Excess flexibility combined with a lax prior allows the model to track noise, producing oscillatory fits and large predictive uncertainty—an instance of Bayesian over-fitting under weak regularization.

  2. Strong prior under limited data: a degree-\(9\) polynomial with \(\alpha=100\) (strong shrinkage). The prior heavily down-weights large coefficients, yielding a smoother fit but also noticeable bias when data are scarce; the prior dominates the likelihood, resulting in under-fitting.

  3. More data with the same strong prior: the same degree-\(9\) model and \(\alpha=100\), but trained on \(5\) times as many samples. The increased sample size strengthens the likelihood relative to the prior, taming oscillations and reducing epistemic uncertainty; the posterior mean tracks the sinusoid more closely and the predictive bands contract.

The posterior covariance \(\mathbf{S}\) and its induced predictive variance quantify the interplay between model capacity, prior strength, and data volume. Weak priors paired with high-capacity models can yield posterior distributions that admit wiggly functions consistent with noisy samples. Strong priors can curb such behavior but risk high bias when data are limited. Increasing \(N\) concentrates the posterior around functions supported by the data, causing the likelihood to outweigh the prior and improving generalization while shrinking uncertainty. These effects are visible in Figure 1.2 through the smoothness of the posterior mean and the width of the predictive intervals.
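
The closed-form posterior and predictive distribution of the Bayesian linear model can be evaluated directly. The sketch below is not the code behind Figure 1.2: the seed, the evaluation grid, and the helper names `posterior` and `predictive` are assumptions. It contrasts the predictive variance obtained from 10 points with that obtained from 50 points that contain the first 10, using the degree-9 model with \(\alpha=100\).

```python
import numpy as np

def posterior(Phi, y, alpha, sigma):
    """Posterior mean m and covariance S for the Gaussian prior and likelihood above."""
    M1 = Phi.shape[1]
    S_inv = alpha * np.eye(M1) + Phi.T @ Phi / sigma**2
    S = np.linalg.inv(S_inv)
    m = S @ Phi.T @ y / sigma**2
    return m, S

def predictive(x_star, m, S, sigma):
    """Predictive mean and variance; the variance is sigma^2 plus the epistemic term."""
    Phi_star = np.vander(x_star, len(m), increasing=True)
    mean = Phi_star @ m
    epistemic = np.einsum("ij,jk,ik->i", Phi_star, S, Phi_star)
    return mean, sigma**2 + epistemic

rng = np.random.default_rng(4)
sigma, alpha, M = 0.3, 100.0, 9
x = rng.uniform(0.0, 1.0, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=50)
x_star = np.linspace(0.0, 1.0, 200)

variances = {}
for n in (10, 50):                        # the first 10 points are a subset of the 50
    Phi = np.vander(x[:n], M + 1, increasing=True)
    m, S = posterior(Phi, y[:n], alpha, sigma)
    _, variances[n] = predictive(x_star, m, S, sigma)
    print(f"N={n}: mean predictive std = {np.sqrt(variances[n]).mean():.3f}")
```

Because the larger dataset contains the smaller one, \(\mathbf{S}^{-1}\) can only grow, so the epistemic term cannot increase at any input while the aleatoric floor \(\sigma^{2}\) stays fixed.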

Despite the elegance and conceptual appeal of Bayesian learning, applying it to modern deep networks presents major challenges. Exact inference is intractable, and existing approximations often collapse uncertainty or become computationally prohibitive. During this thesis, these issues motivated the exploration of scalable alternatives that preserve the core Bayesian principles of uncertainty quantification and regularization. In particular, this research has contributed new approaches that reformulate inference directly in function space—avoiding ill-posed parameter posteriors—and methods that retrofit uncertainty around pre-trained deterministic models without retraining. Together, these advances bring Bayesian reasoning closer to practical deployment in modern architectures, enabling calibrated predictions, better uncertainty estimates, and deeper theoretical understanding of Bayesian deep learning.

Generalization

A predictive model is regarded as useful only insofar as it generalizes, i.e. maintains high predictive accuracy on data that were not available during training (Bishop, 2006; Vapnik, 1998). The quest to understand and to control generalization therefore sits at the heart of statistical learning theory. Classical analyses treat the training sample as a random draw from an unknown generating distribution; learning is then framed as choosing, from a hypothesis class \(\mathcal{F}\), a function whose expected risk is close to the minimum achievable risk within \(\mathcal{F}\) (Shalev-Shwartz & Ben-David, 2014). Guarantees are expressed in terms of capacity measures such as the Vapnik–Chervonenkis (VC) dimension, Rademacher complexity, or covering numbers, all of which quantify how readily the hypothesis class can fit random noise.

From a practical standpoint, limited data and noisy observations translate into a bias–variance trade-off (Hastie et al., 2009). If the hypothesis space is too restrictive, the learned predictor exhibits high bias, failing to capture important structure; if the space is overly flexible, it possesses high variance, adapting to incidental fluctuations in the training set and thereby over-fitting. Modern deep networks add a further twist: they can possess more parameters than training samples and still generalize well, a behavior linked to the “double descent” risk curve (Nakkiran et al., 2020; Zhang et al., 2017). Understanding these regimes has become an active area of contemporary research.

A repertoire of algorithmic techniques has emerged to foster generalization in practice. Regularization, whether explicit, as in \(\ell_2\) or \(\ell_1\) penalties, or implicit, as in the stochasticity of mini-batch gradient descent, constrains the effective capacity of the model. Data augmentation synthetically enlarges the sample, while cross-validation supplies an approximately unbiased estimate of out-of-sample error for model selection (Bishop, 2006). Ensemble methods such as bagging and boosting aggregate multiple hypotheses, often reducing variance without a commensurate increase in bias. Collectively, these ideas form the methodological backbone that allows contemporary machine learning systems to move beyond the training corpus and perform robustly in the real world.
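
As an illustration of model selection by cross-validation, the sketch below chooses the polynomial degree for the sinusoidal toy problem with \(K\)-fold cross-validation. The sample size, noise level, number of folds, and seed are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma, K = 40, 0.1, 5                  # sample size, noise level and folds are assumptions
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=N)

folds = np.array_split(rng.permutation(N), K)   # disjoint validation index sets

def cv_error(M):
    """Average validation MSE of a degree-M polynomial over the K folds."""
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        Phi = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, y[train], rcond=None)
        pred = np.vander(x[val], M + 1, increasing=True) @ w
        errors.append(np.mean((pred - y[val]) ** 2))
    return float(np.mean(errors))

degrees = list(range(11))
cv = [cv_error(M) for M in degrees]
best_M = degrees[int(np.argmin(cv))]
print("CV error per degree:", np.round(cv, 4))
print("selected degree:", best_M)
```

Each point is used for validation exactly once and never appears in the training split that predicts it, which is what makes the averaged error a usable proxy for out-of-sample performance.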


Figure 1.3: Two canonical generalization landscapes obtained from synthetic experiments. Left: Average test mean–squared error (MSE) when fitting polynomials of degrees \(0\) to \(10\) to \(N_{\text{train}}=40\) noisy samples from \(f(x)=\sin(2\pi x)\) (\(\sigma=0.1\)). The characteristic U‑shape illustrates the classical bias–variance trade‑off: low‑capacity models under‑fit (high bias), while high‑capacity models over‑fit (high variance). Right: Test MSE for min‑norm linear regression with \(p\) random Gaussian features and a fixed training set of \(N_{\text{train}}=50\) examples corrupted by strong noise (\(\sigma=2\)). The pronounced spike exactly at \(p=N_{\text{train}}\) marks the interpolation threshold; error falls again for \(p\!>\!N_{\text{train}}\), producing the modern double‑descent curve. Each point aggregates the results of 1 000 (left) or 200 (right) independent Monte‑Carlo repeats.

A unifying visual example (Figure 1.3).

In the classical setting (left panel) test error follows the familiar U‑curve: as model capacity grows, bias falls but variance rises, producing an optimal “sweet spot”. In stark contrast, the right‑hand panel shows that over‑parameterized learners can suffer a sharp risk explosion at the interpolation threshold and yet recover low error once capacity increases further—the so‑called double‑descent phenomenon that has reignited interest in modern capacity measures. These contrasting behaviors underscore the need for a theory that simultaneously accounts for bias, variance, and the surprising generalization of highly over‑parameterized models.

Despite substantial progress, fundamental questions remain open: what precise properties of optimization algorithms and network architectures explain the empirical generalization behavior of over-parameterized models? How can one derive capacity measures that capture those properties yet remain analytically tractable? This thesis contributes to that line of inquiry by proposing a novel theoretical framework that explains the relationship between ensemble generalization and diversity. Furthermore, a smoothness-based complexity term that remains non-vacuous even in the interpolation regime is introduced, allowing one to create a unifying explanation of a wide range of learning techniques used in modern machine learning.

Contributions

  1. A Bayesian perspective for modern deep learning
    Motivation and scope. State-of-the-art deep networks are typically deployed as deterministic predictors that output point estimates. In many settings—risk-aware decision making, clinical or scientific inference, autonomous systems, and active data acquisition—the object of interest is a predictive distribution that quantifies both epistemic and aleatoric uncertainty. Obtaining such distributions for modern architectures is difficult for three reasons. First, exact Bayesian posteriors over the parameters of deep networks are analytically intractable and computationally prohibitive to approximate at scale. Second, common approximations (e.g., simple variational families or naïve ensembles) can be miscalibrated, sensitive to optimization, and degrade accuracy if applied end-to-end. Third, parameter-space posteriors are often ill-posed surrogates for predictive uncertainty because many distinct parameter values induce near-identical functions, making inference geometry unfavorable.

    Approach and main results. The difficulties outlined above are addressed by adopting a function–space formulation of learning and by developing inference procedures that remain tractable at modern scales.

    • Function-space variational inference via Deep Variational Implicit Processes (DVIP). Deep networks are represented as compositions of implicit processes, which induces a stochastic process prior over functions without requiring explicit parametric densities in weight space. A variational family is specified directly over functions; an evidence lower bound is derived in function space; and unbiased stochastic estimators are constructed using pathwise sampling through the implicit layers. This yields calibrated predictive distributions and competitive accuracy while avoiding pathologies of parameter-space posteriors. (Chapter 3)

    • Diversity-aware ensemble objectives with theoretical support. An upper bound on ensemble risk is proved that decomposes into (i) the \(\rho\)-average member error and (ii) a negative diversity term measuring disagreement among members’ predictions. A PAC–Bayesian analysis then links the empirical diversity term to generalization error, yielding a regularized training objective that promotes beneficial diversity while preserving consistency guarantees across regression and classification losses. (Chapter 5)

    Collectively, these developments provide routes to calibrated predictive distributions: either by performing inference directly in function space (DVIP) or by training ensembles with a diversity term justified by generalization bounds.

  2. Post-hoc uncertainty for pre-trained networks
    Motivation and scope. In practical pipelines, a high-performing deterministic backbone is often already available after substantial engineering and training effort. Replacing this model with a bespoke Bayesian architecture is undesirable due to engineering cost, potential accuracy loss, and computational expense. The central need is a post-hoc mechanism that (i) attaches calibrated predictive uncertainty to an existing model without end-to-end retraining, (ii) preserves the backbone’s accuracy and inductive biases, and (iii) scales to modern datasets and architectures (e.g., ImageNet-class problems and large molecular libraries). Traditional posterior approximations rarely meet these constraints simultaneously: Markov chain Monte Carlo is generally too slow, Laplace approximations may be local and unstable, and naïve temperature scaling or dropout-based methods can be poorly calibrated, especially under distribution shift. Methods are needed that retrofit uncertainty in a principled way while controlling computational and statistical trade-offs.

    Approach and main results. Two post-hoc methodologies are introduced that attach Bayesian uncertainty to existing predictors while preserving accuracy and controlling computational cost.

    • Variational Linearized Laplace Approximation (VaLLA). A linearized Laplace approximation is taken around a pre-trained network, producing a local Gaussian posterior in function space whose covariance encodes curvature information. This covariance is reinterpreted as a kernel for a sparse variational Gaussian process with decoupled inducing variables. The surrogate GP is trained to match the local behavior of the backbone, so that the predictive mean aligns with the base model while the predictive variance reflects local epistemic uncertainty. (Chapter 4.2)

    • Fixed-Mean Gaussian Processes (FMGPs). The mean of the GP is fixed to the output of a pre-trained network; covariance hyperparameters (and inducing variables) are learned by maximizing a variational bound. This decoupling simplifies optimization, preserves the backbone’s inductive bias, and enables transparent control of uncertainty through kernel choice. Predictive intervals are computed in closed form and remain well-calibrated under moderate distribution shift. Large-scale experiments demonstrate feasibility and strong calibration in computer vision and chem-informatics benchmarks. (Chapter 4.3)

    Both approaches are architecture-agnostic, require no end-to-end retraining of the backbone, and provide calibrated uncertainty suitable for downstream decision-making under risk.

  3. Distribution-dependent generalization for interpolating models
    Motivation and scope. Over-parameterized models that interpolate the training data routinely generalize well and can exhibit double-descent risk curves, yet classical uniform-convergence bounds (e.g., VC/Rademacher) often become vacuous in these regimes. Two gaps arise. First, capacity-only bounds ignore structure in the data-generating distribution that strongly influences generalization, leading to bounds that fail to tighten with increasing sample size. Second, widely used practices—\(\ell_2\) penalties, distance-from-initialization and input-gradient constraints, architectural invariances, data augmentation, and ensemble diversity—lack a unified theory that quantifies when they reduce test error at or beyond the interpolation threshold. A distribution-dependent framework is needed that (i) remains informative for interpolators, (ii) yields a complexity measure aligned with observed concentration of empirical losses, and (iii) explains why and when added capacity or induced invariances improve generalization rather than harm it.

    Approach and main results. A distribution-dependent analysis is developed that remains informative at interpolation and unifies several empirical practices through a single complexity measure.

    • PAC–Chernoff bounds that remain tight at interpolation. A bound is derived that upper-bounds the population risk in terms of the inverse of a rate function governing the large-deviation behavior of the empirical loss. Unlike uniform-convergence bounds, the resulting guarantee depends on the data-generating distribution and is exact (perfectly tight) for any interpolator. This property holds even in over-parameterized regimes, thereby covering models at or beyond the interpolation threshold. The rate function quantitatively captures the concentration of the empirical loss and explains observed double-descent curves. (Chapter 5.3)

    • Smoothness via the rate function and consequences for regularization. The inverse rate function is used as a smoothness measure for interpolators. It is shown that standard regularizers—\(\ell_2\) penalties, constraints on distance from initialization, and control of input-gradient norms—act as proxies that decrease the inverse rate function, thereby lowering the bound on test error at interpolation. (Chapter 5.3.6)

    • Data augmentation and architectural invariances. The analysis shows that augmentation schemes can increase the rate function (i.e., increase concentration of the empirical loss), yielding smaller generalization error for interpolators under the PAC–Chernoff bound. Analogously, architectures with built-in invariances (e.g., convolutional networks for translations) are shown to avoid the concentration degradation induced by transformed inputs, again improving the bound. Conditions are identified under which these mechanisms are effective and when departures from ideal group structure still confer benefits. (Chapter 5.3.7)

    • Over-parameterization requirements and double descent. Conditions are derived under which additional parameters are required to realize smoother interpolators (in the rate-function sense). This explains why, beyond the interpolation threshold, enlarging the model can reduce test error and accounts for the decreasing branch of double descent through improved concentration properties captured by the rate function. (Chapters 5.3.5 and 5.3.8)

    • Implicit bias of stochastic gradient descent. A large-deviation perspective is introduced to analyze the implicit bias of stochastic gradient descent (SGD). It is shown that the stochasticity inherent to mini-batch updates induces an implicit regularization effect, promoting convergence toward flat minima that exhibit higher concentration of the empirical loss and, consequently, better generalization. The analysis connects the noise statistics of SGD to the rate function of the loss landscape, providing a distribution-dependent explanation of why SGD tends to favor solutions with superior generalization properties. (Chapter 5.4)

    The resulting framework specifies when interpolators generalize, provides theory-backed training objectives (including diversity-aware ensembles), and guides the design of regularization, augmentation, and architectural inductive biases that improve test performance at and beyond interpolation.
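
    The role of diversity in ensembles can be illustrated with the classical ambiguity decomposition for squared loss and uniform weights, in which the ensemble error equals the average member error minus a diversity term. This identity is a textbook special case shown for intuition only; it is not the bound or the training objective developed in this thesis, and the synthetic targets and member predictions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_members, n_points = 5, 100
y = rng.normal(size=n_points)                                 # hypothetical targets
preds = y + rng.normal(0.0, 0.5, size=(n_members, n_points))  # hypothetical member outputs

ensemble = preds.mean(axis=0)                  # uniformly weighted average prediction
ensemble_err = np.mean((ensemble - y) ** 2)
avg_member_err = np.mean((preds - y) ** 2)     # averaged over members and points
diversity = np.mean((preds - ensemble) ** 2)   # spread of members around the ensemble

print(f"ensemble error       : {ensemble_err:.4f}")
print(f"avg member error     : {avg_member_err:.4f}")
print(f"diversity            : {diversity:.4f}")
print(f"avg error - diversity: {avg_member_err - diversity:.4f}")
```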

Full list of Publications

This section provides a comprehensive list of the publications produced during my PhD.

Main publications

In this subsection, I present the principal articles that form the core contributions of this dissertation.

  1. Ortega, L.A., Cabañas, R., & Masegosa, A. (2022). Diversity and Generalization in Neural Network Ensembles. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (CORE A), in Proceedings of Machine Learning Research 151:11720-11743. Available from https://proceedings.mlr.press/v151/ortega22a.html.
    Detailed in Chapter 5.

  2. Ortega, L.A., Rodríguez-Santana, S., & Hernández-Lobato, D. (2023). Deep Variational Implicit Processes. The Eleventh International Conference on Learning Representations (CORE A+). Available from https://iclr.cc/virtual/2023/poster/12053.
    Detailed in Chapter 3.

  3. Ortega, L.A., Rodriguez Santana, S., & Hernández-Lobato, D. (2024). Variational Linearized Laplace Approximation for Bayesian Deep Learning. Proceedings of the 41st International Conference on Machine Learning (CORE A+), in Proceedings of Machine Learning Research 235:38815-38836. Available from https://proceedings.mlr.press/v235/ortega24a.html.
    Detailed in Chapter 4.2.

  4. Masegosa, A., & Ortega, L.A. (2025). PAC-Chernoff Bounds: Understanding Generalization in the Interpolation Regime. Journal of Artificial Intelligence Research (Q2). Presented at the European Conference on Artificial Intelligence (CORE A) as a spotlight presentation. Available from https://doi.org/10.1613/jair.1.17036.
    Detailed in Chapter 5.3.

  5. Ortega, L.A., Rodriguez Santana, S., & Hernández-Lobato, D. (2025). Fixed-mean Gaussian Processes for ad-hoc Bayesian Deep Learning. Submitted to the Journal of Machine Learning Research. Available from https://arxiv.org/abs/2412.04177.
    Detailed in Chapter 4.3.

  6. Ortega, L.A., & Masegosa, A. (2025). A Large Deviation Theory Analysis on the Implicit Bias of SGD. Neurocomputing (Q1). Available from https://www.sciencedirect.com/science/article/abs/pii/S0925231226003577.
    Detailed in Chapter 5.4.

Other Publications

In this subsection, I present other publications produced during my PhD that are not directly included as part of this dissertation.

  1. Zhang, Y., Wu, Y., Ortega, L.A., & Masegosa, A. (2024). The Cold Posterior Effect Indicates Underfitting, and Cold Posteriors Represent a Fully Bayesian Method to Mitigate It. Transactions on Machine Learning Research. Available from https://openreview.net/forum?id=GZORXGxHHT.

  2. Casado, I., Ortega, L.A., Pérez, A., & Masegosa, A. (2024). PAC-Bayes-Chernoff bounds for unbounded losses. The Thirty-eighth Annual Conference on Neural Information Processing Systems (CORE A+). Available from https://openreview.net/forum?id=CyzZeND3LB.

  3. Ortega, L.A., Rodriguez Santana, S., & Hernández-Lobato, D. (2025). Scalable Linearized Laplace Approximation via Surrogate Neural Kernel. Submitted to ESANN 2026 (CORE B).

Chapters Summary

Chapter 2 – Bayesian Inference, Gaussian Processes and Generalization Bounds. This chapter develops the mathematical and probabilistic foundations required for the thesis. It begins with a measure-theoretic treatment of probability, introducing \(\sigma\)-algebras, probability measures, conditional probability, independence, and random variables. Bayesian inference is then presented as a principled framework for uncertainty estimation, with emphasis on approximate inference techniques such as variational inference, the mean-field family, and the black-box \(\alpha\)-energy objective. Gaussian processes (GPs) are introduced as non-parametric Bayesian models for functions, covering their predictive distributions, hyper-parameter learning, scalability via inducing-point approximations and random features, and a dual interpretation through Gaussian measures in Hilbert spaces. The chapter concludes with generalization theory in the form of Probably Approximately Correct (PAC) bounds, contrasting classical uniform convergence results with the more general PAC–Bayesian framework, which provides distribution-dependent guarantees. Together, these elements establish the theoretical backbone for the subsequent chapters on scalable Bayesian deep learning and generalization in modern neural networks.

Chapter 3 – Deep Variational Implicit Processes. This chapter introduces Deep Variational Implicit Processes (DVIPs), a flexible Bayesian framework for uncertainty-aware deep learning. Building upon the concept of implicit processes (IPs), which generalize Gaussian processes by defining priors through “samplable” but density-free stochastic processes, DVIPs extend variational implicit processes (VIPs) to multi-layer hierarchies. Each layer employs IP-based priors approximated by GPs, enabling scalable variational inference with Monte Carlo sampling and stochastic optimization. Unlike VIPs, DVIPs yield non-Gaussian predictive distributions, improving flexibility and calibration. Extensive experiments on regression benchmarks, image classification, and large-scale datasets demonstrate that DVIPs match or surpass deep GPs while being more computationally efficient. The chapter highlights the role of depth, prior adaptation, and domain-specific priors (e.g., CNN-based) in enhancing expressivity and predictive performance.

Chapter 4 – Post-hoc Uncertainty Estimation for Pre-trained Networks. This chapter addresses post-hoc uncertainty estimation for pre-trained deep networks through two complementary approaches: the Variational Linearized Laplace Approximation (VaLLA) and Fixed-Mean Gaussian Processes (FMGPs). VaLLA reformulates the linearized Laplace approximation (LLA) in function space, recasting its covariance as a GP kernel and enabling scalable surrogates with sparse variational GPs. FMGPs take this further by treating any pre-trained model as the deterministic mean of a GP, training only the covariance while keeping the mean fixed, thereby converting deterministic predictors into calibrated Bayesian ones. Both methods allow uncertainty estimation without retraining the original network and scale to large datasets such as ImageNet or molecular property prediction tasks. Experiments confirm that these approaches achieve competitive calibration with modest computational overhead, establishing practical recipes for retrofitting uncertainty into modern deep learning models.

Chapter 5 – Generalization in Neural Networks. This chapter brings together complementary lines of research that address why modern deep-learning models can generalize even when they contain far more parameters than training points.

  • Chapter 5. Although practitioners know that ensembles work best when their members err differently, there has been no consensus definition of “diversity” nor a theory that links it to test error in deep networks. A unified diversity measure is introduced and, via an upper-bound decomposition, the ensemble error is shown to split into two terms: the average individual error and a negative contribution from diversity. A subsequent PAC-Bayesian analysis makes this link rigorous and distribution-dependent, revealing how correlation between members controls the bound. Experiments on CIFAR-10/100 classifiers and a Wine-Quality regressor confirm that higher measured diversity correlates with a larger gap between the mean member loss and the ensemble loss, as the theory predicts.

  • Chapter 5.3. Classical VC- or Rademacher-style bounds become vacuous for modern, over-parameterized interpolators. The chapter therefore seeks bounds that depend on the data-generating distribution rather than only on the finite training sample. The resulting PAC–Chernoff bound is tight for any model that perfectly fits the training set. It reproduces the double-descent curve by showing how widening a network can increase the rate function after the interpolation threshold, thus lowering test error.

    Classical regularizers (\(\ell_2\) decay, distance‑from‑init, input‑gradient penalties) as well as data‑augmentation and invariant architectures all act by enlarging the rate function, i.e. making the model smoother.

  • Chapter 5.4. Explores the implicit bias of stochastic gradient descent (SGD) through the lens of large-deviation theory. It demonstrates that the stochasticity of mini-batch updates induces an implicit regularization effect that biases learning toward flatter minima with higher loss concentration, thereby improving generalization. The analysis connects the noise covariance of SGD to the rate function governing the loss landscape, offering a distribution-dependent explanation of why SGD tends to converge to solutions with superior generalization performance and how batch size, learning rate, and noise structure influence this behaviour.
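
To make the "average individual error minus diversity" structure from the ensemble summary above concrete, the sketch below uses the classical ambiguity decomposition for squared error with uniform averaging, where the split holds as an exact identity. This is a stand-in chosen for illustration: it is not the unified diversity measure or the upper-bound decomposition developed in the thesis, and the function name and toy data are hypothetical.

```python
import random


def ambiguity_decomposition(member_preds, target):
    """Return (ensemble_loss, mean_member_loss, diversity) for squared error.

    Assumption: the ensemble prediction is the uniform average of its members.
    Under that assumption, ensemble_loss == mean_member_loss - diversity.
    """
    k = len(member_preds)
    ensemble_pred = sum(member_preds) / k
    ensemble_loss = (ensemble_pred - target) ** 2
    # Average squared error of the individual members.
    mean_member_loss = sum((f - target) ** 2 for f in member_preds) / k
    # Spread of the members around the ensemble prediction.
    diversity = sum((f - ensemble_pred) ** 2 for f in member_preds) / k
    return ensemble_loss, mean_member_loss, diversity


if __name__ == "__main__":
    rng = random.Random(0)
    target = 1.0  # toy regression target
    preds = [target + rng.gauss(0.0, 0.5) for _ in range(5)]
    ens, avg, div = ambiguity_decomposition(preds, target)
    print("ensemble loss      :", ens)
    print("mean member loss   :", avg)
    print("diversity          :", div)
    print("mean loss - divers.:", avg - div)
```

Because the diversity term is non-negative, the ensemble loss never exceeds the mean member loss in this setting, and the gap between the two equals the diversity. For general losses such an equality need not hold, which is where a bound-based decomposition becomes relevant.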

Appendix A – Miscellaneous. This appendix provides the precise synthetic setups behind Figure 1.3 and collects closed-form Gaussian identities that are used later in the thesis.

Appendix B – Deep Variational Implicit Processes: full regression results. This appendix gathers the complete UCI regression tables and additional figures underpinning Chapter 3 (VIP, DVIP with varying depth, sparse GPs, and deep GPs). Exact values referenced in the chapter as Table B.1 are reported here.

Appendix C – Mathematical Proofs. This appendix collects the formal statements and proofs of the theoretical results from all chapters of this dissertation.

Appendix D – BayesiPy: Post-hoc Bayesian Inference for Pre-trained Neural Networks. This appendix introduces BayesiPy, a Python library that wraps a trained backbone with a Bayesian posterior approximation to obtain calibrated predictive means and uncertainties without end-to-end retraining. It catalogues the methods included in the library, namely Linearised Laplace (LLA), Accelerated LLA (ELLA), Variational LLA (VaLLA), mean-field variational inference (MFVI), Spectral-Normalised GP (SNGP), and Fixed-Mean GP (FMGP), and provides usage examples (e.g., FMGP over a PyTorch model), highlighting when each approach is preferable.