Bayesian Inference, Gaussian Processes and Generalization Bounds
In this chapter, the basics of Bayesian inference, Gaussian processes, and theoretical bounds for the generalization error of deep learning models are introduced.
Probability–Theoretic Foundations
The statistical methodologies developed in this thesis rest on the rigorous measure–theoretic formulation of probability introduced by Kolmogorov (1956) and further elaborated in standard monographs such as Billingsley (1995) and Durrett (2019). In this chapter, the key concepts and notational conventions that will be employed throughout this dissertation are introduced. The presentation is self–contained but deliberately concise; readers seeking a comprehensive treatment are referred to the aforementioned texts as well as Klenke (2013) for additional details.
Measurable Spaces and Probability Measures
Let \(\Omega\) denote the sample space: the collection of all possible outcomes of a well–defined experiment or data–generating mechanism. A subset \(A \subseteq \Omega\) is called an event. In order to assign probabilities consistently, a suitable family of events that is stable under the usual set operations is required.
Definition 2.1 \(\sigma\)–algebra. Let \(\mathcal P(\Omega)\) be the power set of \(\Omega\). A non–empty collection \(\mathcal F \subseteq \mathcal P(\Omega)\) is called a \(\sigma\)–algebra if
\(\Omega \in \mathcal F\); (contains the whole space)
\(A \in \mathcal F \;\Rightarrow\; A^{\mathrm c} \in \mathcal F\); (closed under complementation)
\(\{A_n\}_{n \in \mathbb N} \subset \mathcal F \;\Rightarrow\; \bigcup_{n \in \mathbb N} A_n \in \mathcal F\). (closed under countable unions)
The pair \((\Omega,\mathcal F)\) is called a measurable space. It follows from the definition that \(\emptyset \in \mathcal F\) and that \(\mathcal F\) is closed under countable intersections.
Definition 2.2 Probability measure. Given a measurable space \((\Omega,\mathcal F)\), a mapping \(P:\mathcal F \to [0,1]\) is called a probability measure if
Non–negativity: \(P(A) \ge 0\) for all \(A \in \mathcal F\);
Normalization: \(P(\Omega)=1\);
Countable additivity: for any countable collection of pairwise disjoint events \(\{A_n\}_{n \in \mathbb N} \subset \mathcal F\), it verifies that \(P\big(\, \bigcup_{n \in \mathbb N} A_n\big)=\sum_{n \in \mathbb N} P(A_n)\).
The triple \((\Omega,\mathcal F,P)\) is termed a probability space. From the axioms, one immediately derives \(P(\emptyset)=0\) and the inclusion–exclusion identity \(P(A \cup B)=P(A)+P(B)-P(A \cap B)\) for all \(A,B \in \mathcal F\).
Conditional Probability and Independence
Definition 2.3 Conditional probability. Let \(A,B \in \mathcal F\) with \(P(B)>0\). The conditional probability of \(A\) given \(B\) is defined by \begin{equation} P(A| B) = \frac{P(A \cap B)}{P(B)}. \end{equation}
Theorem 2.4 Bayes’ theorem. For any \(A,B \in \mathcal F\) with \(P(B)>0\), \begin{equation} P(A| B) = \frac{P(B| A)P(A)}{P(B)}. \end{equation}
Bayes’ theorem underpins modern Bayesian statistics (Gelman et al., 2013) by enabling the systematic update of prior beliefs \(P(A)\) in the light of new evidence \(B\).
Definition 2.5 Independence. Events \(A,B \in \mathcal F\) are called independent if \(P(A \cap B)=P(A)P(B)\), equivalently \(P(A| B)=P(A)\) whenever \(P(B)>0\). More generally, \(A\) and \(B\) are conditionally independent given a third event \(C\) with \(P(C)>0\) if \(P(A \cap B| C)=P(A| C)P(B| C)\); this is noted as \(A \perp B \mid C\).
Random Variables
A random variable formalizes the idea of a numerical quantity generated by a random experiment. Following Athreya & Lahiri (2006), the measure–theoretic definition is adopted.
Definition 2.6 Random variable. Let \((\Omega,\mathcal F,P)\) be a probability space and \((E,\mathcal E)\) a measurable space. A measurable function \(X:\Omega \to E\) is called a random variable. The push–forward measure \(P_X := P \circ X^{-1}\) is the distribution of \(X\).
Throughout the thesis, random variables are denoted by uppercase letters (e.g., \(X\)), whereas realizations or deterministic values are rendered in lowercase (e.g., \(x\)). Boldface symbols (e.g., \(\mathbf{X}\)) indicate random vectors. Usually, \(\bm{\theta}\) denotes the full parameter vector (or collection of parameters), while \(\theta\) denotes a single scalar parameter—either the parameter of a single-parameter model or an individual component \(\theta_j\) of \(\bm{\theta}\).
Distribution Functions
Definition 2.7 Cumulative distribution function. For a real–valued random variable \(X\), the mapping \(F_X:\mathbb R \to [0,1]\) given by \(F_X(x)=P(X\le x)\) is called the cumulative distribution function (CDF).
If \(X\) takes values in a countable set, it is termed discrete and its probability mass function is \(p_X(x)=P(X=x)\). When \(X\) has an uncountable support and there exists a density \(f_X\) with respect to Lebesgue measure such that \(F_X(x)=\int_{-\infty}^x f_X(u)\,\mathrm{d}u\), \(X\) is said to be continuous with probability density function \(f_X\). Mixed–type variables can be represented as combinations of discrete and continuous components (Grimmett & Stirzaker, 2001).
For the rest of this thesis, the integral notation will be used uniformly for discrete and continuous cases. For discrete, the integral is understood with respect to the counting measure \(\#(\mathrm{d}x)=\sum_{n\in\mathcal I}\delta(x-n)\,\mathrm{d}x\).
Joint, Marginal, and Conditional Distributions
Let \(X\) and \(Y\) be random variables on the same probability space. The joint distribution \(P_{X,Y}\) is defined on the product \(\sigma\)–algebra and satisfies \(P_{X,Y}(A\times B)=P(X\in A, Y\in B)\). The marginal of \(X\) is obtained by integration (or summation) over \(Y\): \begin{equation} P_X(A)=\int_{y} P_{X,Y}(A,\mathrm{d}y). \end{equation} Conditional distributions \(P_{X\mid Y}\) are defined analogously whenever \(P_Y\) assigns positive probability to the conditioning event or set. Two random variables are independent if \(P_{X,Y}=P_X P_Y\); conditional independence given a third variable \(Z\) is denoted by \(X\perp Y\mid Z\) and is central to graphical–model semantics (Lauritzen, 1996).
A sequence \(\{X_n\}_{n=1}^N\) is said to be independent and identically distributed (i.i.d.) if the joint CDF factorizes \(F_{X_1,\dots,X_N}(x_1,\dots,x_N)=\prod_{n=1}^N F_{X_1}(x_n)\). Random vectors are collected into column form \(\mathbf X=(X_1,\dots,X_N)^T\); no ambiguity should arise between vector notation and sets of variables.
Bayesian Inference
Effective generalization and reliable uncertainty estimation remain pivotal challenges in contemporary machine learning research. Bayesian inference addresses these challenges by systematically incorporating uncertainty through probabilistic modeling (Bishop, 2006; MacKay, 2003). In Bayesian supervised learning, observed data are linked to model parameters via a likelihood function, while prior distributions quantify initial beliefs about these parameters. Using Bayes’ theorem, these beliefs are updated in light of empirical evidence, yielding posterior distributions over parameters. However, exact Bayesian inference is generally infeasible for complex models, prompting the adoption of approximate methods (Blei et al., 2017; Neal, 1993).
In contemporary machine learning, a Bayesian interpretation of probability offers a principled way to capture both our ignorance about the true model and the uncertainty in its parameters. Consider a model governed by parameters \(\bm{\theta}\) and a training set \(\mathcal{D}\). Equipped with a prior distribution \(P(\bm{\theta})\), Bayes’ rule transforms this prior into the posterior distribution that incorporates the information contained in the data: \begin{equation} P(\bm{\theta}|\mathcal{D}) = \frac{P(\mathcal{D}|\bm{\theta})\, P(\bm{\theta})} {P(\mathcal{D})}\,, \end{equation} where \(P(\mathcal{D}|\bm{\theta})\) is the likelihood, measuring how well a particular choice of parameters explains the observed data, and \begin{equation} P(\mathcal{D})=\int P(\mathcal{D}|\bm{\theta}) P(\bm{\theta}) \,\mathrm{d}\bm{\theta}\,, \end{equation} is the marginal likelihood (or evidence) ensuring that the posterior is a proper probability distribution.
Beyond parameter estimation
Once computed, the posterior can serve multiple roles: it yields predictive distributions for new data, supplies calibrated credible intervals, and underpins hierarchical extensions in which additional latent variables are introduced to simplify computation or enrich the model (Bishop, 2006; MacKay, 2003).
Bayesian vs. Frequentist Viewpoints
This Bayesian treatment differs fundamentally from a frequentist approach (Hastie et al., 2001). In frequentist statistics, the parameters are considered fixed but unknown; one first produces a point estimate and then quantifies its sampling variability. In the Bayesian framework, by contrast, \(\bm{\theta}\) is regarded as uncertain, and the entire posterior expresses our updated beliefs after observing \(\mathcal{D}\). Importantly, assigning a distribution to \(\bm{\theta}\) is a statement about epistemic uncertainty, not a claim that the parameters physically fluctuate at random.
Epistemic (model) uncertainty arises from limited data, incomplete knowledge, or model misspecification; it reflects uncertainty about the parameters or the hypothesis class and is, in principle, reducible with additional informative data or a better model. It tends to be large in regions of the input space that are underrepresented (including out-of-distribution inputs) and is commonly quantified by variability across plausible models, e.g., Bayesian posterior dispersion or the spread of predictions from deep ensembles. By contrast, aleatoric (data) uncertainty stems from intrinsic randomness in the data-generating process—such as measurement noise, label noise, or inherent class overlap—and is irreducible even with infinite data. Aleatoric uncertainty may be homoscedastic (constant across inputs) or heteroscedastic (input-dependent), and is typically modeled via an explicit noise/variance term in the likelihood or through distributional/quantile losses. Conceptually, total predictive uncertainty decomposes into these two components (via the law of total variance): epistemic captures uncertainty about the model, while aleatoric captures uncertainty in the observations themselves.
The Bayesian treatment of a simple Bernoulli process is described next. Let \(\mathcal{D}\) denote a dataset of \(n\) independent coin tosses \(\{x_{1},\dots,x_{n}\}\), where each \(x_{i}\in\{0,1\}\) equals \(1\) (“heads”) with probability \(\theta\) and \(0\) (“tails”) otherwise. The likelihood of the entire sequence, conditioned on the fixed but unknown bias of the coin \(\theta\), is \begin{equation} P(\mathcal{D}|\theta) = \prod_{i=1}^{n} \theta^{x_{i}}\,(1-\theta)^{1-x_{i}} = \theta^{k}\,(1-\theta)^{n-k}, \end{equation} where \(k=\sum_{i=1}^{n}x_{i}\) is the number of heads.
A prior distribution is said to be conjugate to a likelihood family when, after observing any data sample, the resulting posterior distribution belongs to the same parametric family as the prior, differing only in its hyper-parameters. A conjugate prior for \(\theta\in[0,1]\) is the Beta distribution, \(\theta\sim\mathrm{Beta}(\alpha_{0},\beta_{0})\), with density proportional to \(\theta^{\alpha_{0}-1}(1-\theta)^{\beta_{0}-1}.\) The hyper-parameters \(\alpha_{0}\) and \(\beta_{0}\) encode prior pseudo-counts of heads and tails respectively; choosing \(\alpha_{0}=\beta_{0}=1\) yields a uniform prior. By Bayes’ theorem, the posterior is \begin{equation} P(\theta|\mathcal{D}) = \frac{P(\mathcal{D}|\theta)\,P(\theta)}{P(\mathcal{D})} = \mathrm{Beta}\,\bigl(\alpha_{0}+k,\;\beta_{0}+n-k\bigr)\,. \label{eq:coin_posterior} \end{equation} From Equation \(\eqref{eq:coin_posterior}\), closed-form expressions for the first two moments can be obtained: \begin{equation} \mathbb{E}[\theta|\mathcal{D}]=\frac{\alpha_{0}+k}{\alpha_{0}+\beta_{0}+n}, \quad \mathrm{Var}[\theta|\mathcal{D}] = \frac{(\alpha_{0}+k)(\beta_{0}+n-k)} {(\alpha_{0}+\beta_{0}+n)^{2}\,(\alpha_{0}+\beta_{0}+n+1)}\,. \end{equation} See Figure 2.1 for an illustration on how the prior distribution is updated into the posterior. Equation \(\eqref{eq:coin_posterior}\) shows that each observed head increments the shape parameter \(\alpha_{0}\), while each tail increments \(\beta_{0}\), updating prior pseudo-counts with empirical counts. Consequently,
as \(n\to\infty\) the posterior mean tends to the empirical proportion \(k/n\), and the variance \(\mathrm{Var}[\theta|\mathcal{D}]\to 0\), increasing certainty about \(\theta\);
for small samples, the prior exerts appreciable influence, pulling the posterior toward the prior mean \(\alpha_{0}/(\alpha_{0}+\beta_{0})\) and preventing over-confident estimates when \(k\) is near \(0\) or \(n\);
the additive form of Equation \(\eqref{eq:coin_posterior}\) reveals that the Beta prior contributes exactly \(\alpha_{0}+\beta_{0}\) virtual observations, providing an intuitive interpretation of the hyper-parameters as prior evidence.
With the uniform prior \((\alpha_{0},\beta_{0})=(1,1)\), the posterior reduces to \(\mathrm{Beta}(1+k,1+n-k)\); for \(n=0\) this coincides with the prior, reflecting complete ignorance before any data are observed. This conjugate Beta–Binomial model affords analytical tractability, facilitating closed-form credible intervals, model comparison via the marginal likelihood \(P(\mathcal{D})=\mathrm{Beta}(\alpha_{0}+k,\beta_{0}+n-k)/ \mathrm{Beta}(\alpha_{0},\beta_{0})\), and principled incorporation of expert knowledge through the choice of \((\alpha_{0},\beta_{0})\).
Given \(n\) Bernoulli trials with \(k\) heads, the frequentist treats the success probability \(\theta\) as a fixed but unknown constant and employs maximum-likelihood estimation: \begin{equation} \hat{\theta}_{\text{MLE}}=\frac{k}{n}\,. \end{equation} Sampling uncertainty is quantified via a confidence interval. Using the (asymptotic) Wald interval, one obtains \begin{equation} \hat{\theta}_{\text{MLE}} \;\pm\; z_{1-\alpha/2}\, \sqrt{\frac{\hat{\theta}_{\text{MLE}}\, (1-\hat{\theta}_{\text{MLE}})}{n}}, \end{equation} where \(z_{1-\alpha/2}\) is the standard–normal quantile (e.g. \(1.96\) for \(95\%\) confidence). Exact alternatives such as the Clopper–Pearson or Wilson intervals avoid the poor small‐sample performance of the Wald form but require numerical evaluation.
The fundamental differences between the frequentist and Bayesian paradigms can be summarized as follows:
Ontological stance: the frequentist regards \(\theta\) as fixed and randomness arises solely from repeated sampling; the Bayesian treats \(\theta\) itself as random, attributing epistemic uncertainty to it through the Beta prior.
Estimation: both approaches yield point estimates that converge to \(k/n\) as \(n\to\infty\), but the Bayesian posterior mean \(\mathbb{E}[\theta|\mathcal{D}] ={(\alpha_{0}+k)}/{(\alpha_{0}+\beta_{0}+n)}\) is a shrinkage estimator, drawing extreme sample proportions toward the prior mean when \(n\) is small.
Uncertainty quantification: a \(100(1-\alpha)\%\) confidence interval contains the true \(\theta\) in repeated experiments with long‐run frequency \(1-\alpha\); a \(100(1-\alpha)\%\) credible interval from the Beta posterior assigns probability \(1-\alpha\) to the statement “\(\theta\) lies in this set” conditioned on the observed data. The interpretive gap is especially evident for small samples, where the frequentist interval may be empty or degenerate (e.g. \(k=0\) or \(k=n\)), whereas the Bayesian interval remains well‐behaved via the prior’s pseudo‐counts.
Role of prior information: frequentist inference eschews prior knowledge, aiming for procedures with guaranteed coverage irrespective of subjective beliefs; the Bayesian framework naturally incorporates expert opinion or previous studies through \((\alpha_{0},\beta_{0})\), with transparent impact that diminishes as \(n\) grows.
Computational tractability: while both the Bayesian and frequentist analyses are analytically solvable for the simple Bernoulli–Binomial model, this tractability does not extend to more realistic problems. Bayesian inference is only analytically feasible in conjugate or highly simplified cases; in general, it requires approximate or sampling-based methods such as variational inference or MCMC, which are substantially more computationally demanding than frequentist estimators.
Model comparison and prediction: the Bayesian marginal likelihood, \begin{equation} P(\mathcal{D}) = \mathrm{Beta}(\alpha_{0}+k,\beta_{0}+n-k)/ \mathrm{Beta}(\alpha_{0},\beta_{0})\,, \end{equation} furnishes a principled basis for model selection and naturally integrates parameter uncertainty into predictive distributions. Frequentist counterparts often rely on information criteria (AIC, BIC) or cross‐validation, treating fitted parameters as point estimates.
These differences exemplify the broader philosophical and methodological divergence between Bayesian and frequentist paradigms while also highlighting their agreement in large‐sample limits, where prior influence fades and both approaches tend toward the empirical proportion \(k/n\).
Variational Inference
Modern Bayesian models in machine learning typically involve high-dimensional parameter spaces, non-conjugate likelihoods, complex latent structures, and large datasets. In such settings, the posterior distribution is rarely available in closed form, and exact marginalization requires integrals that are analytically intractable and computationally prohibitive. Posterior landscapes are often multi-modal and strongly correlated, further complicating integration and rendering naive quadrature or exact dynamic programming infeasible. These obstacles motivate approximate inference methods that deliver tractable surrogates to the true posterior while respecting the computational constraints of modern practice (e.g., mini-batch training, accelerators, and streaming data) (Bishop, 2006; Blei et al., 2017; MacKay, 2003).
Approximate Bayesian inference is often required because exact posteriors are analytically intractable for most realistic models. A widely used deterministic approach is Variational Inference (VI) (Jordan, 1999), which recasts inference as an optimization problem. VI approximates the true posterior with a simpler, tractable distribution and minimizes a divergence—typically the Kullback–Leibler divergence—between them, thus trading exactness for computational efficiency (Blei et al., 2017; Jordan, 1999).
VI (Jordan, 1999) exchanges the inference issues related to the Bayesian approach with an optimization problem. In VI, a family of distributions \(\mathcal{Q}\) over the latent variables \(\bm{\theta}\) needs to be fixed. The aim is to find the element that minimizes its Kullback-Leibler divergence with the intractable posterior \(P(\bm{\theta} | \mathcal{D})\) within that family: \begin{equation} Q^{\star} = \argmin_{Q \in \mathcal{Q}} \operatorname{KL}(Q(\bm{\theta})|P(\bm{\theta} | \mathcal{D}))\,. \end{equation}
Definition 2.8 Kullback-Leibler divergence. Let \(P\) and \(Q\) be two probability distributions over the same probability space \(\Omega\), the Kullback-Leibler divergence \(\mathrm{KL}(Q|P)\) measures the dissimilarity between both distributions as \begin{equation} \mathrm{KL}(Q|P) = \mathbb{E}_Q\left[\log Q(x) - \log P(x)\right]. \end{equation} The Kullback-Leibler divergence is defined if and only if for all \(x \in \Omega\) such that \(P(x) = 0\), then \(Q(x) = 0\). In measure terms, \(Q\) is absolutely continuous with respect to \(P\).
Proposition 2.9. The Kullback-Leibler divergence is always non-negative.
Proof
As the logarithm is bounded by \(x - 1\), \begin{equation} \log{x} \leq x - 1 \implies \frac{P(x)}{Q(x)} - 1 \geq \log{\frac{P(x)}{Q(x)}}. \end{equation} Since probabilities are non-negative, multiply by \(Q(x)\) in the last inequality \begin{equation} P(x) - Q(x) \geq Q(x) \log \frac{P(x)}{Q(x)} = Q(x) \log{P(x)} - Q(x) \log{Q(x)}. \end{equation} Now, integrating both sides \begin{equation} 0 \geq \mathbb{E}_Q[\log{P(x)} - \log{Q(x)}] \implies \mathbb{E}_Q[\log{Q(x)} - \log{P(x)}] \geq 0. \end{equation} Moreover, the Kullback-Leibler divergence is \(0\) if and only if the two distributions are equal almost everywhere. □
Variational Bayesian methods or simply variational methods are primarily used for two purposes:
Perform statistical inference over the unobserved variables by providing an analytical approximation to their posterior probability.
Derive and compute a lower bound for the observed data marginal likelihood \(\log P(\mathcal{D})\). This is commonly used to perform model selection, where a model with a higher marginal likelihood has a greater probability for being the model that generated the data.
These \(Q\) distributions are typically referred to as variational distributions—the term comes from calculus of variations, a field that studies optimization over functions—of the optimization problem. The Kullback-Leibler divergence between the variational distribution \(Q(\bm{\theta})\) and the real distribution \(P(\bm{\theta} |\mathcal{D})\) may be decomposed in the following way: \begin{equation} \begin{aligned} \mathrm{KL}(Q(\bm{\theta})|P(\bm{\theta} | \mathcal{D})) &= \mathbb{E}_{Q(\bm{\theta})}[\log{Q(\bm{\theta})}]- \mathbb{E}_{Q(\bm{\theta})}[\log{P(\bm{\theta} | \mathcal{D})}]\\ &= \mathbb{E}_{Q(\bm{\theta})}[\log{Q(\bm{\theta})}] - \mathbb{E}_{Q(\bm{\theta})}[\log P(\mathcal{D}, \bm{\theta}) - \log P(\mathcal{D}) ]\\ &= \mathbb{E}_{Q(\bm{\theta})}[\log{Q(\bm{\theta})}] - \mathbb{E}_{Q(\bm{\theta})}[\log{P(\mathcal{D}, \bm{\theta})}] + \log{P(\mathcal{D})}\,. \end{aligned} \end{equation} Although the Kullback-Leibler divergence cannot be computed since \(P(\bm{\theta}|\mathcal{D})\) is unknown, an equivalent objective can be optimized; its positiveness can be used to set the following lower bound to the evidence, defined as Evidence Lower Bound or ELBO: \begin{equation} \log{P(\mathcal{D})} \geq - \underbrace{\mathbb{E}_{Q(\bm{\theta})}[\log{Q(\bm{\theta})}]}_{\mathrm{Entropy}} + \underbrace{\mathbb{E}_{Q(\bm{\theta})}[\log{P(\mathcal{D}, \bm{\theta})}]}_{\mathrm{Energy}}\,. \end{equation} Energy and Entropy terms come from statistical physics terminology ((Barber, 2012)). As \(P(\mathcal{D})\) does not depend on \(Q\), minimizing the Kullback-Leibler divergence is equivalent to maximizing the ELBO, where equality holds if and only if \(Q(\bm{\theta}) = P(\bm{\theta} | \mathcal{D})\). The ELBO may be written as \begin{equation} \label{eq:elbo2} \begin{aligned} \text{ELBO}(Q) &= \mathbb{E}_{Q(\bm{\theta})}[\log{P(\bm{\theta})}] + \mathbb{E}_{Q(\bm{\theta})}[\log{P(\mathcal{D} | \bm{\theta})}] - \mathbb{E}_{Q(\bm{\theta})}[\log{Q(\bm{\theta})}]\\ &= \mathbb{E}_{Q(\bm{\theta})}[\log{P(\mathcal{D} | \bm{\theta})}] - \mathrm{KL}(Q(\bm{\theta})|P(\bm{\theta}))\,, \end{aligned} \end{equation} where it is expressed as the sum of the log likelihood of the observations and the Kullback-Leibler divergence between the prior \(P(\bm{\theta})\) and the variational distribution \(Q(\bm{\theta})\). See Figure 2.2 for an illustration of the decomposition. This formula can be further examined: a variational distribution that maximizes the first term is the one that makes the observed data most probable; in contrast, the variational distribution that maximizes the second term is \(Q(\bm{\theta}) = P(\bm{\theta})\), the prior. This gives the intuitive idea that the KL divergence in the second term works as a regularizer for the variational distribution, i.e., this distribution needs to explain the data (maximize the first term) without diverging from the prior distribution (maximize the second term).
Mean Field Family
In most cases, the optimization of the ELBO is intractable if a complex variational family of distributions is selected. A common assumption is that \(\mathcal Q\) is the mean-field variational family. There, \(\mathcal{Q}\) is defined as the family of distributions where the variables are mutually independent, i.e, any \(Q \in \mathcal{Q}\) verifies \begin{equation} Q(\bm{\theta}) = \prod_{j=1}^{P}Q_{j}(\theta_{j}), \end{equation} where \(\bm{\theta} = \{\theta_{1},\dots,\theta_{P}\}\) is the considered set of latent variables. Notice that each factor \(Q_{j}\) can be different and this family does not depend on the observed data.
The mean-field family can capture any marginal of the latent variables but not the correlation between them, as it assumes they are independent. For example, consider a two-dimensional Gaussian distribution where most of the density is concentrated inside the blue ellipse shown in Figure 2.3. A mean-field approximation would factorize as a product of two Gaussian distributions, condensing its density in a circle or flat ellipse as shown in light red. Both cases in the figure highlight the structural limitation of classical mean-field factorizations—namely, their inability to represent off-diagonal covariance—thereby motivating richer variational families for correlated posteriors.
Notice the parametric form of each factor \(Q_{j}\) is not specified, and the appropriate configuration depends on the variable. For example, a continuous variable might have a Gaussian factor, and a categorical variable might have a categorical factor.
Black-box \(\alpha\)-energy
Classical variational inference (VI) frames posterior computation as an optimization problem by choosing a tractable family \(\mathcal Q\) and minimizing \(\mathrm{KL}(Q(\boldsymbol\theta)|P(\boldsymbol\theta|\mathcal D))\). Although convenient, the asymmetric Kullback–Leibler divergence penalizes overlap rather than coverage, leading to under-dispersed approximations in multi–modal settings (Minka, 2005). The black-box \(\alpha\)-energy framework (BB-\(\alpha\)) (Hernandez-Lobato et al., 2016) generalizes VI by replacing the KL with the \(\alpha\)–divergence of Zhu & Rohwer (1995): \begin{equation} D_\alpha(p|q) = \frac{1}{\alpha(1-\alpha)} \left(1-\int P(\boldsymbol\theta)^{\alpha} Q(\boldsymbol\theta)^{1-\alpha}\, \mathrm d\boldsymbol\theta\right), \qquad \alpha\in\mathbb R\setminus\{0,1\}. \label{eq:alpha_div} \end{equation}
Proposition 2.10. The \(\alpha\)–divergence interpolates between the two asymmetric KLs: \begin{equation} \lim_{\alpha\to 0} D_\alpha(P|Q)=\mathrm{KL}(Q|P), \quad \text{and,} \quad \lim_{\alpha\to 1} D_\alpha(P|Q)=\mathrm{KL}(P|Q). \end{equation}
Proof
Write \(I(\alpha):=\int P(\bm \theta)^{\alpha} Q(\bm \theta)^{1-\alpha}\, \mathrm d \bm \theta\). The \(\alpha\)–divergence in Equation \(\eqref{eq:alpha_div}\) can be expressed as \begin{equation} D_{\alpha}(P|Q)=\frac{1-I(\alpha)}{\alpha(1-\alpha)}\,. \end{equation} Because \(\int Q(\bm \theta)\, \mathrm d \bm \theta=1\), it verifies that \(I(0)=1\). L’Hôpital’s rule therefore applies: \begin{equation} \lim_{\alpha\to 0}D_{\alpha}(P|Q) =\lim_{\alpha\to 0} \frac{-I'(\alpha)}{1-2\alpha}\,, \end{equation} where \(I'(\alpha)\) denotes the derivative of \(I\) with respect to \(\alpha\). Differentiate under the integral sign (permitted by the positivity of \(P,Q\)): \begin{equation} I'(\alpha) =\int P(\bm \theta)^{\alpha} Q(\bm \theta)^{1-\alpha} \log\frac{P(\bm \theta)}{Q(\bm \theta)}\, \mathrm d\bm \theta. \end{equation} Evaluating at \(\alpha=0\) yields \(I'(0)=\int Q(\bm \theta)\log\frac{P(\bm \theta)}{Q(\bm \theta)}\, \mathrm{d}\bm \theta\). Consequently \begin{equation} \lim_{\alpha\to 0}D_{\alpha}(P|Q) =-\int Q(\bm \theta)\log\frac{P(\bm\theta)}{Q(\bm \theta)}\, \mathrm d\bm \theta =\mathrm{KL}(Q|P). \end{equation} Using the symmetry \(D_{\alpha}(P|Q)=D_{1-\alpha}(Q|P)\), substitute \(\tilde\alpha=1-\alpha\) and apply the first part: \begin{equation} \lim_{\alpha\to 1}D_{\alpha}(P|Q) =\lim_{\tilde\alpha\to 0}D_{\tilde\alpha}(Q|P) =\mathrm{KL}(P|Q)\,. \end{equation} □
Exact minimization of \(D_{\alpha}(p|q)\) is analytically intractable because the true posterior \(P(\boldsymbol\theta|\mathcal D)\) is unknown. For i.i.d. data \(\mathcal D=\{(\mathbf x_{n},y_{n})\}_{n=1}^{N}\) the objective decomposes as (Li & Gal, 2017) \begin{align} \mathcal L_\alpha(q)= R_{\beta}(Q|P) -\frac{1}{\alpha}\sum_{n=1}^{N} \log\mathbb E_{Q}[P(y_{n}|\mathbf x_{n}, \boldsymbol\theta)^{\alpha}]\, \label{eq:bb_alpha_objective} \end{align} where \(R_{\beta}\) is the Rényi divergence of order \(\beta=N/(N-\alpha)\) (Rényi, 1961): \begin{equation} R_{\beta}(Q|P) :=(\beta-1)^{-1}\log \int Q(\bm \theta)^{\beta}P(\bm \theta)^{1-\beta}\, \mathrm{d}\bm \theta\,. \end{equation} For \(\alpha\ll N\) the Rényi term approaches the familiar \(\mathrm{KL}(Q|P)\), yielding the approximate \(\alpha\)–energy \begin{equation} \tilde{\mathcal L}_{\alpha}(Q) :=\mathrm{KL}(Q|P) -\frac{1}{\alpha}\sum_{n=1}^{N} \log\mathbb E_{Q}[P(y_{n}|\mathbf x_{n}, \boldsymbol\theta)^{\alpha}]. \label{eq:alpha_energy} \end{equation} The expectation inside the logarithm in Equation \(\eqref{eq:alpha_energy}\) is rarely analytic. BB-\(\alpha\) employs Monte Carlo samples \(\{\boldsymbol\theta^{(s)}\}_{s=1}^{S}\sim q(\bm \theta)\) and the reparameterization trick to obtain an unbiased estimate of the energy but a biased estimate of its gradient due to Jensen’s inequality: \begin{equation} \log\mathbb E_{Q}[h(\boldsymbol\theta)] \approx \log\left(\frac1S\sum_{s=1}^{S}h(\boldsymbol\theta^{(s)})\right)\,. \end{equation} The bias decays as \(\mathcal O(S^{-1})\) (Hernandez-Lobato et al., 2016) and is negligible in practice for \(S\ge 10\).
Setting \(\alpha=0\) recovers standard VI, \(\alpha=1\) yields power–EP, and \(\alpha=\tfrac12\) reproduces the Hellinger distance. Values \(\alpha\in(0,1)\) encourage the variational posterior to amortize between mode–seeking (VI) and mass–covering (power–EP) behavior, often giving tighter marginal likelihood estimates and better calibrated predictive variances (Buntine & Jaakkola, 2022; Hernandez-Lobato et al., 2016). Refer to Figure 2.4 for an illustration on the impact of the value on \(\alpha\). For \(\alpha\to-\infty\) — the solution collapses onto the narrowest mode, reproducing the extreme mode–seeking behavior of the reverse KL; \(\alpha=0\) — reverse (standard variational Bayes), which still concentrates on a single mode but with finite variance; \(\alpha=0.5\) — a symmetric choice that balances mode fidelity and mass coverage; \(\alpha=1\) — forward (power–EP limit), yielding a mass–covering approximation that sacrifices peak height to explain both modes; \(\alpha\to\infty\) — the approximation centers on the broader–variance mode, emphasizing coverage of high–mass regions. Together, the panels demonstrate the continuous transition from mode–seeking to mass–covering behavior as \(\alpha\) moves from negative to positive infinity, thereby motivating the use of \(\alpha\)–divergence objectives to trade off accuracy and calibration in approximate Bayesian inference.
Gaussian Processes
GPs furnish a principled, non‑parametric Bayesian framework for function approximation in which prior assumptions are encoded through a mean function \(m\) and a positive‑definite covariance (kernel) function \(K\) (Rasmussen & Williams, 2006). The GP prior, combined with a suitable likelihood, yields closed‑form predictive distributions that quantify epistemic uncertainty—an ability that is indispensable in safety‑critical and uncertainty‑sensitive applications (Deisenroth et al., 2015; Girard et al., 2003). Let \((X_1,\dots,X_n)\) be a vector of real-valued random variables with joint distribution: \begin{equation} (X_1,\dots,X_n)\; \sim\; \mathcal N(\boldsymbol{\mu},\boldsymbol{\Sigma}), \end{equation} where \(\boldsymbol{\mu}\in\mathbb R^{n}\) and \(\boldsymbol{\Sigma}\in\mathbb R^{n\times n}\) denote the mean vector and positive-definite covariance matrix, respectively. For any index set \(\mathcal J\subseteq\{1,\dots,n\}\) with \(m:=|\mathcal J|\), define the sub-vector and sub-matrix \begin{equation} \bm{\mu}_{\mathcal J} \;:=\;(\mu_{j})_{j\in\mathcal J}\in\mathbb R^{m}, \qquad \bm{\Sigma}_{\mathcal J} \;:=\;(\Sigma_{jk})_{j,k\in\mathcal J}\in\mathbb R^{m\times m}. \end{equation} Marginalization of a multivariate Gaussian preserves Gaussianity; hence \begin{equation} (X_j)_{j\in\mathcal J} \;\sim\; \mathcal N(\boldsymbol{\mu}_{\mathcal J}, \boldsymbol{\Sigma}_{\mathcal J}) \quad\text{for every } \mathcal J\subseteq\{1,\dots,n\}. \end{equation} Conversely, any finite set of evaluations drawn from a Gaussian process must satisfy this same marginal consistency property.
Definition 2.11 Gaussian process. A Gaussian process is a collection \(\{f(\mathbf x)\}_{\mathbf x\in\mathbb R^M}\) of random variables such that every finite sub-collection is jointly Gaussian. For a given mean \(m:\mathbb{R} \to \mathbb{R}\) and covariance \(K:\mathbb{R}\times \mathbb{R} \to \mathbb{R}\) functions, it is denoted as \begin{equation*} f\;\sim\;\mathcal{GP}\bigl(m(\cdot),\,K(\cdot,\cdot)\bigr). \end{equation*}
Given an input design matrix \(\mathbf X=(\mathbf x_1,\dots,\mathbf x_N)^{\mathsf T}\in\mathbb R^{N\times M}\), let \(\mathbf f=(f(\mathbf x_1),\dots,f(\mathbf x_N))^{\mathsf T}\). By Definition 2.11, \begin{equation} \mathbf f\;\sim\;\mathcal N\bigl(m(\mathbf X),\,K(\mathbf X,\mathbf X)\bigr), \end{equation} where \(m(\mathbf X):=[m(\mathbf x_1),\dots,m(\mathbf x_N)]^{\mathsf T}\) and \([K(\mathbf X,\mathbf X)]_{ij}=K(\mathbf x_i,\mathbf x_j)\).
Learning the parameters of a finite‑dimensional Gaussian distribution requires \(N+N(N+1)/2\) degrees of freedom—a regime that is typically over‑determined in classical parametric models such as linear regression. In contrast, GP regression assumes that the latent function \(f\) is a realization of an infinite‑dimensional stochastic process, making the inference problem under‑determined. Practical inference usually proceeds by restricting the kernel to a parametric family and, without loss of generality, setting \(m= 0\).
A widely used choice is the squared-exponential or radial-basis-function (RBF) kernel: \begin{equation} \label{eq:rbf_kernel} K_{\text{RBF}}(\mathbf x,\mathbf x') = \sigma^{2}\exp\left[-\frac{\lVert\mathbf x-\mathbf x'\rVert_{2}^{2}}{2\ell^{2}}\right], \end{equation} parameterized by a characteristic length‑scale \(\ell>0\) and marginal variance \(\sigma^{2}>0\). The kernel in Equation \(\eqref{eq:rbf_kernel}\) is infinitely differentiable, thereby encoding strong prior beliefs about the smoothness of \(f\) (Stein, 1999). Figure 2.5 visualizes typical sample paths and induced marginal/joint densities for the hyper‑parameters \(\ell=0.30\) and \(\sigma^{2}=1\).
Gaussian Likelihood and Posterior Prediction
In supervised regression, one observes noisy targets \(y_i=f(\mathbf x_i)+\varepsilon_i\) with i.i.d. noise \(\varepsilon_i\sim\mathcal N(0,\sigma^{2}_{\varepsilon})\). Denote by \(\mathbf y=(y_1,\dots,y_N)^{\mathsf T}\) the training outputs. For a collection of test inputs \(\mathbf X^{\star}\) the joint prior over training and test function values is \begin{equation} \begin{bmatrix} \mathbf f\\ \mathbf f^{\star} \end{bmatrix} \sim \mathcal N\left( \begin{bmatrix} m(\mathbf X)\\ m(\mathbf X^{\star}) \end{bmatrix}, \begin{bmatrix} K(\mathbf X,\mathbf X) & K(\mathbf X,\mathbf X^{\star})\\ K(\mathbf X^{\star},\mathbf X) & K(\mathbf X^{\star},\mathbf X^{\star}) \end{bmatrix}\right). \end{equation} Because the convolution of Gaussians remains Gaussian (Murphy, 2012), marginalizing the latent \(\mathbf f\) yields the joint distribution \begin{equation} \begin{bmatrix} \mathbf y\\ \mathbf f^{\star} \end{bmatrix} \sim \mathcal N\left( \begin{bmatrix} m(\mathbf X)\\ m(\mathbf X^{\star}) \end{bmatrix}, \begin{bmatrix} K(\mathbf X,\mathbf X)+\sigma^{2}_{\varepsilon}\mathbf I & K(\mathbf X,\mathbf X^{\star})\\ K(\mathbf X^{\star},\mathbf X) & K(\mathbf X^{\star},\mathbf X^{\star}) \end{bmatrix}\right). \end{equation} Conditioning on the observed data delivers the GP predictive distribution (Rasmussen & Williams, 2006): \begin{equation} \label{eq:gp_pred} \mathbf f^{\star}|\mathbf y \;\sim\; \mathcal N(\boldsymbol{\mu}^{\star},\,\boldsymbol{\Sigma}^{\star}), \end{equation} where \begin{align} \boldsymbol{\mu}^{\star} &= m(\mathbf X^{\star}) + K(\mathbf X^{\star},\mathbf X) \bigl[K(\mathbf X,\mathbf X)+\sigma^{2}_{\varepsilon}\mathbf I\bigr]^{-1} \bigl(\mathbf y-m(\mathbf X)\bigr), \\ \boldsymbol{\Sigma}^{\star} &= K(\mathbf X^{\star},\mathbf X^{\star}) - K(\mathbf X^{\star},\mathbf X) \bigl[K(\mathbf X,\mathbf X)+\sigma^{2}_{\varepsilon}\mathbf I\bigr]^{-1} K(\mathbf X,\mathbf X^{\star}). \end{align} Figure 2.6 illustrates the effect of observing five noisy data points on the posterior mean and variance under an RBF kernel.
Although the squared-exponential kernel is a convenient default, its strong smoothness assumption can lead to model mismatch in the presence of high-frequency or non-stationary phenomena (Rasmussen & Williams, 2006).
Hyper–Parameter Learning
Let \(\boldsymbol\theta=(\sigma_\varepsilon^{2},\vartheta)\) denote the collection of hyper-parameters, comprising the observation-noise variance \(\sigma_\varepsilon^{2}\) together with the kernel parameters \(\vartheta\) (length–scales, spectral amplitudes, etc.). Conditioning on a data set \(\mathcal D=\{(\mathbf x_{i},y_{i})\}_{i=1}^{N}\) and adopting the Gaussian–likelihood model \(y_i=f(\mathbf x_i)+\varepsilon_i\) with \(\varepsilon_i\overset{\text{iid}}{\sim}\mathcal N(0,\sigma_\varepsilon^{2})\), the marginal likelihood (or evidence) is obtained by integrating out the latent function values \(\mathbf f:=f(\mathbf X)\): \begin{align} \log P(\mathbf y|\boldsymbol\theta) &=\log\int P(\mathbf y|\mathbf f,\sigma_\varepsilon^{2})\, P(\mathbf f|\vartheta)\,\mathrm d\mathbf f \nonumber\\ &= -\tfrac12\mathbf y^{\mathsf T}\mathbf C^{-1}\mathbf y -\tfrac12\log\det\mathbf C -\tfrac N2\log 2\pi, \label{eq:gp_log_evidence} \end{align} where \(\mathbf C=\mathbf K_{\vartheta}+\sigma_\varepsilon^{2}\mathbf I\) and \(\bigl[\mathbf K_{\vartheta}\bigr]_{ij} =K_{\vartheta}(\mathbf x_i,\mathbf x_j)\). Equation \(\eqref{eq:gp_log_evidence}\) decomposes into a data–fit term (first), a complexity penalty proportional to the model capacity (second), and a constant normalization term. Maximizing the evidence embodies Occam’s razor by trading off goodness of fit against model complexity (MacKay, 1992; Rasmussen & Williams, 2006). Gradients with respect to \(\boldsymbol\theta\) can be computed analytically, \begin{equation} \frac{\partial\log p(\mathbf y|\boldsymbol\theta)}{\partial\theta_j} =\frac{1}{2} \, \mathrm{tr}\left[\left(\mathbf\alpha\mathbf\alpha^{T} -\mathbf C^{-1}\right)\frac{\partial\mathbf C}{\partial\theta_j}\right] \end{equation} with \(\mathbf\alpha=\mathbf C^{-1}\mathbf y\), making L-BFGS or conjugate-gradient optimization straightforward (Nocedal & Wright, 2006).
In practice, squared–exponential kernels are commonly equipped with Automatic Relevance Determination (ARD) length-scales, assigning a separate smoothness parameter to each input coordinate; maximizing the marginal likelihood \(\log P(\mathbf{y}| \bm{\theta})\) then performs data-driven feature weighting and, in effect, feature selection (Rasmussen & Williams, 2006). Figure 2.7 complements this discussion by showing samples drawn from Gaussian processes with different kernels, illustrating how kernel choice (and its hyper-parameters) controls the qualitative properties of functions—e.g., smoothness, periodicity, or roughness. This qualitative view connects to hyperparameter learning because the marginal likelihood can be used not only to tune \(\bm{\theta}\) within a kernel family but also to compare kernel families and select the one expected to generalize best. Alternative Bayesian treatments place a prior on \(\bm{\theta}\) and sample from the hyper-posterior via Markov chain Monte Carlo (Filippone & Engler, 2015; Gilks et al., 1995). When gradients are unreliable or the objective is multi-modal, Bayesian optimization of the marginal likelihood itself is an effective strategy (Snoek et al., 2012).
Computational Considerations
Exact inference requires solving a linear system and evaluating a log-determinant for the \(N\times N\) matrix \(\mathbf C\), incurring \(\mathcal O(N^{3})\) floating-point operations and \(\mathcal O(N^{2})\) memory. Consequently, vanilla GPs are limited to a few thousand training points on commodity hardware (Rasmussen & Williams, 2006). Research over the past two decades has produced two broad solution classes:
Structured algebra exploits Kronecker, Toeplitz or inducing Kronecker product structure to obtain nearly linear scaling (Gardner et al., 2018; Wilson & Nickisch, 2015).
Low-rank or sparse approximations replace the full covariance by a rank-\(M\) surrogate with \(M\ll N\), leading to \(\mathcal O(NM^{2})\) training cost. Within this family, variational inducing-point methods dominate due to their solid theoretical footing and favorable empirical accuracy (Hensman et al., 2013; Titsias, 2009).
Inducing–Point Approximations
Exact GP inference scales cubically with the number of training cases \(N\) and therefore becomes prohibitive for data sets that nowadays easily reach \(10^{5}\)–\(10^{6}\) points. A fruitful line of research mitigates this cost by introducing a much smaller set of \(M\ll N\) inducing inputs (or pseudo–inputs) \(\mathbf Z=\{\mathbf z_{m}\}_{m=1}^{M}\) together with their function values \(\mathbf u=f(\mathbf Z)=(f(\mathbf z_{1}),\dots,f(\mathbf z_{M}))^{\mathsf T}\) (Snelson & Ghahramani, 2006). This section provides a comprehensive discussion of the resulting sparse Gaussian–process (SGP) framework and its variational reformulation.
Historical Perspective and Deterministic Schemes
The earliest sparse models such as the subset of regressors (SoR) approximation replaced the full covariance \(\mathbf K_{NN}\) by the Nyström factor \(\mathbf K_{N M}\mathbf K_{M M}^{-1}\mathbf K_{M N}\), reducing storage to \(\mathcal O(NM)\) but at the cost of severe underestimation of predictive uncertainty (Smola & Schölkopf, 2000). The deterministic training conditional (DTC), partially independent training conditional (PITC) and the fully independent training conditional (FITC) successively restored more of the original covariance structure while retaining \(\mathcal O(NM^{2})\) complexity (Quiñonero-Candela & Rasmussen, 2005). These methods can be interpreted as exact inference in a modified probabilistic model; hence they do not optimise a true marginal likelihood with respect to \(\mathbf Z\).
Subset of regressors (SoR).
The original sparse–GP idea is to retain only a small subset \(\mathbf Z=\{\mathbf z_{m}\}_{m=1}^{M}\) with \(M\!\ll\!N\) and to build the covariance of the full data set from these \(M\) points via the Nyström approximation \begin{equation} \tilde{\mathbf K}_{\text{SoR}} \;=\; \mathbf K_{N M}\, \mathbf K_{M M}^{-1}\, \mathbf K_{M N}, \qquad \mathbf K_{A B}:=K(\mathbf A,\mathbf B), \end{equation} first proposed by Smola & Schölkopf (2000). Because all computations now factor through the \(M\times M\) matrix \(\mathbf K_{M M}\), training requires only \(\mathcal{O}(N M^{2})\) time and \(\mathcal{O}(N M)\) memory, while each test prediction costs \(\mathcal{O}(M^{2})\). The price of this efficiency is that \(\operatorname{rank}\bigl(\tilde{\mathbf K}_{\text{SoR}}\bigr)\!\le\!M\); consequently the predictive variance vanishes as the test input \(\mathbf x^{\star}\) moves far from all inducing points, yielding overly confident forecasts at the fringes of the input space.
Deterministic training conditional (DTC).
DTC corrects this problem by retaining the exact marginal variance on the training set: \(\tilde{\mathbf K}_{\mathrm{DTC}} =\tilde{\mathbf K}_{\mathrm{SoR}} +\mathrm{diag}\bigl(\mathbf K_{NN}-\tilde{\mathbf K}_{\mathrm{SoR}}\bigr)\) (Seeger, 2003). Predictive variances are now sensible away from data, yet DTC preserves only diagonal residuals and therefore fails to model inter-point correlations.
Partially and fully independent training conditionals.
PITC partitions the data into \(B\) disjoint blocks \(\{\mathcal I_{b}\}_{b=1}^{B}\) and adds back block-diagonal residuals: \begin{equation} \tilde{\mathbf K}_{\mathrm{PITC}} =\tilde{\mathbf K}_{\mathrm{SoR}} +\mathrm{blkdiag}\, _{b=1}^{B} \Bigl(\mathbf K_{\mathcal I_{b}\mathcal I_{b}} -\tilde{\mathbf K}_{\mathrm{SoR},\mathcal I_{b}}\Bigr). \end{equation} For \(B=N\) PITC reduces to the fully independent training conditional (FITC) of Snelson & Ghahramani (2006), which adds back all diagonal residual terms. Both schemes require \(\mathcal O(NM^{2})\) time and \(\mathcal O(NM)\) memory, and critically they restore the exact predictive variance at the training inputs while maintaining tractable algebra: \begin{align*} \mu_{\mathrm{FITC}}(\mathbf x^{\star}) &=\mathbf q_{x^{\star}}^{\mathsf T} \bigl(\mathbf Q_{NN}+\bm\Lambda+\sigma_\varepsilon^{2}\mathbf I\bigr)^{-1} \mathbf y,\\ \sigma^{2}_{\mathrm{FITC}}(\mathbf x^{\star}) &=k_{x^{\star}x^{\star}} -\mathbf q_{x^{\star}}^{\mathsf T} \bigl(\mathbf Q_{NN}+\bm\Lambda+\sigma_\varepsilon^{2}\mathbf I\bigr)^{-1} \mathbf q_{x^{\star}}, \end{align*} with \(\mathbf q_{x^{\star}}=\mathbf K_{M M}^{-1}\mathbf K_{M x^{\star}}\), \(\mathbf Q_{NN}=\mathbf K_{N M}\mathbf K_{M M}^{-1}\mathbf K_{M N}\) and \(\bm\Lambda=\mathrm{diag}(\mathbf K_{NN}-\mathbf Q_{NN})\). Empirically, FITC attains the lowest root-mean-square error among deterministic sparse schemes but exhibits a pronounced tendency to inflate length-scales during evidence maximization (Bauer et al., 2016).
Interpretation as exact inference in an amended model.
All of the above approximations can be written \(p_{\text{approx}}(\mathbf f) =\mathcal N(\mathbf0,\tilde{\mathbf K})\) with \(\tilde{\mathbf K}\) not equal to the original kernel matrix; therefore, performing exact GP inference in a different probabilistic model. Hyper-parameter learning optimizes the marginal likelihood of this amended model, which can lead to systematic bias in the inferred length-scales and noise variance (Titsias, 2009). The variational framework of the following section addresses this shortcoming by optimizing a genuine lower bound on the true evidence.
Summary of deterministic approximations.
Table 2.1 contrasts algebraic form, computational cost, and known pathologies. Although superseded in most applications by variational inducing-point methods, deterministic schemes remain useful for debugging and as initializers because they require no optimization over covariance matrices.
| Method | \(\tilde{\mathbf K}\) structure | Train cost | Known issues |
|---|---|---|---|
| SoR | rank-\(M\) Nyström | \(NM^{2}\) | variance collapse |
| DTC | rank-\(M\) \(+\) diag | \(NM^{2}\) | no off-diag residuals |
| FITC | rank-\(M\) \(+\) diag residual | \(NM^{2}\) | length-scale inflation |
| PITC | rank-\(M\) \(+\) block-diag residual | \(NM^{2}\) | block selection heuristic |
Variational Formulation
A principled alternative casts sparsification as variational inference (Titsias, 2009). Select a set of inducing locations \(\mathbf Z \subset \mathcal{X}\), then augment the GP prior with \(\mathbf u := f(\mathbf Z)\), \begin{equation} P(\mathbf f,\mathbf u) = P(\mathbf f|\mathbf u)\,P(\mathbf u) = \mathcal N(\mathbf f; \mathbf K_{N M}\mathbf K_{M M}^{-1}\mathbf u,\, \mathbf K_{N N}-\mathbf Q_{N N})\, \mathcal N\left(\mathbf u;\mathbf 0,\mathbf K_{M M}\right), \end{equation} where \(\mathbf Q_{N N}=\mathbf K_{N M}\mathbf K_{M M}^{-1}\mathbf K_{M N}\). Introducing a variational joint density \(Q(\mathbf f, \mathbf u) = P(\mathbf f|\mathbf u) Q(\mathbf u)\), with \(Q(\mathbf u)=\mathcal N(\boldsymbol\mu,\mathbf S)\) yields the evidence lower bound: \begin{equation} \mathcal L =\sum_{i=1}^{N}\mathrm E_{Q(f_i)}[\log P(y_i| f_i)] -\mathrm{KL}[Q(\mathbf u)|P(\mathbf u)], \qquad Q(\mathbf f)=\int P(\mathbf f|\mathbf u)Q(\mathbf u)\,\mathrm d\mathbf u\,. \label{eq:elbo_titsias} \end{equation} Optimizing \(\mathcal L\) with respect to \(\boldsymbol\mu\) and \(\mathbf S\) in closed form collapses the bound to the FITC marginal likelihood plus an information–theoretic correction term that discourages degenerate covariance approximations (Bauer et al., 2016). Hyper–parameters \(\boldsymbol\theta\) and inducing locations \(\mathbf Z\) are then learned by gradient ascent on Equation \(\eqref{eq:elbo_titsias}\). Critically, \(\mathcal O(NM^{2})\) time and \(\mathcal O(NM)\) memory make mini-batch stochastic optimization possible, paving the way for the Stochastic Variational GP (SVGP) of Hensman et al. (2013).
The predictive distribution at test inputs \(\mathbf X^{\star}\) remains analytic, being Gaussian with: \begin{align*} \mathbb E_{q}[f(\mathbf X^{\star})] &=\mathbf K_{\star M}\mathbf K_{M M}^{-1}\bm\mu\,,\\ \mathrm{Var}_{q}[f(\mathbf X^{\star})] &=\mathbf K_{\star}-\mathbf K_{\star M}\mathbf K_{M M}^{-1} (\mathbf K_{M M}-\mathbf S) \mathbf K_{M M}^{-1}\mathbf K_{M \star}\,. \end{align*} The computational cost is now \(\mathcal O(M^{2})\) per test point, independent of \(N\). See Figure 2.8 for an illustration of this method.
The number \(M\) and the placement of \(\mathbf Z\) heavily influence predictive accuracy and computational cost. Common heuristics initialize \(\mathbf Z\) with \(k\)–means centroids, random subsets of data, Latin–hypercube designs, or condition–number–based strategies (Rubin & Stein, 2016). Empirical studies show that subsequent gradient refinement is essential; otherwise, the optimizer partly compensates by inflating length–scales, leading to over–smooth predictions (Burt & Rasmussen, 2020).
| Method | Train time | Predict time | Memory |
|---|---|---|---|
| Exact GP | \(\mathcal O(N^{3})\) | \(\mathcal O(PN^{2})\) | \(\mathcal O(N^{2})\) |
| FITC / PITC | \(\mathcal O(NM^{2})\) | \(\mathcal O(PM^{2})\) | \(\mathcal O(NM)\) |
| SVGP (mini-batch) | \(\mathcal O(BM^{2})\) | \(\mathcal O(PM^{2})\) | \(\mathcal O(NM)\) or \(\mathcal O(M^{2})\) with caching |
Table 2.2 summarizes the asymptotic trade–offs. For modern GPU-accelerated libraries, \(M\) in the low thousands is common, permitting data sets two orders of magnitude larger than those tractable by direct Cholesky factorization while retaining high fidelity to the exact GP posterior (Burt & Rasmussen, 2020).
Bayesian Linear–Regression Surrogates for Gaussian Processes
Exact GP regression inherits the cubic computational cost of inverting the \(N\times N\) kernel matrix \(\mathbf K\). A complementary route to scalability is to approximate the GP itself by a Bayesian linear model in a finite feature space. Early instances include relevance–vector machines (Quiñonero-Candela & Rasmussen, 2005; Tipping, 2001), while modern treatments exploit random–feature expansions (Balog et al., 2016; Gal & Turner, 2015; Rahimi & Recht, 2007).
Let \(K:\mathcal X\times\mathcal X\to\mathbb R\) be a continuous, symmetric, positive–definite kernel whose diagonal \(K(\mathbf x,\mathbf x)\) is integrable. By Mercer’s theorem (Mercer, 1909), there exist non-negative eigenvalues \(\{\lambda_{i}\}_{i=1}^{\infty}\) and an orthonormal set of eigen-functions \(\{\psi_{i}\}_{i=1}^{\infty}\subset L^{2}(\mathcal X)\) such that \begin{equation} \label{eq:mercer_expansion} K(\mathbf x,\mathbf x') = \sum_{i=1}^{\infty}\lambda_{i}\psi_{i}(\mathbf x)\psi_{i}(\mathbf x')\,. \end{equation} Define the (countably infinite) feature map
\begin{equation} \begin{aligned} \boldsymbol\phi:\mathcal X&\to\ell^{2}\\ \mathbf x & \mapsto (\sqrt{\lambda_{1}}\psi_{1}(\mathbf x), \sqrt{\lambda_{2}}\psi_{2}(\mathbf x),\dots)^{\mathsf T}. \end{aligned} \end{equation} Then \(K\) is the inner product in feature space, \(K(\mathbf x,\mathbf x')=\bm\phi(\mathbf x)^{\mathsf T} \bm\phi(\mathbf x')\). Using a zero prior mean (extensions to \(m(\cdot)\neq0\) are straightforward) and substituting Equation \(\eqref{eq:mercer_expansion}\) into the standard GP predictive equations yields \begin{align} \mu^{\star} &= \boldsymbol\phi(\mathbf x^{\star})^{\mathsf T} \boldsymbol\Phi(\mathbf X) \bigl[\boldsymbol\Phi(\mathbf X)^{\mathsf T} \boldsymbol\Phi(\mathbf X)\bigr]^{-1}\mathbf y, \label{eq:feature_mean}\\ \Sigma^{\star} &= \boldsymbol\phi(\mathbf x^{\star})^{\mathsf T}\boldsymbol\phi(\mathbf x^{\star}) - \boldsymbol\phi(\mathbf x^{\star})^{\mathsf T} \boldsymbol\Phi(\mathbf X) \bigl[\boldsymbol\Phi(\mathbf X)^{\mathsf T} \boldsymbol\Phi(\mathbf X)\bigr]^{-1} \boldsymbol\Phi(\mathbf X)^{\mathsf T} \boldsymbol\phi(\mathbf x^{\star}),\label{eq:feature_var} \end{align} with \(\boldsymbol\Phi(\mathbf X) :=[\boldsymbol\phi(\mathbf x_{1}),\dots,\boldsymbol\phi(\mathbf x_{N})]^{\mathsf T}\). Equations \(\eqref{eq:feature_mean}\)–\(\eqref{eq:feature_var}\) coincide with the posterior of a Bayesian linear model \(f(\mathbf x)=\boldsymbol\phi(\mathbf x)^{\mathsf T}\mathbf w\), \(\mathbf w\sim\mathcal N(\mathbf0,\mathbf I)\). The obstacle is that the feature map is infinite–dimensional.
For shift‐invariant kernels \(K(\mathbf x,\mathbf x')=k(\mathbf x-\mathbf x')\) Bochner’s theorem states that \(k\) is the Fourier transform of a finite non–negative measure \(\tau\). Drawing \(\{\boldsymbol\omega_{s}\}_{s=1}^{S}\sim \tau\) and random phases \(\{b_{s}\}_{s=1}^{S}\sim\mathcal U[0,2\pi]\) one obtains the random Fourier feature (RFF) map \begin{equation} \hat{\boldsymbol\phi}_{\, \text{RFF}}(\mathbf x) =\frac{1}{\sqrt S} \bigl[\cos(\bm\omega_{0}^{\mathsf T}\mathbf x+b_{0}),\, \sin(\bm\omega_{0}^{\mathsf T}\mathbf x+b_{0}), \, \dots, \, \cos(\bm\omega_{S}^{\mathsf T}\mathbf x+b_{S}),\, \sin(\bm\omega_{S}^{\mathsf T}\mathbf x+b_{S}) \bigr]\,. \end{equation} Then \(\mathbb E[\hat{\boldsymbol\phi}_{\text{RFF}}(\mathbf x)^{\mathsf T} \hat{\boldsymbol\phi}_{\text{RFF}}(\mathbf x')] =K(\mathbf x,\mathbf x')\) and the Monte–Carlo error decays as \(\mathcal O(S^{-1/2})\) (Rahimi & Recht, 2007). A Bayesian linear model with these features retains closed form predictive moments and reduces training from \(\mathcal O(N^{3})\) to \(\mathcal O(NS^{2})\).
Bach (2017) shows that, for kernels with spectral density bounded away from zero and infinity, an RFF GP with \(S=\mathcal O(d\,\epsilon^{-2}\log d)\) features approximates the full GP within \(\epsilon\) in operator norm with high probability. Burt & Rasmussen (2020) further proves that the marginal‐ likelihood error decays as \(\mathcal O(S^{-1})\).
Kronecker and circulant structures can reduce the cost of sampling and multiplication with \(\hat{\boldsymbol\phi}(\mathbf x)\) from \(\mathcal O(S)\) to \(\mathcal O(\log S)\) without degrading accuracy (Le et al., 2013; Wilson & Nickisch, 2015). Tensor‐train and Gaussian‐quadrature features extend RFFs to Matérn kernels and low‐dimensional manifolds (Novikov & Izmailov, 2018).
If the spectral measure \(\tau\) is replaced by the empirical distribution on a subset of training points, the Bayesian linear surrogate recovers Nyström inducing‐point GPs (Section 2.3.4.1). Placing an independent AR(1) hyper‐prior on the weights yields the RVM sparsity‐promoting posterior (Tipping, 2001). Hence, random-feature GPs unify earlier sparse approaches under a single linear–regression view.
With \(S\ll N\) the dominant cost of Bayesian random‐feature regression is the inversion of the \(S\times S\) weight covariance, \(\mathcal O(S^{3})\), plus \(\mathcal O(NS^{2})\) to compute the design matrix \(\hat{\boldsymbol\Phi}(\mathbf X)\). Stochastic optimisation and conjugate gradients reduce both costs to \(\mathcal O(BS)\) per iteration, making GPs feasible for \(N\gtrsim 10^{6}\) on commodity GPUs.
Dual Interpretation of Gaussian Processes Via Gaussian Measures
This section takes the familiar idea of a Gaussian process (GP) and translates it into the language of functional analysis. After briefly reminding the reader what inner‑product and Hilbert spaces are, it shows how every GP can be seen as a special kind of Gaussian measure living in an infinite‑dimensional function space. Thinking in measures rather than kernels may feel abstract, but it pays off later: it allows the thesis to borrow powerful operator‑theoretic tools and provides a unifying viewpoint for the variational approach in Chapter 4. In short, this section builds the mathematical “bridge” that the rest of the thesis will repeatedly cross whenever it needs to move between the concrete world of kernels and the more flexible world of measures.
Throughout this thesis, the language of functional analysis—most notably, the theory of Hilbert spaces—plays a central role. This section collects the essential concepts and clarifies the notation used henceforth. Standard references include (Conway, 1990; Reed & Simon; 1980; Rudin, 1991). For further information on abstract Wiener spaces and their construction, refer to (Bogachev, 1998; Gross; 1967; Kuo, 2006; Van Neerven et al., 2010).
Definition 2.12 Inner–product spaces. A (real) inner–product space is a vector space \((\mathcal V,+,\cdot)\) over \(\mathbb R\) equipped with a bilinear form \(\langle\cdot,\cdot\rangle_{\mathcal V}:\mathcal V\times\mathcal V\to\mathbb R\) such that, for all \(u,v\in\mathcal V\) and \(\lambda\in\mathbb R\),
symmetry: \(\langle u,v\rangle_{\mathcal V} =\langle v,u\rangle_{\mathcal V}\);
positive definiteness: \(\langle v,v\rangle_{\mathcal V}\ge 0\) with equality if and only if \(v=0\).
The inner product induces the norm \(\|v\|_{\mathcal V}:=\sqrt{\langle v,v\rangle_{\mathcal V}}\) and the associated metric \(d(u,v):=\|u-v\|_{\mathcal V}\).
Definition 2.13 Hilbert spaces. A Hilbert space \((\mathcal H,\langle\cdot,\cdot\rangle_{\mathcal H})\) is an inner–product space that is complete: every Cauchy sequence converges in the norm. Completeness ensures the existence of orthonormal bases and enables orthogonal projection onto closed subspaces, paralleling Euclidean–space geometry.
Notation.
Inner products appear as \(\langle f,g\rangle_{\mathcal H}\). When the space is clear from context, the subscript is omitted.
Norms are written \(\|f\|_{\mathcal H}:=\sqrt{\langle f,f\rangle_{\mathcal H}}\).
\(\mathcal L(\mathcal H)\) denotes the space of bounded linear operators on \(\mathcal H\); \(\mathcal L^{+}(\mathcal H)\) is the cone of positive, self–adjoint operators: \begin{equation} \mathcal{L}^{+}(\mathcal{H}) := \left\{ A:\mathcal{H} \to \mathcal{H}, \ A = A^\star\,,\; A \geq 0 \right\}. \end{equation}
Canonical examples.
\(\mathbb R^{d}\) with the Euclidean inner product \(\langle u,v\rangle=\sum_{i=1}^{d}u_{i}v_{i}\).
The sequence space \(\ell^{2}:=\{a=(a_{1},a_{2},\dots):\sum_{i}a_{i}^{2}<\infty\}\) with \(\langle a,b\rangle=\sum_{i}a_{i}b_{i}\).
The function space \(L^{2}(\mathcal X,\mu)\) consisting of (equivalence classes of) square–integrable functions with inner product \(\langle f,g\rangle=\int_{\mathcal X}f(x)g(x)\,\mathrm d\mu(x)\).
Definition 2.14 Reproducing–kernel Hilbert spaces. A Hilbert space \(\mathcal H\subset\mathbb R^{\mathcal X}\) of functions on \(\mathcal X\) is a reproducing–kernel Hilbert space (RKHS) if the point–evaluation functional \(\delta_{\mathbf x}:f\mapsto f(\mathbf x)\) is continuous for every \(\mathbf x\in\mathcal X\). By the Riesz representation theorem, there exists a unique element \(\phi_{\mathbf x}\in\mathcal H\) such that \(f(\mathbf x)=\langle\phi_{\mathbf x},f\rangle_{\mathcal H}\); the kernel \(K(\mathbf x,\mathbf x') :=\langle\phi_{\mathbf x},\phi_{\mathbf x'}\rangle_{\mathcal H}\) is symmetric, positive–definite, and reproduces \(\mathcal H\).
By Moore–Aronszajn theorem (Aronszajn, 1950), if \(K\) is a positive definite kernel on \(\mathcal{X}\), then, there exists a unique Hilbert space of functions on \(\mathcal{H}\) for which \(K\) is a reproducing kernel. More precisely, let \(\mathcal{H}_0(\mathcal{X})\) be the linear span of \(K\) on \(\mathcal{X}\) defined as \begin{equation} \mathcal{H}_0(\mathcal{X}) := \Big\{\sum_{i=1}^n a_i K(\cdot, \mathbf{x}_i) :\, n \in \mathbb{N}, \ a_i \in \mathbb{R}, \ \mathbf{x}_i \in \mathcal{X}\Big\}\,. \end{equation} By the Moore–Aronszajn theorem, the completion of \(\mathcal H_0(\mathcal X)\) with respect to \begin{equation} \Big\langle \sum_{i=1}^n a_i\,K(\,\cdot\,,\mathbf x_i),\ \sum_{j=1}^m b_j\,K(\,\cdot\,,\mathbf x'_j)\Big\rangle_{\mathcal H_0} := \sum_{i=1}^n\sum_{j=1}^m a_i b_j\, K(\mathbf x_i,\mathbf x'_j)\,, \end{equation} is a Hilbert space \(\mathcal H\) verifying the reproducing property with \(\phi_\mathbf{x} = K(\cdot, \mathbf{x}), \ \forall \mathbf{x} \in \mathcal{X}\).
Definition 2.15 Real Banach space. Let \(\mathcal X\) be a set, then, a Banach space of functions is a pair \((\mathcal{B},\|\cdot\|_{\mathcal{B}})\) such that
\(\mathcal{B}\) is a vector space of (equivalence classes of) functions \(f:\mathcal X \to\mathbb{R}\).
\(\|\cdot\|_{\mathcal{B}}:\mathcal{B}\to[0,\infty)\) is a norm, i.e.,
Positive definiteness: \(\|f\|_{\mathcal{B}}=0 \iff f=0\);
Homogeneity: \(\|\lambda f\|_{\mathcal{B}} = |\lambda|\,\|f\|_{\mathcal{B}}\) for all \(\lambda\in\mathbb{R}\) and \(f\in\mathcal{B}\);
Triangle inequality: \(\|f+g\|_{\mathcal{B}} \le \|f\|_{\mathcal{B}}+\|g\|_{\mathcal{B}}\) for all \(f,g\in\mathcal{B}\);
\(\bigl(\mathcal{B},\|\cdot\|_{\mathcal{B}}\bigr)\) is complete: every Cauchy sequence \((f_n)_{n\in\mathbb{N}}\subset\mathcal{B}\) converges in norm to some \(f\in\mathcal{B}\), i.e. \(\displaystyle\lim_{n\to\infty}\|f_n-f\|_{\mathcal{B}}=0\).
It is worth mentioning that every Hilbert space is a Banach space.
In any separable (contains a countable, dense subset) Banach or Hilbert space, every Borel probability measure is a Radon measure (i.e., inner regular by compact sets and outer regular by open sets), which guarantees that probability mass can be approximated arbitrarily well by compact “core” sets—an essential property when approximating expectations numerically or proving convergence theorems (Bogachev, 2007). The formal definition of Radon measures is skipped, as it is unnecessary for the discussion of this thesis, where every Banach and Hilbert space is assumed to be separable.
Definition 2.16 Gaussian measure, (Stroock, 2010). A Borel probability measure \(P\) on a Hilbert space \(\mathcal H\) is Gaussian with mean \(\mu\in\mathcal H\) and covariance operator \(\Sigma\colon\mathcal H \to\mathcal H\) if every continuous linear functional \(L:\mathcal H\to \mathbb{R}\) satisfies \(L_{\sharp}P=\mathcal N(L\mu,\,L\Sigma L^{\ast})\). Where \(L_{\#}P\) denotes the push-forward (image) measure of \(P\) under the map \(L\). Formally, for any Borel set \(A\subset\mathbb{R}\), \begin{equation} (L_{\#}P)(A)=P\big(L^{-1}(A)\big). \end{equation} Thus, the statement \(L_{\#}P=\mathcal{N}(L\mu,\,L\Sigma L^{\ast})\) means that \(L(F)\) is Gaussian with mean \(L\mu\) and variance \(L\Sigma L^{\ast}\) whenever \(F\sim P\).
Informally, the above definition states that applying \(L\) to a random draw \(F \sim P\); the result is a real Gaussian random variable. Requiring \(L_\sharp P\) to be Gaussian for every such \(L: \mathcal{H}\to \mathbb{R}\) forces the entire random element to be “Gaussian in all directions”, exactly the infinite-dimensional analogue of a multivariate normal vector. A Gaussian random element should have finite second moment (which equals the trace of the covariance operator (Da Prato & Zabczyk, 2014, Proposition 2.16)): \begin{equation} \mathbb{E}[\|F\|_{\mathcal H}^2] = \mathrm{tr}(\Sigma)\,. \end{equation} If \(\mathrm{tr}(\Sigma) = \infty\) the “expected norm” blows up, then no \(\sigma\)-additive probability measure on \(\mathcal H\) can realise those one-dimensional Gaussians simultaneously. Bogachev’s theorem (Bogachev, 1998) therefore states that a centred Gaussian Radon measure \(\gamma=\mathcal N(0,\Sigma)\) exists if and only if the covariance operator satisfies \(\mathrm{tr}(\Sigma)<\infty\). Intuitively, trace-class requires the eigenvalues of \(\Sigma\) to decay fast enough so that the random element does not wander off to infinity along an ever-increasing number of orthogonal directions. In finite dimensions, any symmetric, positive-definite matrix defines a bona-fide Gaussian probability measure via its quadratic form. The situation changes dramatically in an infinite–dimensional Banach space \(\mathcal B\). Many natural covariance forms, such as the identity or other non–compact operators, fail this trace–class test; yet their finite coordinate projections remain perfectly meaningful. One therefore introduces the notion of a cylindrical Gaussian measure: a finitely additive set function on the algebra generated by linear functionals \(\{h\mapsto\langle e_{i},h\rangle_{H}\}_{i\in\mathbb N}\) that assigns multivariate normal laws to every finite sub-collection (Bogachev, 1998). Cylindrical Gaussians always exist—no trace-class is required—and serve as indispensable “weak” probability objects when modelling phenomena such as space–time white noise or when working with unbounded covariance operators. They provide the correct finite‐dimensional distributions while acknowledging that no \(\sigma\)–additive extension to the whole Borel \(\sigma\)–algebra of \(\cal H\) is possible in the non–trace-class regime.
Definition 2.17 Cylindrical \(\sigma\)–algebra, (Bogachev, 1998). Let \((\mathcal H,\langle\cdot,\cdot\rangle)\) be a (real) Hilbert space. For each \(h\in\mathcal H\) define the coordinate map \(L_h:\mathcal H\to\mathbb R\) by \(L_h(x)=\langle x,h\rangle\). The cylindrical \(\sigma\)–algebra on \(\mathcal H\), written \(\mathscr C(\mathcal H)\), is the smallest \(\sigma\)–algebra that makes all \(L_h\) measurable; equivalently, \begin{equation} \mathscr C(\mathcal H) =\sigma\Bigl\{ (L_{h_1},\ldots,L_{h_k})^{-1}(B) :\ k\in\mathbb N,\ h_1,\ldots,h_k\in\mathcal H,\ B\in\mathscr B(\mathbb R^k) \Bigr\}. \end{equation} Sets of the form \((L_{h_1},\ldots,L_{h_k})^{-1}(B)\) are called cylinders; they depend only on finitely many linear coordinates of \(\mathcal H\).
Definition 2.18 Cylindrical Gaussian measure. Let \(\mathcal H\) be a separable Hilbert space, \(m \in \mathcal{H}\), and \(S\in \mathcal{L}^+(\mathcal H)\) a bounded, self–adjoint, non–negative operator. Denote by \(\mathscr C(\mathcal H)\) the cylindrical \(\sigma\)–algebra. A set function \(P:\mathscr C(\mathcal H)\to[0,1]\) is called a cylindrical Gaussian measure with mean \(m\) and covariance operator \(S\), \(\operatorname{Gauss}_{\mathrm{cyl}}(m,S)\), if, for every finite family of vectors \(h_{1},\dots,h_{k}\in \mathcal H\), one has \begin{equation} (L_{h_{1}},\dots,L_{h_{k}})_{\sharp}P \;=\; \mathcal N\bigl(\bigl(\langle m,h_{1}\rangle,\dots,\langle m,h_{k}\rangle\bigr), [\langle h_{j},S(h_{i})\rangle]\bigr)\,. \end{equation}
From Gaussian Processes to Gaussian Measures
Here, the construction transitions from the object most practitioners know—a GP defined by a mean function and a kernel—to its measure-theoretic avatar. A cylindrical Gaussian distribution in the underlying Hilbert space is built. The takeaway is practical: once you have a kernel, you automatically have a measure, so all the machinery of Gaussian measures becomes available for analysing or approximating your model later on.
The operator viewpoint adopted in this chapter allows us to move back and forth between the kernel definition of a Gaussian process and the measure–theoretic definition that underlies the theory of random elements in Banach spaces. Consider the zero–mean case and show how the kernel alone determines a cylindrical Gaussian distribution in the RKHS.
Theorem 2.19. Let \(K:\mathcal X\times\mathcal X\to\mathbb R\) be a positive-definite kernel, \(\mathcal H\) its RKHS, and \(\phi_x:=K(\,\cdot\,,x)\in\mathcal H\). Define the centred cylindrical Gaussian measure \begin{equation} P_{\textnormal{cyl}} \;=\; \operatorname{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal{H}}) \quad\text{on the cylindrical } \sigma\text{-algebra }\mathscr C(\mathcal H). \end{equation} For every finite index set \(X=\{x_{1},\dots,x_{m}\}\subset\mathcal X\), \begin{equation} (\delta_{x_1},\dots,\delta_{x_m})_{\#}P_{\textnormal{cyl}} \;=\; \mathcal N\bigl(0,\,K(X,X)\bigr)\,, \end{equation} so \(P_{\textnormal{cyl}}\) reproduces exactly the finite-dimensional marginals of the process \(f\sim\mathcal{GP}(0,K)\).
Proof
For each \(x\in\mathcal X\), the evaluation map \(\delta_x:\mathcal H\to\mathbb R\), \(\delta_x(h)=h(x)\), is a continuous linear functional on \(\mathcal H\) because, by the RKHS property with \(\phi_x:=K(\cdot,x)\), \begin{equation} \delta_x(h)=\langle h,\phi_x\rangle_{\mathcal H} \quad\text{and}\quad |\delta_x(h)|\le \|h\|_{\mathcal H}\,\|\phi_x\|_{\mathcal H} = \|h\|_{\mathcal H}\,\sqrt{K(x,x)}. \end{equation} Hence \(\delta_x=L_{\phi_x}\). For a finite set \(X=\{x_1,\dots,x_m\}\), by the push-forward definition and the previous display, \begin{equation} (\delta_{x_1},\dots,\delta_{x_m})_{\sharp}P_{\mathrm{cyl}} = \mathcal N(0,\,[\langle \phi_{x_j},\,\phi_{x_i}\rangle_{\mathcal H}]_{i,j=1}^m). \end{equation} The reproducing property yields \(\langle \phi_{x_j},\,\phi_{x_i}\rangle_{\mathcal H}=K(x_i,x_j)\), so \begin{equation} (\delta_{x_1},\dots,\delta_{x_m})_{\sharp}P_{\mathrm{cyl}} \;=\; \mathcal N\bigl(0,\,K(X,X)\bigr). \end{equation} Therefore \(P_{\mathrm{cyl}}\) reproduces exactly the finite-dimensional marginals of the Gaussian process \(f\sim\mathcal{GP}(0,K)\), as claimed. □
The measure \(P_{\textnormal{cyl}}\) lives only on the cylindrical \(\sigma\)-algebra generated by finite collections of continuous linear functionals on \(\mathcal{H}\); it need not be countably additive on all Borel sets of \(\mathcal{H}\). It verifies that \(I_\mathcal{H} \in L^{+}(\mathcal H)\), but its trace is not finite when \(\mathrm{dim}\ \mathcal{H} = \infty\). As a result, since \(I_\mathcal{H} \notin L^{+}_{\operatorname{tr}}(\mathcal H)\) (is not trace-class), \(P_{\textnormal{cyl}}\) remains purely cylindrical unless \(\mathcal{H}\) is embedded into a larger Banach space (Gross, 1967).
Theorem 2.20. Let \(\bigl(\mathcal H,\langle\cdot,\cdot\rangle_{\mathcal H}\bigr)\) be a separable infinite-dimensional Hilbert space. Then there exist:
a separable Banach space \((\mathcal B,\|\cdot\|_{\cal{B}})\) that contains \(\mathcal H\) as a dense subspace under \(\|\cdot\|_{\cal{B}}\),
a continuous, injective embedding \(i:\mathcal H\hookrightarrow \mathcal B\) such that the push-forward of the canonical cylindrical Gaussian on \(\mathcal H\) with covariance \(I_{\mathcal H}\), \(\operatorname{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H})\), under \(i\) extends uniquely to a centred Gaussian Radon probability measure \(\gamma = \mathcal N_{\mathcal B}(0,J)\) on \(\mathcal B\) with covariance operator \begin{equation} J=i\circ i^{\ast}:\mathcal B^{\ast} \longrightarrow \mathcal B, \end{equation} where \(i^{\ast}:\mathcal B^{\ast}\to\mathcal H\) is the adjoint operator of \(i\).
such that the triple \((\mathcal B,\gamma,\mathcal H)\) is an abstract Wiener space; that is, \(\mathcal H\) is the Cameron–Martin space of \(\gamma\).
Proof
Step 1: Construct a larger (Hilbert) Banach space and the embedding. Let \((e_k)_{k\ge 1}\) be an orthonormal basis of \(\mathcal H\). Choose weights \((\beta_k)_{k\ge 1}\) with \(\beta_k\ge 1\) and \begin{equation} \label{eq:aws-beta-summable-new} \sum_{k=1}^{\infty}\beta_k^{-2}\;<\;\infty, \quad \text{for example}, \ \beta_k = k\,. \end{equation} Define, for \(h\in\mathcal H\), \begin{equation} \label{eq:aws-B-norm-new} \|h\|_{\mathcal B}^{2} \;:=\;\sum_{k=1}^{\infty}\beta_k^{-2}\,|\langle h,e_k\rangle_{\mathcal H}|^{2}. \end{equation} Since \(\beta_k^{-2}\le 1\), there holds \(\|h\|_{\mathcal B}\le \|h\|_{\mathcal H}\) for all \(h\in\mathcal H\); hence, the identity map \begin{equation} \label{eq:aws-id-cont-new} \mathrm{id}:(\mathcal H,\|\cdot\|_{\mathcal H})\longrightarrow (\mathcal H,\|\cdot\|_{\mathcal B}) \end{equation} is continuous. Let \(\mathcal B\) be the completion of \(\mathcal H\) with respect to \(\|\cdot\|_{\mathcal B}\). The sesquilinear form \begin{equation} \label{eq:aws-B-inner-new} \langle h,g\rangle_{\mathcal B} \;:=\;\sum_{k=1}^{\infty}\beta_k^{-2}\,\langle h,e_k\rangle_{\mathcal H}\,\langle g,e_k\rangle_{\mathcal H}, \qquad h,g\in\mathcal H, \end{equation} extends by continuity to an inner product on \(\mathcal B\); therefore \(\mathcal B\) is a separable Hilbert space (in particular, a Banach space), and \(\mathcal H\) is dense in \(\mathcal B\). Let \begin{equation} \label{eq:aws-embed-new} i:\mathcal H\hookrightarrow\mathcal B \end{equation} be the canonical inclusion. From Equation \(\eqref{eq:aws-B-norm-new}\), \begin{equation} \label{eq:aws-HS-norm-new} \|i(e_k)\|_{\mathcal B}^{2}\;=\;\beta_k^{-2}, \qquad \sum_{k=1}^{\infty}\|i(e_k)\|_{\mathcal B}^{2}\;=\;\sum_{k=1}^{\infty}\beta_k^{-2}\;<\;\infty, \end{equation} so \(i\) is Hilbert–Schmidt. This will guarantee the square-summability of the Gaussian series below.
Step 2: Build a \(\mathcal B\)-valued Gaussian by an \(L^{2}\)-convergent series.
Let \((\xi_k)_{k\ge 1}\) be independent standard normal variables on some \((\Omega,\mathscr F,\mathbb P)\) and define \begin{equation} \label{eq:aws-XN-new} X_N\;:=\;\sum_{k=1}^{N}\xi_k\,i(e_k)\quad\in\mathcal B, \qquad N\ge 1. \end{equation} For integers \(M<N\), by the bilinearity of \(\langle\cdot,\cdot\rangle_{\mathcal B}\) and their independence, \begin{align} \label{eq:aws-Cauchy-1-new} \mathbb E\|X_N-X_M\|_{\mathcal B}^{2} &= \mathbb E\Big\langle \sum_{k=M+1}^{N}\xi_k\,i(e_k),\ \sum_{j=M+1}^{N}\xi_j\,i(e_j)\Big\rangle_{\mathcal B} \\ \label{eq:aws-Cauchy-2-new} &= \sum_{k=M+1}^{N}\sum_{j=M+1}^{N}\mathbb E[\xi_k\xi_j]\ \langle i(e_k),i(e_j)\rangle_{\mathcal B} \\ \label{eq:aws-Cauchy-3-new} &= \sum_{k=M+1}^{N}\|i(e_k)\|_{\mathcal B}^{2} \;=\; \sum_{k=M+1}^{N}\beta_k^{-2}. \end{align} Equation \(\eqref{eq:aws-Cauchy-2-new}\) utilises the linearity of expectation; Equation \(\eqref{eq:aws-Cauchy-3-new}\) uses \(\mathbb E[\xi_k\xi_j]=\delta_{kj}\). By Equation \(\eqref{eq:aws-beta-summable-new}\), the right-hand side tends to \(0\) as \(M,N\to\infty\). Thus \((X_N)\) is Cauchy in \(L^{2}(\Omega;\mathcal B)\); hence, there exists a \(\mathcal B\)-valued random element \(X\) with \begin{equation} \label{eq:aws-X-limit-new} X_N\xrightarrow[N\to\infty]{L^{2}(\Omega;\mathcal B)} X. \end{equation} Define the probability measure \begin{equation} \label{eq:aws-gamma-def-new} \gamma \;:=\; \mathbb P\circ X^{-1} \end{equation} on the Borel \(\sigma\)-algebra of \(\mathcal B\). Since \(X\) is an \(L^{2}\)-limit of finite Gaussian sums, \(\gamma\) is a centred Gaussian probability measure on \(\mathcal B\). To justify this, observe that for every \(\ell\in \cal{B}^{\ast}\) we have \begin{equation} \ell(X_N)=\sum_{k=1}^{N}\xi_{k}\,\ell(i(e_{k})), \end{equation} a finite linear combination of independent \(\mathcal N(0,1)\) variables; hence, a centred real Gaussian. Since \(\ell\) is continuous, \begin{equation} \mathbb E\left|\ell(X_N)-\ell(X)\right|^2 \le \|\ell\|^2\,\mathbb E\|X_N-X\|^2 \xrightarrow[N\to\infty]{} 0, \end{equation} so \(\ell(X_N)\to\ell(X)\) in \(L^2(\Omega)\). Moreover, with \(a_k:=\ell(i(e_k))\) we have \begin{equation} \sum_{k\ge1} a_k^2 =\sum_{k\ge1} |\ell(i(e_k))|^2 \le \|\ell\|^2\sum_{k\ge1}\|i(e_k)\|_{\mathcal B}^2 =\|\ell\|^2\sum_{k\ge1}\beta_k^{-2}<\infty, \end{equation} so \(\ell(X)=\sum_{k\ge1} a_k\xi_k\) in \(L^2\), hence \(\ell(X)\sim\mathcal N(0,\sum a_k^2)\). By the standard characterization (a \(\cal B\)–valued law is Gaussian if and only if all its continuous linear functionals are real Gaussian), the push-forward \(\gamma=\mathbb P\circ X^{-1}\) is a centred Gaussian probability measure on \(\cal{B}\).
Step 3: Identify the covariance operator as \(J=i\,i^{\ast}:\mathcal B\to\mathcal B\).
Let \(i^{\ast}:\mathcal B\to\mathcal H\) denote the Hilbert adjoint of \(i\), that is, \begin{equation} \label{eq:aws-adjoint-def-new} \langle i (h),\, v\rangle_{\mathcal B} \;=\; \langle h,\, i^{\ast}(v)\rangle_{\mathcal H} \qquad\text{for all }h\in\mathcal H,\ v\in\mathcal B. \end{equation} Fix \(v,w\in\mathcal B\). Using Equation \(\eqref{eq:aws-X-limit-new}\) and the Cauchy–Schwarz inequality in \(L^{2}\), the mapping \(Y\mapsto \mathbb E[\langle Y,v\rangle_{\mathcal B}\langle Y,w\rangle_{\mathcal B}]\) is continuous for \(L^{2}\)-convergence; hence, \begin{equation} \label{eq:aws-cov-limit-pass-new} \mathbb E\bigl[\langle X,v\rangle_{\mathcal B}\,\langle X,w\rangle_{\mathcal B}\bigr] \;=\; \lim_{N\to\infty}\mathbb E\bigl[\langle X_N,v\rangle_{\mathcal B}\,\langle X_N,w\rangle_{\mathcal B}\bigr]. \end{equation} Expanding \(X_N\) from Equation \(\eqref{eq:aws-XN-new}\) and using bilinearity, \begin{align} \label{eq:aws-cov-expand-new} \mathbb E\bigl[\langle X_N,v\rangle_{\mathcal B}\,\langle X_N,w\rangle_{\mathcal B}\bigr] &= \mathbb E\Bigl[\Bigl\langle \sum_{k=1}^{N}\xi_k i(e_k),\, v\Bigr\rangle_{\mathcal B} \Bigl\langle \sum_{j=1}^{N}\xi_j i(e_j),\, w\Bigr\rangle_{\mathcal B}\Bigr] \\ \label{eq:aws-cov-double-new} &= \sum_{k=1}^{N}\sum_{j=1}^{N} \mathbb E[\xi_k\xi_j]\, \langle i(e_k),v\rangle_{\mathcal B}\,\langle i(e_j),w\rangle_{\mathcal B} \\ \label{eq:aws-cov-collapse-new} &= \sum_{k=1}^{N} \langle i(e_k),v\rangle_{\mathcal B}\,\langle i(e_k),w\rangle_{\mathcal B}. \end{align} Equation \(\eqref{eq:aws-cov-double-new}\) uses the linearity of expectation; Equation \(\eqref{eq:aws-cov-collapse-new}\) uses \(\mathbb E[\xi_k\xi_j]=\delta_{kj}\). Now use the definition of the adjoint, Equation \(\eqref{eq:aws-adjoint-def-new}\), to rewrite \begin{equation} \label{eq:aws-adjoint-step-new} \langle i(e_k),v\rangle_{\mathcal B}\;=\;\langle e_k,\,i^{\ast}(v)\rangle_{\mathcal H}. \end{equation} Substituting Equation \(\eqref{eq:aws-adjoint-step-new}\) into Equation \(\eqref{eq:aws-cov-collapse-new}\) and letting \(N\to\infty\), the Parseval identity in \(\mathcal H\) yields \begin{align} \label{eq:aws-parseval-new} \mathbb E\bigl[\langle X,v\rangle_{\mathcal B}\,\langle X,w\rangle_{\mathcal B}\bigr] &= \sum_{k=1}^{\infty} \langle e_k,\,i^{\ast}(v)\rangle_{\mathcal H}\,\langle e_k,\,i^{\ast}(w)\rangle_{\mathcal H} \\ \label{eq:aws-cov-J-new} &= \langle i^{\ast}(v),\,i^{\ast}(w)\rangle_{\mathcal H} \;=\; \langle i\,(i^{\ast}(v)),\, w\rangle_{\mathcal B}. \end{align} Thus, the covariance operator of \(\gamma\) is \begin{equation} \label{eq:aws-J-def-new} J\;=\;i\circ i^{\ast}:\mathcal B\longrightarrow\mathcal B. \end{equation}
Step 4: \(J\) is trace class; explicit diagonalisation.
Define \begin{equation} \label{eq:aws-gk-def-new} g_k\;:=\;\beta_k\,i(e_k)\in\mathcal B. \end{equation} From Equation \(\eqref{eq:aws-B-inner-new}\) and Equation \(\eqref{eq:aws-HS-norm-new}\), \begin{equation} \label{eq:aws-gk-onb-new} \langle g_k,g_j\rangle_{\mathcal B} \;=\;\beta_k\beta_j\,\langle i(e_k),i(e_j)\rangle_{\mathcal B} \;=\;\beta_k\beta_j\,\beta_k^{-2}\delta_{kj} \;=\;\delta_{kj}, \end{equation} so \((g_k)_{k\ge 1}\) is an orthonormal basis of \(\mathcal B\). Moreover, for \(e_k\in\mathcal H\), \begin{equation} \label{eq:aws-iast-i-onH-new} \langle (i^{\ast}\circ i)e_k,\,e_j\rangle_{\mathcal H} \;=\;\langle i(e_k),\,i(e_j)\rangle_{\mathcal B} \;=\;\beta_k^{-2}\delta_{kj}, \end{equation} hence \((i^{\ast}\circ i)e_k=\beta_k^{-2}e_k\). Using Equation \(\eqref{eq:aws-J-def-new}\), \begin{align} \label{eq:aws-J-on-gk-new} J (g_k) &= i\circ i^{\ast}\bigl(\beta_k i(e_k)\bigr) \;=\;\beta_k\, i\bigl((i^{\ast}\circ i)e_k\bigr) \;=\;\beta_k\, i(\beta_k^{-2}e_k) \;=\;\beta_k^{-2}\,g_k. \end{align} Therefore \(J\) is diagonal in the orthonormal basis \((g_k)\) with eigenvalues \((\beta_k^{-2})\), and \begin{equation} \label{eq:aws-traceJ-new} \operatorname{tr}J \;=\;\sum_{k=1}^{\infty}\langle J (g_k),\,g_k\rangle_{\mathcal B} \;=\;\sum_{k=1}^{\infty}\beta_k^{-2} \;<\;\infty, \end{equation} so \(J\) is trace class. In particular, \(\gamma=\mathcal N_{\mathcal B}(0,J)\) is a centred Gaussian probability measure on the separable Hilbert space \(\mathcal B\), hence a Radon measure (every Borel probability on a separable Hilbert space is Radon).
Step 5: The push-forward of the cylindrical Gaussian equals \(\gamma\).
Let \(\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H})\) denote the standard cylindrical Gaussian pre-measure on \(\mathcal H\). Push it forward to \(\mathcal B\) by \(i:\mathcal H\to\mathcal B\) and set \begin{equation} \mu\;:=\;i_{\sharp}\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H}). \end{equation} a cylindrical pre-measure on \(\mathcal B\) defined on the cylindrical \(\sigma\)-algebra generated by \(\mathcal B^{\ast}\). Because \(\mathcal B\) is a Hilbert space, the Riesz map \(R_{\mathcal B}:\mathcal B\to\mathcal B^{\ast}\), \(R_{\mathcal B}(v)=\langle \,\cdot\,,v\rangle_{\mathcal B}\), is an isometric isomorphism. For each \(\ell\in\mathcal B^{\ast}\) let \(v_\ell:=R_{\mathcal B}^{-1}(\ell)\in\mathcal B\). Then for \(x\in\mathcal H\), \begin{equation} (\ell\circ i)(x)=\ell(i(x))=\langle i(x),v_\ell\rangle_{\mathcal B} =\langle x,\,i^{\ast}(v_\ell)\rangle_{\mathcal H}. \end{equation} Thus \(L_\ell:=\ell\circ i\in\mathcal H^{\ast}\) is represented by \(h_\ell:=i^{\ast}(v_\ell)\in\mathcal H\). Fix \(\ell_1,\dots,\ell_m\in\mathcal B^{\ast}\). By definition of \(\mu\) and of the cylindrical Gaussian on \(\mathcal H\), \begin{equation} \label{eq:aws-step5-mu-marg} (\ell_1,\dots,\ell_m)_{\sharp}\mu \;=\;(L_{\ell_1},\dots,L_{\ell_m})_{\sharp}\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H}) \;=\;\mathcal N\bigl(0,\,[\,\langle h_{\ell_p},h_{\ell_q}\rangle_{\mathcal H}\,]_{p,q}\bigr). \end{equation} By Step 2, for every \(\ell\in\mathcal B^{\ast}\), \begin{equation} \ell(X)=\sum_{k\ge1}\xi_k\,\ell(i(e_k))\quad\text{in }L^2(\Omega), \qquad \sum_{k\ge1}|\ell(i(e_k))|^2 \le \|\ell\|^2\sum_{k\ge1}\beta_k^{-2}<\infty, \end{equation} so \(\ell(X)\) is centred Gaussian with variance \(\sum_{k\ge1}\ell(i(e_k))^2\). For \(\ell_p,\ell_q\in\mathcal B^{\ast}\), \begin{equation} \mathbb E[\ell_p(X)\,\ell_q(X)] \;=\;\sum_{k\ge1}\ell_p(i(e_k))\,\ell_q(i(e_k)) \;=\;\sum_{k\ge1}\langle e_k,\,h_{\ell_p}\rangle_{\mathcal H}\,\langle e_k,\,h_{\ell_q}\rangle_{\mathcal H} \;=\;\langle h_{\ell_p},h_{\ell_q}\rangle_{\mathcal H}, \end{equation} where we used \(\ell(i(e_k))=\langle i(e_k),v_\ell\rangle_{\mathcal B}=\langle e_k,i^{\ast}(v_\ell)\rangle_{\mathcal H}\) and Parseval in \(\mathcal H\). Hence \begin{equation} \label{eq:aws-step5-gamma-marg} (\ell_1,\dots,\ell_m)_{\sharp}\gamma=\mathcal N\bigl(0,\,[\,\langle h_{\ell_p},h_{\ell_q}\rangle_{\mathcal H}\,]_{p,q}\bigr). \end{equation} Comparing Equations \(\eqref{eq:aws-step5-mu-marg}\) and \(\eqref{eq:aws-step5-gamma-marg}\) shows that \(\mu\) and \(\gamma\) agree on all finite-dimensional distributions determined by \(\mathcal B^{\ast}\). Let \(\mathcal C(\mathcal B^{\ast})\) be the cylindrical \(\sigma\)-algebra generated by \(\{\ell:\mathcal B\to\mathbb R:\ell\in\mathcal B^{\ast}\}\). The previous paragraph implies \(\mu=\gamma\) on \(\mathcal C(\mathcal B^{\ast})\). Since \(\mathcal B\) is separable, \(\mathcal C(\mathcal B^{\ast})=\mathscr B(\mathcal B)\) (Bogachev, 1998; Kuo, 2006); hence, \(\mu=\gamma\) on \(\mathscr B(\mathcal B)\).
Step 6: Identify the Cameron–Martin space.
Let \(J^{1/2}\) be the unique positive square root of \(J=i\,i^{\ast}\). The Cameron–Martin space \(\mathcal H_{\gamma}\) is \(\mathrm{Ran}\,J^{1/2}\) with the inner product \begin{equation} \label{eq:aws-CM-inner-new} \langle u,v\rangle_{\mathcal H_{\gamma}} \;:=\;\langle J^{-1/2}(u),\,J^{-1/2}(v)\rangle_{\mathcal B}. \end{equation} Using Equation \(\eqref{eq:aws-J-on-gk-new}\), \(J^{1/2}(g_k)=\beta_k^{-1}g_k\). For \(h=\sum_{k}a_k e_k\in\mathcal H\), \begin{equation} \label{eq:aws-i-exp-new} i(h)\;=\;\sum_{k=1}^{\infty}a_k\,i(e_k) \;=\;\sum_{k=1}^{\infty}a_k\,\beta_k^{-1}g_k, \qquad J^{-1/2}i(h)\;=\;\sum_{k=1}^{\infty}a_k\,g_k. \end{equation} Hence, for \(h,g\in\mathcal H\), \begin{equation} \label{eq:aws-CM-iso-new} \langle i(h),\,i(g)\rangle_{\mathcal H_{\gamma}} \;=\;\Bigl\langle \sum_{k}a_k g_k,\ \sum_{k}b_k g_k\Bigr\rangle_{\mathcal B} \;=\;\sum_{k=1}^{\infty}a_k b_k \;=\;\langle h,g\rangle_{\mathcal H}. \end{equation} Thus \(i:\mathcal H\to\mathcal H_{\gamma}\) is an isometric isomorphism onto its range and \begin{equation} \label{eq:aws-CM-range-new} \mathrm{Ran}\,J^{1/2}\;=\;i(\mathcal H). \end{equation} Therefore \(\mathcal H\) is the Cameron–Martin space of \(\gamma\).
The construction provides a separable Banach space \(\mathcal B\) containing \(\mathcal H\) densely, a continuous injective map \(i:\mathcal H\hookrightarrow\mathcal B\) such that \(i_{\sharp}\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H})\) extends to the centred Gaussian Radon measure \(\gamma=\mathcal N_{\mathcal B}(0,J)\) with covariance \(J=i\,i^{\ast}\), and the Cameron–Martin space of \(\gamma\) is \(\mathcal H\). This proves the theorem. □
The abstract Wiener space \((\mathcal B,\gamma,\mathcal H)\) built in Theorem 2.20 provides a universal “background”: the Radon measure \(\gamma=\mathcal N_{\mathcal B}(0,J)\) is nothing but white noise on \(\mathcal H\), now made \(\sigma\)–additive by relaxing the norm. Shifting this noise by any Cameron–Martin element and using the associated Paley-Wiener map (Bogachev, 1998) produces a Gaussian measure whose finite–dimensional marginals coincide with those of a GP.
Theorem 2.21. Let \(f \sim \mathcal{GP}\bigl(m,K\bigr)\), with \(m:\mathcal X\to\mathbb R\) and \(K:\mathcal X\times\mathcal X\to\mathbb R\). Let \(m\in\mathcal H\), where \(\mathcal H\) is the RKHS generated by \(K\) with sections \(\phi_x:=K(\,\cdot\,,x)\). Let \((\mathcal B,\gamma,\mathcal H)\) be the abstract Wiener space constructed in Theorem 2.20; write \(i:\mathcal H\hookrightarrow\mathcal B\) for the embedding and let \(J=i\circ i^{\ast}\) be its covariance operator so that \(\gamma=\mathcal N_{\mathcal B}(0,J)\). Let \(Z\sim\gamma\) and define \(F := i(m) + Z\). Then:
\(F\) is a \(\mathcal B\)-valued Gaussian random element with law \begin{equation} F \sim \mathcal N_{\mathcal B}\bigl(i(m),\;J\bigr). \end{equation}
Let \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) denote the Paley–Wiener map associated with \((\mathcal B,\gamma,\mathcal H)\), i.e. for \(h\in\mathcal H\), \(\mathcal W(h)\) is the \(\gamma\)-measurable linear functional on \(\mathcal B\) satisfying \(\mathcal W(h)\bigl(i (g)\bigr)=\langle h,g\rangle_{\mathcal H}\) for all \(g\in\mathcal H\). Then, for every finite set \(X=\{x_{1},\dots,x_{n}\}\subset\mathcal X\), \begin{equation} \bigl(\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n})\bigr)_{\sharp} \mathcal N_{\mathcal B}\bigl(i(m),\;J\bigr) \;=\; \mathcal N\bigl(m(X),\,K(X,X)\bigr). \end{equation} Equivalently, \(\bigl(\mathcal W(\phi_{x_1})(F),\dots,\mathcal W(\phi_{x_n})(F)\bigr) \sim \mathcal N\bigl(m(X),\,K(X,X)\bigr)\).
Remark. If, in addition, the point evaluations \(\delta_x\) are continuous on \(\mathcal B\) (i.e. \(\delta_x\in\mathcal B^{\ast}\) for all \(x\in\mathcal X\)), then \(\mathcal W(\phi_x)=\delta_x\) \(\gamma\)-a.s., and the display in (2) coincides with the push-forward by \((\delta_{x_1},\dots,\delta_{x_n})\).
Proof
Let \((\mathcal B,\gamma,\mathcal H)\) be the abstract Wiener space produced by Theorem 2.20. Thus \(\mathcal B\) is a separable Hilbert space, \(i:\mathcal H\hookrightarrow\mathcal B\) is continuous and injective, and \(\gamma=\mathcal N_{\mathcal B}(0,J)\) is a centred Gaussian Radon measure with covariance \(J\).
Step 1: \(F=i(m)+Z\) is Gaussian with mean \(i(m)\) and covariance \(J\).
Let \(Z\sim\gamma=\mathcal N_{\mathcal B}(0,J)\) and set \begin{equation} \label{eq:def-F} F \;:=\; i(m)+Z . \end{equation} For any \(v\in\mathcal B\), \(\langle F,v\rangle_{\mathcal B} = \langle i(m),v\rangle_{\mathcal B} + \langle Z,v\rangle_{\mathcal B}\) is the sum of a constant and a centred real Gaussian; hence \(\langle F,v\rangle_{\mathcal B}\) is real Gaussian with mean \(\langle i(m),v\rangle_{\mathcal B}\) and variance \(\langle Jv,v\rangle_{\mathcal B}\). Since this holds simultaneously for all finite families \(v_1,\dots,v_n\) by linearity, \(F\) is a \(\mathcal B\)–valued Gaussian random element with \begin{equation} \label{eq:law-F} F \sim \mathcal N_{\mathcal B}\bigl(i(m),\,J\bigr) \,, \end{equation} which proves item (1).
Step 2: Paley–Wiener map and its basic identities.
Write \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) for the Paley–Wiener map associated with \((\mathcal B,\gamma,\mathcal H)\). By definition, for every \(h,g\in\mathcal H\), \begin{equation} \label{eq:PW-reproducing} \mathcal W(h)\bigl(i(g)\bigr) \;=\; \langle h,\,g\rangle_{\mathcal H} . \end{equation} Moreover, \(\mathcal W\) is linear and isometric into \(L^{2}(\gamma)\): \begin{equation} \label{eq:PW-isometry} \mathbb E_{\gamma}\bigl[\mathcal W(h)\,\mathcal W(g)\bigr] \;=\; \langle h,\,g\rangle_{\mathcal H}, \qquad \mathbb E_{\gamma}\bigl[\mathcal W(h)\bigr]=0 . \end{equation} Since \(F\) has the translated law \(T_{i(m)\,\sharp}\gamma\), with \(T_{i(m)}(x) := x+i(m)\), and translations by \(i(m)\) with \(m\in\mathcal H\) being quasi-invariant (Cameron–Martin Theorem), \(\mathcal W(h)\) is defined \(F\)-a.s. as well. This means that any set where \(W(h)\) fails to be defined has \(\gamma\)-measure zero, and by quasi-invariance, it also has zero measure under the translated law of \(F\). Therefore, \(W(h)(F)\) exists with probability one.
Step 3: Mean and covariance of \(\mathcal W(h)(F)\).
By linearity of \(\mathcal W(h)\) and Equation \(\eqref{eq:def-F}\), \begin{equation} \label{eq:W-on-F-split} \mathcal W(h)(F) \;=\; \mathcal W(h)\bigl(i(m)\bigr) \;+\; \mathcal W(h)(Z). \end{equation} Using Equation \(\eqref{eq:PW-reproducing}\) with \(g=m\), \begin{equation} \label{eq:mean-WF} \mathbb E\bigl[\mathcal W(h)(F)\bigr] \;=\; \langle h,\,m\rangle_{\mathcal H}. \end{equation} For \(h,g\in\mathcal H\), Equation \(\eqref{eq:PW-isometry}\) and the fact that \(Z\) is centred yield \begin{equation} \label{eq:cov-WF} \operatorname{Cov}\!\big(\mathcal W(h)(F),\,\mathcal W(g)(F)\big) \;=\; \operatorname{Cov}\!\big(\mathcal W(h)(Z),\,\mathcal W(g)(Z)\big) \;=\; \langle h,\,g\rangle_{\mathcal H}. \end{equation}
Step 4: Finite-dimensional laws for the sections \(h=\phi_{x}\).
Let \(X=\{x_1,\dots,x_n\}\subset\mathcal X\) and set \(h_j:=\phi_{x_j}=K(\cdot,x_j)\in\mathcal H\). By the reproducing property of the RKHS, \begin{equation} \label{eq:RKHS-reprod} \langle \phi_{x_i},\,m\rangle_{\mathcal H} \;=\; m(x_i), \qquad \langle \phi_{x_i},\,\phi_{x_j}\rangle_{\mathcal H} \;=\; K(x_i,x_j). \end{equation} Combining Equations \(\eqref{eq:mean-WF}\), \(\eqref{eq:cov-WF}\), and \(\eqref{eq:RKHS-reprod}\) gives \begin{equation} \label{eq:mean-cov-vector} \mathbb E\!\begin{bmatrix} \mathcal W(\phi_{x_1})(F) \\[-2pt] \vdots \\[-2pt] \mathcal W(\phi_{x_n})(F) \end{bmatrix} \;=\; \begin{bmatrix} m(x_1) \\[-2pt] \vdots \\[-2pt] m(x_n) \end{bmatrix}, \qquad \operatorname{Cov}\!\begin{bmatrix} \mathcal W(\phi_{x_1})(F) \\[-2pt] \vdots \\[-2pt] \mathcal W(\phi_{x_n})(F) \end{bmatrix} \;=\; K(X,X). \end{equation} Finally, since \(F\) is Gaussian in \(\mathcal B\) and each \(\mathcal W(\phi_{x_j})\) is linear, the vector \[\bigl(\mathcal W(\phi_{x_1})(F),\dots,\mathcal W(\phi_{x_n})(F)\bigr)\] is multivariate normal with mean and covariance as in Equation \(\eqref{eq:mean-cov-vector}\). Hence \begin{equation} \label{eq:pushforward-PW} \bigl(\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n})\bigr)_{\sharp} \mathcal N_{\mathcal B}\bigl(i(m),J\bigr) \;=\; \mathcal N\bigl(m(X),\,K(X,X)\bigr), \end{equation} which proves item (2).
Remark on evaluations.
If, in addition, every point-evaluation \(\delta_x\) is continuous on \(\mathcal B\) (equivalently \(\delta_x\in\mathcal B^{\ast}\)), then the Paley–Wiener functional coincides \(\gamma\)-a.s. with evaluation: \begin{equation} \label{eq:PW-equals-eval} \mathcal W(\phi_x) \;=\; \delta_x \quad \gamma\text{-a.s.} \end{equation} In that case, Equation \(\eqref{eq:pushforward-PW}\) is the stated identity with \((\delta_{x_1},\dots,\delta_{x_n})\). □
In summary, every Gaussian process \(\mathcal{GP}(m,K)\) can be represented as a Gaussian measure \(\mathcal N_{\mathcal B}(i(m), J)\) on a Banach space obtained by embedding its RKHS and shifting the resulting white noise by the mean element \(m\).
From Gaussian Measures to Gaussian Processes
The companion subsection completes the round‑trip. Starting with nothing more than abstract “white‑noise” in a Banach space, it shows that colouring this noise with a positive operator \(\Sigma\) and optionally shifting it by a mean function \(m\) yields a random element whose finite‑dimensional evaluations follow exactly the GP distribution one would write down with the kernel \begin{equation} K_\Sigma(x,x') \;=\; \langle \varphi_x, \Sigma (\varphi_{x'}) \rangle. \end{equation} This construction is extremely flexible—tweaking \(\Sigma\) lets you generate whole families of kernels while staying within the same Hilbert geometry—and it underpins the variational families and kernel‑learning methods developed later in the thesis. Whenever Chapter 4.3 “optimizes \(\Sigma\)”, the covariance of this Gaussian measure is tuned, confident that the result is still a legitimate GP.
From now on, we begin with a Gaussian measure—nothing more than white noise on the abstract Wiener space introduced in Theorem 2.20—and show how every choice of a positive operator \(\Sigma\) and a Cameron–Martin shift \(m\) turns that measure into a fully fledged Gaussian process. The construction is extremely flexible: by varying \(\Sigma\), you obtain an entire family of covariance kernels while remaining within the same Hilbert geometry.
Theorem 2.22. Let \(K\) be a positive-definite kernel with RKHS \(\mathcal H\) (sections \(\phi_x:=K(\cdot,x)\)). Let \((\mathcal B,\gamma,\mathcal H)\) be an abstract Wiener space with continuous embedding \(i:\mathcal H\hookrightarrow\mathcal B\). Fix \(m\in\mathcal H\) and \(\Sigma\in\mathcal L^{+}(\mathcal H)\).
\(P := \mathcal N_{\mathcal B}\bigl(i(m),\; i\Sigma i^{\ast}\bigr)\) is a well-defined (Radon) Gaussian measure on \(\mathcal B\) with mean \(i(m)\) and covariance operator \(C=i\Sigma i^{\ast}\).
Let \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) denote the Paley–Wiener map associated with \((\mathcal B,\gamma,\mathcal H)\), i.e. for \(h\in\mathcal H\), \(\mathcal W(h)\) is the \(\gamma\)-measurable linear functional on \(\mathcal B\) satisfying \(\mathcal W(h)\bigl(i (g)\bigr)=\langle h,g\rangle_{\mathcal H}\) for all \(g\in\mathcal H\). Then, for every finite set \(X=\{x_{1},\dots,x_{n}\}\subset\mathcal X\), \begin{equation} \bigl(\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n})\bigr)_{\sharp} P \;=\; \mathcal N\bigl(m(X),\,K_\Sigma(X,X)\bigr). \end{equation}
Remark. If, in addition, the point evaluations \(\delta_x\) are continuous on \(\mathcal B\) (i.e. \(\delta_x\in\mathcal B^{\ast}\) for all \(x\in\mathcal X\)), then \(\mathcal W(\phi_x)=\delta_x\) \(\gamma\)-a.s., and the display in (2) coincides with the push-forward by \((\delta_{x_1},\dots,\delta_{x_n})\).
Proof
Let \((\mathcal B,\gamma,\mathcal H)\) be an abstract Wiener space with continuous injection \(i:\mathcal H\hookrightarrow\mathcal B\). Fix \(m\in\mathcal H\) and a bounded, self-adjoint, positive operator \(\Sigma\in\mathcal L^{+}(\mathcal H)\). Set \begin{equation} C \;:=\; i\,\Sigma\,i^{\ast}\,:\ \mathcal B^{\ast}\longrightarrow \mathcal B . \end{equation}
(1) Existence and form of the Gaussian measure. Because \(i\) is the abstract Wiener space embedding, it is \(\gamma\)-radonifying (Bogachev, 1998; Hytönen et al., 2018); hence \(i\Sigma^{1/2}\) is also \(\gamma\)-radonifying (Hytönen et al., 2018). Let \(\xi\) be an isonormal Gaussian process over \(\mathcal H\) (i.e. a centred Gaussian family with \(\mathbb E\langle h,\xi\rangle=0\) and \(\mathrm{Cov}(\langle h,\xi\rangle,\langle g,\xi\rangle)=\langle h,g\rangle_{\mathcal H}\)). Define the \(\mathcal B\)-valued random elements \begin{equation} Y \;:=\; (i\Sigma^{1/2})\,\xi, \qquad F \;:=\; i(m) + Y . \end{equation} Standard properties of radonifying operators (Bogachev, 1998; Hytönen et al., 2018) imply that \(Y\) is a centred Radon Gaussian in \(\mathcal B\) with covariance operator \(C=i\Sigma i^{\ast}\), so \(F\) is Gaussian Radon with mean \(i(m)\) and covariance \(C\). Equivalently, \(F\sim \mathcal N_{\mathcal B}\bigl(i(m),\,i\Sigma i^{\ast}\bigr)=:P\).
(2) GP marginals under Paley–Wiener evaluations. Let \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) be the Paley–Wiener map of the AWS; for \(h\in\mathcal H\), \(\mathcal W(h)\) is a (measurable) linear functional on \(\mathcal B\) satisfying \(\mathcal W(h)\bigl(i(g)\bigr)=\langle h,g\rangle_{\mathcal H}\) for all \(g\in\mathcal H\). (In particular, these linear functionals admit representatives that are defined \(P\)-a.s.) For the RKHS sections \(\phi_x:=K(\cdot,x)\), the reproducing property gives \begin{equation} \mathbb E\big[\mathcal W(\phi_x)(F)\big] = \mathcal W(\phi_x)\bigl(i(m)\bigr) = \langle \phi_x,m\rangle_{\mathcal H} = m(x). \end{equation} For the covariance, the shift \(i(m)\) drops out and we compute with \(Y\): \begin{equation} \mathcal W(\phi_x)(Y) = \mathcal W(\phi_x)\!\left(i\!\left(\Sigma^{1/2}\xi\right)\right) = \big\langle \phi_x,\;\Sigma^{1/2}\xi\big\rangle_{\mathcal H} = \big\langle \Sigma^{1/2}\phi_x,\;\xi\big\rangle, \end{equation} so \begin{equation} \mathrm{Cov}\!\left(\mathcal W(\phi_x)(F),\mathcal W(\phi_{x'})(F)\right) = \Big\langle \Sigma^{1/2}\phi_x,\;\Sigma^{1/2}\phi_{x'}\Big\rangle_{\mathcal H} = \big\langle \phi_x,\;\Sigma\phi_{x'}\big\rangle_{\mathcal H} =: K_{\Sigma}(x,x'). \end{equation} Therefore, for any finite \(X=\{x_1,\dots,x_n\}\subset\mathcal X\), \begin{equation} \bigl(\mathcal W(\phi_{x_1})(F),\dots,\mathcal W(\phi_{x_n})(F)\bigr) \sim \mathcal N\bigl(m(X),\,K_{\Sigma}(X,X)\bigr), \end{equation} which is exactly the GP law with kernel \(K_{\Sigma}\).
Remark (point evaluations). If, in addition, each \(\delta_x\in\mathcal B^{\ast}\), then for all \(g\in\mathcal H\), \begin{equation} \delta_x\bigl(i(g)\bigr)=g(x)=\langle \phi_x,g\rangle_{\mathcal H} =\mathcal W(\phi_x)\bigl(i(g)\bigr), \end{equation} so \(\delta_x\) and \(\mathcal W(\phi_x)\) agree on the dense subspace \(i(\mathcal H)\) and hence coincide \(\gamma\)-a.s. (and thus \(P\)-a.s.). The push-forward statement can then be written with \((\delta_{x_1},\dots,\delta_{x_n})\) in place of \((\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n}))\). □
Theorem 2.22 therefore shows that coloring abstract white noise with \(\Sigma\) and adding a shift \(m\) is sufficient to construct the familiar Gaussian process \(\mathcal{GP}\bigl(m,K_{\Sigma}\bigr)\). Observe that the kernel \begin{equation} K_{\Sigma}(x,x')=\bigl\langle\phi_x,\Sigma(\phi_{x'})\bigr\rangle_{\mathcal H} \end{equation} need not coincide with the original kernel \(K\); by varying the operator \(\Sigma\) one obtains an entire family of Gaussian processes while working inside the same Hilbert space \(\mathcal H\).
As an abuse of notation, instead of \(\mathcal{N}_{\mathcal{B}}(i(m), i\Sigma i^{\ast})\), we will use \(\mathcal{N}(f|m, \Sigma)\) to denote the Gaussian law.
Remark. The zero-mean GP prior \(\mathcal{GP}(0, K)\) is obtained from the dual formulation using \(\mathcal{N}(f|0, I)\). Furthermore, given a set of observations \((\mathbf X, \mathbf{y})\) from a regression task with Gaussian noise \(\sigma^2\), the GP posterior is obtained from the (posterior) Gaussian measure \(P(f|\mathbf{y}) = \mathcal{N}(f|\mu^\star, \Sigma^\star)\) where: \begin{align} \mu^\star & = \textstyle \sum_{i=1}^N \alpha_i \phi_{\mathbf{x}_i}\,, \\ \Sigma^\star(\phi) &= \textstyle \phi - \sum_{i=1}^N \sum_{j=1}^N \phi_{\mathbf{x}_i} \Lambda_{ij} \braket{\phi_{\mathbf{x}_j}, \phi}_\mathcal{H}\,, \end{align} with \(\bm{\Lambda} = (K(\mathbf{X}, \mathbf{X}) + \sigma^2 \bm{I})^{-1} \in \mathbb{R}^{N \times N}\) and \(\bm{\alpha} = \bm{\Lambda}\bm{y} \in \mathbb{R}^N\).
As seen in this chapter, the pair \((\mu,\Sigma)\) provides a complete parameterisation of Gaussian processes. We use \(\mathcal{P}_{\mathcal{H}}\) to denote the set of Gaussian measures: \begin{equation} \mathcal{P}_{\mathcal{H}} = \Big\{\mathcal{N}(f|\mu, \Sigma) \ : \ \mu \in \mathcal{H}, \ \Sigma \in \mathcal L^{+}(\mathcal H)\Big\}\,. \end{equation}
Probably Approximately Correct Bounds
Rigorous performance guarantees are indispensable when one seeks to provide statistical–learning algorithms with the full status of scientific tools. Within the Probably Approximately Correct (PAC) framework of Vapnik and Chervonenkis, such guarantees are delivered by PAC bounds—inequalities that hold with high probability over the random draw of the training sample and relate the true (generalization) risk of a predictor to its empirical risk estimated on finite data. The aim of this section is to give a formally self-contained introduction to PAC bounds.
Let \((\mathbf x,y)\sim \nu\) be an unknown distribution on \(\mathcal X\times\mathcal Y\), where \(\mathcal X\subset\mathbb R^{d}\) denotes the input space and \(\mathcal Y\) the label space. A predictor is a measurable map \(f_{\bm{\theta}}\colon\mathcal X\to\mathcal Y\) indexed by a parameter \(\bm{\theta}\in\bm \Theta\subseteq\mathbb R^{m}\). Performance is assessed through a bounded loss \(\ell\colon\mathcal Y^{2}\to[0,C]\), so that the (expected) risk is \begin{equation} L(\bm{\theta})=\mathbb E_{(\mathbf x,y)\sim \nu} \bigl[\ell\bigl(f_{\bm{\theta}}(\mathbf x),y\bigr)\bigr], \end{equation} while the empirical risk on a sample \(D=\{(\mathbf x_{i},y_{i})\}_{i=1}^{n}\overset{\text{i.i.d.}}{\sim}\nu^{n}\) is \begin{equation} \hat{L}(D,\bm{\theta}) =\tfrac1n\sum_{i=1}^{n}\ell\bigl(f_{\bm{\theta}}(\mathbf x_{i}),y_{i}\bigr). \end{equation} A learning algorithm is a (possibly randomized) estimator \(\hat{\bm{\theta}}=\hat{\bm{\theta}}(D)\). In empirical-risk minimization (ERM) one takes \(\hat{\bm{\theta}}_{\mathrm{ERM}} \in\argmin_{\bm{\theta}\in\bm\Theta}\hat L(D,\bm{\theta}).\)
For a finite hypothesis set \(\bm \Theta=\{\bm{\theta}_{1},\dots,\bm{\theta}_{M}\}\), Hoeffding’s inequality combined with the union bound yields, for any \(\delta\in(0,1)\), the foundational result of McAllester (1999; Seeger, 2002): \begin{equation} \mathbb{P}_{D\sim \nu^{n}}\left( L(\hat{\bm{\theta}}_{\mathrm{ERM}}) \leq \min_{\bm{\theta}\in\Theta}\ \hat L(D,\bm{\theta}) + C\sqrt{\tfrac{\log(M/\delta)}{2n}} \right) \geq 1-\delta. \end{equation} Although historically important, this bound is unsatisfactory in high‑ or infinite‑dimensional settings because the complexity term scales with the logarithm of the number of hypothesis, \(\log M\), and the argument treats ERM only implicitly via a worst‑case analysis.
Uniform Convergence for Finite Hypothesis Classes
In order to build intuition for the more sophisticated PAC–Bayesian machinery that will occupy the remainder of this chapter, the analysis begins with the classical setting in which the hypothesis class is finite. Although this special case is arguably of limited practical relevance in modern large‑scale learning, it is historically foundational and constitutes the simplest instance where one observes, in a crystalline manner, the tension between approximation and estimation.
Throughout this thesis the PAC–Bayes template “with probability at least \(1-\delta\) over the draw of the sample \(D\sim\nu^{n}\), for all \(\bm{\theta} \in \bm{\Theta}\), simultaneously ...”. will be used. Formally, this means that for any confidence level \(\delta \in (0,1)\), the corresponding event \(\text{\textbf{Event}}(\bm{\theta},D, \delta)\) will hold for every \(\bm{\theta} \in \bm{\Theta}\) unless you have drawn a “bad” data set \(D \sim \nu^n\); that is, the event holds with probability at least \(1-\delta\). Mathematically: \begin{equation*} \forall \delta \in (0,1), \quad \mathbb{P}_{D\sim \nu^n}\Big( \textstyle \bigcap \limits_{\bm{\theta} \in \bm{\Theta}} \textbf{Event}(\bm{\theta},D, \delta)\Big)\geq 1-\delta\,. \end{equation*} The same reading applies to other instances in the text where the “event” may be an inequality (e.g. \(A\le B\)) or an implication (e.g. \(A\Rightarrow B\)).
Theorem 2.23. Let \(\bm \Theta\) be a finite hypothesis class with cardinality \(M\) and let \(\ell\) be a loss bounded in \([0,C]\). Then for any \(\delta\in(0,1)\), with probability at least \(1-\delta\) over the random draw of the sample \(D\sim\nu^{n}\), the following simultaneous concentration inequality holds: \begin{equation} \forall\bm{\theta}\in\bm\Theta:\qquad L(\bm{\theta})\;\le\; \hat L(D,\bm{\theta}) +C\sqrt{\frac{\log(M/\delta)}{2n}}. \end{equation}
Proof
Since \(\ell\) takes values in \([0,C]\), the random variables \(X_{i}:=\ell\bigl(f_{\bm{\theta}}(\mathbf x_{i}),y_{i}\bigr)-L(\bm{\theta})\) are independent, centered, and bounded by \(C\). Hoeffding’s inequality (Hoeffding, 1963) therefore gives \begin{equation} \mathbb{P} \left(L(\bm{\theta})-\hat L(D,\bm{\theta})>\varepsilon \right) \leq \exp\left(-\frac{2n\varepsilon^{2}}{C^{2}}\right). \end{equation} Equivalently, with probability at least \(1-\delta_{\bm{\theta}}\), \begin{equation} L(\bm{\theta}) \;\le\; \hat L(D,\bm{\theta}) +C\sqrt{\frac{\log(1/\delta_{\bm{\theta}})}{2n}}. \label{eq:pointwise} \end{equation} Setting \(\delta_{\bm{\theta}} := \delta/M\) in Equation \(\eqref{eq:pointwise}\) and applying the union bound yields \begin{equation} \mathbb{P}\left(\exists\bm{\theta}\in\bm\Theta: L(\bm{\theta})-\hat L(D,\bm{\theta}) >C\sqrt{\frac{\log(M/\delta)}{2n}}\right) \leq \sum_{\bm{\theta}\in \bm \Theta}\frac{\delta}{M} =\delta\,. \end{equation} Taking the complement proves the first equation of the theorem. □
Figure 2.9 shows an illustration of Theorem 2.23; the left panel shows how the union–bound complexity term \(C\sqrt{\log(M/\delta)/(2n)}\) decays as \(n^{-1/2}\) (with larger \(M\) shifting the curve upward by \(\sqrt{\log M}\)), while the right panel depicts the uniform-convergence geometry in which, with probability at least \(1-\delta\), all hypotheses lie within a diagonal band of width \(C\sqrt{\log(M/\delta)/(2n)}\) around the identity line. Fix accuracy \(\varepsilon\in(0,C)\) and confidence \(\delta\in(0,1)\). From Hoeffding’s inequality \begin{equation} \mathbb{P} \left(L(\bm{\theta})-\hat L(D,\bm{\theta})>\varepsilon \right) \leq \exp\left(-\frac{2n\varepsilon^{2}}{C^{2}}\right). \end{equation} Solving for \(n\) shows that it suffices to draw \begin{equation} n \;\ge\; \frac{2C^{2}}{\varepsilon^{2}} \bigl[\log M + \log(2/\delta)\bigr] \end{equation} examples in order to guarantee \(L(\hat{\bm{\theta}}_{\mathrm{ERM}}) \leq \hat{L}(D, \hat{\bm{\theta}}_{\mathrm{ERM}})+\varepsilon\) with probability at least \(1-\delta\). Inverting the relationship highlights the logarithmic dependence on \(M\), a fact whose importance cannot be overstated: the magnitude of \(\log M\) constitutes a statistical‑complexity measure of the hypothesis class.
Despite its elegance, Theorem 2.23 suffers from two notorious shortcomings in light of contemporary machine learning:
it is vacuous in the ubiquitous scenario where \(M\) is astronomically large or even countably (or uncountably) infinite, and
it treats \(\hat{\bm{\theta}}_{\mathrm{ERM}}\) only implicitly through a worst‑case argument, thereby failing to exploit any data‑dependent structure of the empirical‑risk landscape.
These deficiencies motivate a transition from naive union‑bound reasoning to the more nuanced PAC–Bayesian paradigm—introduced independently by McAllester (1999) and Seeger (2002)—which replaces the combinatorial term \(\log M\) with an information-theoretic complexity involving the Kullback–Leibler divergence between a posterior and a prior distribution over \(\bm \Theta\).
The PAC–Bayesian Paradigm
The logarithmic‑cardinality bound of Section 2.4.1 becomes uninformative once \(|\bm \Theta|\) is large or infinite. The PAC‑Bayesian paradigm circumvents this by trading the crude union bound for an information–theoretic control. Fix any prior distribution \(\pi\in\mathcal P(\bm \Theta)\) that is chosen independently of the sample \(D\). After observing \(D\) select an arbitrary posterior \(\rho=\rho(D)\in\mathcal P(\bm \Theta)\).
Theorem 2.24 (Catoni, 2007). Let the loss satisfy \(0\le\ell(\cdot, \cdot)\le C\) and fix \(\delta\in(0,1)\). For every \(\lambda>0\) the following event holds with probability at least \(1-\delta\) over the draw of \(D\sim\nu^{n}\): \begin{equation} \forall\rho\in\mathcal P(\Theta):\quad \mathbb{E}_{\bm\theta\sim\rho}[ L(\bm\theta)] \;\le\; \mathbb{E}_{\bm\theta\sim\rho}[ \hat L(D,\bm\theta)] +\frac{\lambda C^{2}}{8n} +\frac{\mathrm{KL}(\rho|\pi)+\log(1/\delta)}{\lambda}. \end{equation}
Proof
Fix an arbitrary \(\lambda>0\). For each \(\bm{\theta}\in\Theta\) define the centered random variable \begin{equation} Z_{\bm{\theta}}(D) := \exp\left\{ \lambda\left(L(\bm{\theta})-\hat L(D,\bm{\theta})\right) -\frac{\lambda^{2}C^{2}}{8n} \right\}. \end{equation} Because \(L(\bm{\theta})\) is deterministic with respect to \(D\), it verifies that \begin{equation} L(\bm{\theta})-\hat L(D,\bm{\theta}) = \frac1n\sum_{i=1}^n \left[ \mathbb{E}_{(\mathbf x,y)\sim\nu}\, \ell\left(f_{\bm{\theta}}(\mathbf x),y\right) - \ell\left(f_{\bm{\theta}}(\mathbf x_i),y_i\right) \right]\,. \end{equation} Each summand is independent, has mean 0, and lies in \([-C,C]\). Hoeffding’s lemma therefore yields \begin{equation} \mathbb{E}_{D} \left[ \exp\left( \lambda \left(L(\bm{\theta})-\hat L(D,\bm{\theta})\right) \right) \right] \le \exp\left(\tfrac{\lambda^{2}C^{2}}{8n}\right), \end{equation} so \(\mathbb{E}_{D} \bigl[Z_{\bm{\theta}}(D)\bigr]\le 1\) for every \(\bm{\theta}\).
Define the (random) moment-generating function \(M(D):=\mathbb{E}_{\bm{\theta}\sim\pi} \bigl[Z_{\bm{\theta}}(D)\bigr]\). By Fubini’s theorem and the previous bound, \(\mathbb{E}_{D}[M(D)]\le 1\). Applying Markov’s inequality gives \begin{equation} \mathbb{P}_{D}\left(M(D) > 1/\delta\right) \le \delta. \end{equation} With probability at least \(1-\delta\) the sample therefore falls in the “good event” \(\mathcal E := \bigl\{D : M(D)\le 1/\delta\bigr\}.\)
Fix \(D\in\mathcal E\) and any posterior \(\rho\in\mathcal P(\bm \Theta)\). Using Jensen’s inequality (the Donsker–Varadhan variational formula), \begin{align} \lambda\,\mathbb{E}_{\rho}\left[L(\bm \theta)-\hat L(D, \bm \theta)\right] -\frac{\lambda^{2}C^{2}}{8n} -\mathrm{KL}(\rho|\pi) &=\mathbb{E}_{\rho} \bigl[\log Z_{\bm{\theta}}(D)\bigr] \\ &\le \log\mathbb{E}_{\rho} \bigl[Z_{\bm{\theta}}(D)\bigr] \\ &\le \log M(D)\;\le\;\log(1/\delta). \end{align} Rearranging and dividing by \(\lambda\) yields \begin{equation} \mathbb{E}_{\rho}\left[L(\bm \theta)\right] \;\le\; \mathbb{E}_{\rho}\left[\hat L(D, \bm \theta)\right] +\frac{\lambda C^{2}}{8n} +\frac{\mathrm{KL}(\rho|\pi)+\log(1/\delta)}{\lambda}\,. \end{equation} Because the argument did not rely on the specific choice of \(\rho\), the bound holds simultaneously for all posteriors. □
Theorem 2.24 states that for any prior \(\pi\) on \(\bm \Theta\), any data-dependent posterior \(\rho\), and any \(\lambda>0\), with probability at least \(1-\delta\) over the draw of the sample \(D\sim \nu^n\), \begin{equation} \mathbb{E}_{\bm\theta\sim\rho}[ L(\bm\theta)] \;\le\; \mathbb{E}_{\bm\theta\sim\rho}[ \hat L(D,\bm\theta)] +\frac{\lambda C^{2}}{8n} +\frac{\mathrm{KL}(\rho|\pi)+\log(1/\delta)}{\lambda}. \end{equation} Thus, the true (population) loss averaged under \(\rho\) is, with high probability, no greater than the empirical loss averaged under \(\rho\) plus a complexity penalty. The penalty has two parts: (i) a variance-like term \(\lambda C^{2}/(8n)\), which decays as \(1/n\), and (ii) a data-dependent term \(\big(\mathrm{KL}(\rho|\pi)+\log(1/\delta)\big)/\lambda\), which shrinks when \(\rho\) stays close to the prior \(\pi\) (small KL) and when looser confidence (\(\delta\) larger) is acceptable. The parameter \(\lambda\) balances these two terms; optimizing the bound in \(\lambda\) yields \begin{equation} \lambda^\star \;=\; \frac{2\sqrt{2n\big(\mathrm{KL}(\rho| \pi)+\log(1/\delta)\big)}}{C}, \end{equation} and \begin{equation} \mathbb{E}_{\rho}[L(\bm \theta)] \le \mathbb{E}_{\rho}[\hat L(D,\bm \theta)] + C\sqrt{\frac{\mathrm{KL}(\rho| \pi)+\log(1/\delta)}{2n}}. \end{equation} Consequently, the generalization gap \(\mathbb{E}_{\rho}[L(\bm \theta)]-\mathbb{E}_{\rho}[\hat L(D, \bm \theta)]\) scales as \begin{equation} O\big(\sqrt{(\mathrm{KL}+\log(1/\delta))/n}\big)\,. \end{equation} Practically, the theorem justifies learning by minimizing the right-hand side over \(\rho\): it leads to objectives of the form \(\mathbb{E}_{\rho}[\hat L(D, \bm \theta)]+\tfrac{1}{\lambda}\mathrm{KL}(\rho| \pi)\), i.e., empirical risk plus a PAC–Bayesian regularizer that keeps \(\rho\) close to the prior. Tighter bounds are obtained when (a) \(n\) is large, (b) the loss bound \(C\) is small, and (c) the posterior \(\rho\) concentrates near \(\pi\) (small \(\mathrm{KL}\)), reflecting low model complexity relative to the prior.
As the bound in Theorem 2.24 holds for every \(\rho\), a principled data-dependent choice is the Gibbs posterior \begin{equation*} \rho_{\lambda}(\mathrm d\bm\theta):= \frac{\exp[-\lambda\,\hat L(D,\bm\theta)]\,\pi(\mathrm d\bm\theta)} {\int_{\bm \Theta}\exp[-\lambda\,\hat L(D,\vartheta)]\,\pi(\mathrm d\vartheta)}. \end{equation*} Owing to the Donsker–Varadhan identity, \(\rho_{\lambda}\) minimizes the right-hand side of the theorem inequality, so that \begin{equation} \label{eq:gibbs-bound} \mathbb{E}_{\rho_{\lambda}}[L(\bm \theta)] \le \inf_{\rho\in\mathcal P(\bm \Theta)} \left\{ \mathbb{E}_{\rho}[\hat L(D,\bm\theta)] +\frac{\lambda C^{2}}{8n} +\frac{\mathrm{KL}(\rho|\pi)+\log(1/\delta)}{\lambda} \right\}. \end{equation} If \(\bm \Theta\) is finite and \(\pi\) is uniform, set \(\lambda^{\star} :=C^{-1}\sqrt{2n\log(2M/\delta)}.\) Plugging \(\lambda^{\star}\) into Equation \(\eqref{eq:gibbs-bound}\) gives \begin{equation} \mathbb{E}_{\rho_{\lambda^{\star}}}\![L(\bm \theta)] \;\le\; \min_{\bm\theta\in\bm \Theta}\hat L(D,\bm\theta) +C\sqrt{\frac{\log(2M/\delta)}{2n}}, \end{equation} matching Theorem 2.23 up to the harmless two‑sided‑tail factor \(\sqrt{2}\).
Conclusions
This chapter has established the theoretical and probabilistic foundations upon which the rest of the thesis builds. Beginning from a measure-theoretic formulation of probability, the basic language of random variables, distributions, and conditional dependencies that underlie all Bayesian inference is introduced. Within this framework, Bayes’ theorem provides a mechanism to update prior beliefs in light of data, transforming uncertainty about model parameters into posterior distributions that drive predictive reasoning. The discussion contrasted Bayesian and frequentist perspectives, emphasizing the decomposition of predictive uncertainty into epistemic and aleatoric components, as well as the role of conjugate priors and computational approximations such as variational inference.
Subsequently, Gaussian processes were presented as non-parametric Bayesian models that define distributions directly over functions, offering a principled way to quantify uncertainty in predictions. Their dual interpretation—both as Bayesian linear models in feature space and as Gaussian measures in Hilbert space—reveals their centrality in unifying kernel methods, Bayesian inference, and function-space learning. Practical considerations such as hyperparameter learning, scalability via inducing points, and the use of approximate inference schemes were discussed, highlighting both their flexibility and computational challenges.
Finally, the chapter connected Bayesian inference to the theory of generalization through Probably Approximately Correct (PAC) and PAC–Bayesian bounds. These frameworks formalize how predictive performance can be bounded in expectation and how prior knowledge can regularize learning in a principled manner. Together, the concepts introduced here provide the mathematical and conceptual foundation for the methods developed in subsequent chapters, which extend Bayesian reasoning to deep architectures and analyze their generalization behavior in modern over-parameterized regimes.