Generalization in Neural Networks
The previous chapter focused on methods for post-hoc uncertainty estimation in pre-trained deep networks, introducing the Variational Linearized Laplace Approximation (VaLLA) and Fixed-Mean Gaussian Processes (FMGPs). These approaches demonstrated how probabilistic reasoning can be incorporated into deterministic neural architectures to produce well-calibrated predictive distributions without retraining. Having established a principled framework for estimating uncertainty, we now turn to the complementary question of why deep learning models—often highly over-parameterized—are nevertheless capable of generalizing effectively. Understanding this phenomenon requires moving beyond the estimation of predictive uncertainty to the study of the fundamental mechanisms that govern generalization itself.
Introduction
The success of modern deep neural networks (DNNs) appears paradoxical when viewed through the lens of classical statistical learning theory. Despite being heavily over-parameterized, often containing far more parameters than training samples, these models not only fit their training data perfectly but also generalize remarkably well to unseen inputs. This empirical observation stands in stark contrast to the predictions of traditional uniform-convergence frameworks, such as VC or Rademacher bounds, and their PAC and PAC–Bayesian generalizations reviewed in Section 2.4, which suggest that interpolation should lead to poor generalization. Bridging this theoretical gap has become one of the central challenges in contemporary machine learning research.
This chapter unifies and extends two complementary lines of work aimed at understanding why deep networks generalize in the over–parameterized regime. The first investigates explicit and ensemble–based mechanisms that promote generalization by encouraging diversity among models, while the second develops a distribution–dependent theoretical framework—based on large–deviation and Chernoff–style bounds—that explains generalization even when interpolation occurs. Together, these perspectives provide a cohesive view that links the geometry of the loss landscape, the stochasticity of training dynamics, and the concentration of empirical losses to the observed generalization behavior of deep models.
A natural connection between this chapter and the previous one lies in viewing generalization as a form of predictive uncertainty control. While post-hoc uncertainty estimation methods such as VaLLA and FMGPs quantify uncertainty after training, the mechanisms that enable generalization can be seen as constraining uncertainty during training, by limiting how model predictions vary across unseen data. From this perspective, generalization arises not only from architectural design or optimization choices, but from how the learning dynamics shape the effective posterior over model hypotheses. Concepts such as flat minima, ensemble diversity, and loss concentration can be interpreted within this probabilistic framework: each corresponds to a preference for solutions whose predictive distributions remain stable under small perturbations of data, parameters, or initialization. This viewpoint provides a coherent transition from uncertainty quantification to the theoretical study of generalization developed in this chapter.
Section 5 introduces a rigorous theory of diversity and generalization in deep ensembles. It formalizes the long-standing intuition that ensembles perform best when their constituent members make diverse errors. A unified diversity measure is proposed, leading to a decomposition of the ensemble generalization error into two interpretable components: the average individual error and a diversity-dependent correction term. This decomposition is then grounded in the PAC-Bayesian framework, yielding distribution-dependent generalization guarantees and practical regularization objectives that encourage beneficial diversity. Empirical results on image classification and regression tasks confirm that the theoretical diversity-error relationship aligns with observed ensemble performance.
Section 5.3 develops PAC–Chernoff bounds, a new class of generalization bounds that remain meaningful in the interpolation regime. By expressing the generalization error in terms of a rate function derived from large-deviation theory, the analysis captures how smoothness, regularization, and architectural invariances influence generalization. The resulting bounds are tight for any model that perfectly fits its training data and naturally reproduce the double–descent phenomenon observed in practice. Classical regularizers such as weight decay, distance–from–initialization penalties, or input–gradient control are all shown to act by enlarging the rate function, effectively increasing model smoothness and improving generalization.
Finally, Section 5.4 examines the implicit bias of stochastic gradient descent (SGD) through the same large–deviation perspective. The stochasticity of mini–batch updates is shown to induce an implicit regularization effect, biasing learning toward flatter minima with higher loss concentration. This provides a distribution–dependent explanation for SGD’s empirical tendency to find solutions with superior generalization properties, and clarifies how optimization hyperparameters—such as learning rate, batch size, and noise structure—affect this bias.
In summary, this chapter demonstrates that generalization in modern neural networks can be understood as an interplay between diversity, smoothness, and concentration. Whether through explicit ensemble training, distribution-dependent PAC-Chernoff bounds, or the stochastic dynamics of SGD, these mechanisms collectively define how deep learning models achieve robust performance beyond classical theoretical limits.
Diversity and Generalization in Deep Neural Network Ensembles
Ensemble methods are one of the most widely used and studied techniques in machine learning (Breiman, 1996, 2001; Hansen & Salamon, 1990). They have been successfully applied in many real-world problems (Girshick et al., 2014; Wang et al., 2012; Ykhlef & Bouchaffra, 2017; Zhou et al., 2014) and are usually part of the winning strategies in many machine learning competitions (Chen & Guestrin, 2016; Hoch, 2015; Puurula et al., 2014; Stallkamp et al., 2012). Recently, ensembles have also become very popular to improve uncertainty modeling in deep neural networks (Lakshminarayanan et al., 2017; Maddox et al., 2019; Wen et al., 2019; Wenzel et al., 2020).
Ensembles are created by combining several individual predictors. It is widely accepted (Dietterich, 2000; Lu et al., 2010) that the prediction performance of an ensemble jointly depends on the individual performance and the diversity of its individual members. Intuitively speaking, a set of predictors is diverse when their predictions do not coincide on all the samples. It is known that when classifiers are diverse, they tend to make independent errors; therefore, when they are aggregated, their errors tend to cancel out (Berend & Kontorovich, 2016), which improves the ensemble prediction. For this reason, diversity has long been recognized as a key factor in ensemble performance (Brown et al., 2005; Cunningham & Carney, 2000; Kuncheva & Whitaker, 2003). The same cancellation of errors effect happens in the case of neural network ensembles (Hansen & Salamon, 1990; Lakshminarayanan et al., 2017; Lee et al., 2016) where heuristic measures of diversity are usually analyzed to get insights into the ensemble learning algorithms (Fort et al., 2019; Wen et al., 2019; Wenzel et al., 2020).
Unfortunately, there is a lack of consensus surrounding the underlying theory that can explain the role of diversity in the generalization performance of ensembles. The error rate of an ensemble and an individual predictor, for example, is well defined by the use of a loss function, but there is no well-established definition of diversity (Kuncheva & Whitaker, 2003). It is not clear how exactly the diversity among ensemble members affects the generalization error of the ensemble.
In this work, a novel theoretical framework that explains the relationship between diversity and the generalization performance of an ensemble is introduced. This theoretical framework is derived from previously published results with no direct connection among them (Krogh & Vedelsby, 1994; Masegosa, 2020; Masegosa et al., 2020). The main contribution is finding a theoretically sound way to combine these previous results in a single theoretical framework that explains the role of diversity in the generalization performance of a wide range of different ensembles. This general framework could potentially help the machine learning community to have a better understanding of the underlying trade-offs that have to be considered when designing novel ensemble learning algorithms, especially in the context of neural networks. The detailed contributions of this framework are the following:
A general measure of ensemble diversity,
A theoretical analysis that shows how the correlation among ensemble members affects diversity,
The exact trade-off that exists between this diversity measure, the performance of the individual predictors, and the generalization error of the ensemble,
An analysis of the strategies used by most of the current neural network ensemble learning algorithms to promote diversity,
An empirical evaluation of this theoretical framework.
This analysis covers model averaging and weighted majority vote ensembles under the cross-entropy loss, square error, and 0-1 loss.
Related Work
The concept of diversity has been described under various labels, including ambiguity (Krogh & Vedelsby, 1994), dependency (Zhou, 2012), orthogonality (Kuncheva & Whitaker, 2003), and disagreement (Masegosa et al., 2020), among others. An extensive literature proposes numerous measures of diversity, most of which are defined directly from the predictions of individual models (Buschjäger et al., 2020; Chandra & Yao, 2004; Kuncheva & Whitaker, 2003; Roli et al., 2001; Tang et al., 2006; Zhou & Li, 2010). Nevertheless, none of these works introduces a generic measure that is applicable across heterogeneous ensemble types while maintaining a formal connection to the ensemble generalization error.
Diversity and Generalization
The earliest theoretical attempts to explain why diversity reduces ensemble error date back to Krogh & Vedelsby (1994) and Geman et al. (1992) in the context of regression ensembles. These analyses demonstrate that higher diversity can lower prediction error; however, they do not characterize how the empirical diversity of a given ensemble relates to its generalization error. This thesis extends the results of Krogh & Vedelsby (1994) and establishes such a link for regression ensembles via PAC-Bayesian bounds (McAllester, 1998). The analysis further carries over to ensembles of classifiers.
Several studies have sought to adapt the squared-error decomposition used for regression ensembles (Krogh & Vedelsby, 1994) to classification settings (Brown, 2009; Jiang et al., 2017; Yu et al., 2011; Zhou, 2012). Building on PAC-Bayesian upper bounds on the ensemble risk, the present work derives a new decomposition applicable to a broad class of loss functions; the decomposition explicitly separates a term due to the errors of individual members from a diversity component.
In the setting of majority-vote ensembles and PAC-Bayesian theory, Laviolette et al. (2011) and Germain et al. (2015) proposed bounds that depend on the error rate and the variance of the individual classifiers, where the variance term admits an interpretation as a diversity measure. However, this analysis is restricted to binary classification.
More recently, Masegosa et al. (2020) developed a PAC-Bayesian analysis of majority-vote ensembles for multiclass classification. Nonetheless, the resulting bound includes an explicit diversity term only in the binary case, and that term coincides with the one of Germain et al. (2015) and Laviolette et al. (2011).
This thesis extends the results of Masegosa et al. (2020) by providing a PAC-Bayesian bound for multiclass classification that explicitly incorporates a novel diversity term. Additional connections are established with regression ensembles and with model averaging for probabilistic classifiers.
In the context of model averaging for probabilistic classifiers, Masegosa (2020) introduced a PAC-Bayesian analysis whose bound explicitly contains a diversity term. However, that work primarily addresses Bayesian model averaging under model misspecification; the role of diversity in ensemble learning is only briefly discussed and is neither theoretically nor empirically examined. The present thesis undertakes a substantially more comprehensive theoretical and empirical study of how diversity affects the generalization performance of several types of ensembles.
Diversity and Ensemble Learning
Virtually all ensemble methods promote diversity among their constituent models, either implicitly or explicitly. Classical techniques such as Bagging (Breiman, 1996, 2001) and Boosting (Freund & Schapire, 1996) encourage diversity implicitly by generating varied training datasets. Deep ensembles likewise rely on implicit mechanisms, including random initialization (Lakshminarayanan et al., 2017; Wen et al., 2019), modifications to the optimizer (Maddox et al., 2019; Wenzel et al., 2020; Zhang et al., 2019), and heterogeneous hyperparameter configurations (Wenzel et al., 2020). This thesis provides a theoretical account of why such diversity-inducing strategies lead to superior ensemble performance.
A growing body of ensemble-learning methods optimizes loss functions augmented with terms that explicitly induce diversity (Buschjäger et al., 2020; Jain et al., 2020; Jiang et al., 2017; Liu & Yao, 1999; Pang et al., 2019). However, these formulations generally lack a formal link to the ensemble’s generalization error, in contrast to the approach developed here. Masegosa (2020) proposed an ensemble-learning algorithm for the cross-entropy loss derived from PAC-Bayesian bounds that explicitly promote diversity, but its evaluation was limited to ensembles of simple neural networks. The present thesis substantially extends that empirical study and applies the analysis to regression ensembles and weighted majority-vote ensembles.
Preliminaries
Before introducing the main theoretical results, it is convenient to establish notation and clarify what is meant by an ensemble and a \(\rho\)-weighted predictor.
In general terms, an ensemble is a collection of predictors—also called members or base models—that are trained independently (or semi-independently) and whose predictions are combined to form a single, aggregate output. Ensembles are widely used in machine learning to reduce variance, improve robustness, and obtain better-calibrated uncertainty estimates compared to any individual model. Formally, if \(\bm{\Theta}\) denotes the parameter space of individual models and \(h_{\bm{\theta}}=h(\cdot;\bm{\theta})\) is a predictor parameterized by \(\bm{\theta}\in\bm{\Theta}\), an ensemble corresponds to a finite or continuous collection of such predictors indexed by \(\bm{\theta}\). The ensemble prediction is then obtained by averaging or voting over its members according to a weighting scheme.
Let \(\rho\) be a probability distribution (or discrete probability mass function) defined over the parameter space \(\bm{\Theta}\). This distribution specifies how much weight or importance is assigned to each individual predictor \(h_{\bm{\theta}}\) within the ensemble. The expectation \(\mathbb{E}_{\rho}[\,\cdot\,]\) thus denotes an average with respect to this distribution. In the simplest case, \(\rho\) is uniform over \(K\) members, \(\rho(\bm{\theta}_k)=1/K\), corresponding to an equally weighted ensemble. More generally, \(\rho\) may assign higher weight to better-performing or more diverse members. The term \(\rho\)-weighted therefore refers to any ensemble quantity (e.g., prediction, loss, or risk) that is defined as an expectation with respect to \(\rho\).
Formally, let \(D=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)\}\) denote a set of independent and identically distributed samples drawn from an unknown distribution \(\nu\) over \(\mathcal{X} \times \mathcal{Y}\), where \(\mathcal{X}\) and \(\mathcal{Y}\) denote the input and output spaces, respectively. For a single model \(h_{\bm{\theta}}\), its prediction for an input \(\mathbf{x}\in\mathcal{X}\) is \(h_{\bm{\theta}}(\mathbf{x})=h(\mathbf{x};\bm{\theta})\). An ensemble predictor is then defined by integrating (or summing) the predictions of individual members under \(\rho\), yielding the \(\rho\)-weighted ensemble prediction \begin{equation} h_\rho(\mathbf{x}) = \mathbb{E}_{\rho}[\,h_{\bm{\theta}}(\mathbf{x})\,]. \end{equation} In the common case of a finite ensemble composed of \(K\) predictors \(\{h_{\bm{\theta}_1}, \ldots, h_{\bm{\theta}_K}\}\), the distribution \(\rho\) reduces to a discrete probability mass function over these members, \(\rho(\bm{\theta}_k) = w_k\), where \(w_k \ge 0\) and \(\sum_{k=1}^K w_k = 1\). The ensemble output is therefore the weighted average of the individual predictions: \begin{equation} h_\rho(\mathbf{x}) = \sum_{k=1}^{K} w_k\, h_{\bm{\theta}_k}(\mathbf{x}). \end{equation} When all members are assigned equal weights, \(w_k = 1/K\), this simplifies to the uniform average: \begin{equation} h_\rho(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} h_{\bm{\theta}_k}(\mathbf{x}), \end{equation} which corresponds to the standard ensemble used in deep learning practice, such as in bagging or deep ensembles, and is the main focus of this thesis. In this sense, \(\rho\) encodes both the composition of the ensemble and the relative influence of each member in forming the aggregate prediction.
Depending on the type of learning problem, an ensemble combines its members’ predictions in different ways, leading to distinct forms of the ensemble predictor \(h_\rho(\mathbf{x})\) and its associated loss function \(\ell(\rho,\mathbf{x},y)\). The three most common cases are as follows:
Regression ensembles use the \(\rho\)-weighted model average predictor and the squared error loss (denoted as the \(sq\)-loss). The prediction of the ensemble for a given input \(\mathbf{x}\) is the \(\rho\)-weighted average of the individual model predictions: \begin{equation} h_\rho(\mathbf{x}) = \mathbb{E}_{\rho}[h(\mathbf{x};\bm{\theta})]. \end{equation} For a specific data point \((\mathbf{x},y)\), the loss of an individual regressor is \begin{equation} \ell_{sq}(\bm{\theta},\mathbf{x},y) = (y - h(\mathbf{x};\bm{\theta}))^2, \end{equation} and the ensemble loss is \begin{equation} \ell_{sq}(\rho,\mathbf{x},y) = \big(y - \mathbb{E}_{\rho}[h(\mathbf{x};\bm{\theta})]\big)^2. \end{equation} In other words, the ensemble predicts the mean output across its members and measures the squared deviation from the true value.
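As a minimal numerical sketch of the regression case (an illustration, not part of the original analysis), the following NumPy snippet computes the \(\rho\)-weighted prediction and the individual and ensemble squared losses for a finite ensemble. The function names `ensemble_mean` and `sq_losses`, the array layout, and the toy values are assumptions made for this example only.

```python
import numpy as np

def ensemble_mean(preds, weights):
    """rho-weighted average of member predictions.

    preds:   shape (K, n), preds[k, i] = h(x_i; theta_k)
    weights: shape (K,), nonnegative and summing to 1 (the pmf rho)
    """
    preds = np.asarray(preds, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ preds  # shape (n,)

def sq_losses(preds, weights, y):
    """Per-member squared losses (K, n) and ensemble squared loss (n,)."""
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    member = (y[None, :] - preds) ** 2
    ensemble = (y - ensemble_mean(preds, weights)) ** 2
    return member, ensemble

# Toy data (hypothetical): two members whose errors on the first point
# have opposite signs, so they cancel in the average.
preds = np.array([[1.0, 2.0],
                  [3.0, 2.0]])
weights = np.array([0.5, 0.5])
y = np.array([2.0, 2.0])

member, ensemble = sq_losses(preds, weights, y)
print(member)
print(ensemble)
```

In this toy case each member has a nonzero loss on the first point, while the averaged prediction matches the target there, which is the error-cancellation effect described in the text.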
Weighted majority voting ensembles are defined for classification tasks using the \(\rho\)-weighted majority vote predictor and the zero–one loss (denoted as \(0/1\)-loss). Each individual classifier \(h_{\bm{\theta}}\) outputs a discrete class label in \(\mathcal{Y}\). The ensemble prediction is obtained by computing, for each possible class \(y'\), the \(\rho\)-weighted probability that the individual classifiers predict \(y'\), and selecting the class with the highest average support: \begin{equation} h_\rho(\mathbf{x}) = \argmax_{y' \in \mathcal{Y}} \mathbb{E}_{\rho}\big[\mathbb{I}(h(\mathbf{x};\bm{\theta}) = y')\big]. \end{equation} For a finite ensemble, this corresponds to the familiar majority vote: \begin{equation} h_\rho(\mathbf{x}) = \argmax_{y' \in \mathcal{Y}} \sum_{k=1}^K w_k\, \mathbb{I}\!\left(h(\mathbf{x};\bm{\theta}_k) = y'\right). \end{equation} The individual loss for a data point \((\mathbf{x},y)\) is \begin{equation} \ell_{0/1}(\bm{\theta},\mathbf{x},y) = \mathbb{I}[\,h(\mathbf{x};\bm{\theta}) \neq y\,], \end{equation} and the ensemble loss is \begin{equation} \ell_{0/1}(\rho,\mathbf{x},y) = \mathbb{I}\!\left[\argmax_{y' \in \mathcal{Y}} \mathbb{E}_{\rho}\big[\mathbb{I}(h(\mathbf{x};\bm{\theta}) = y')\big] \neq y \right]. \end{equation} Thus, the ensemble prediction corresponds to a weighted vote across models, and the loss indicates whether the majority prediction matches the ground truth.
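The weighted vote for a finite ensemble can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions: labels are encoded as integers in \(\{0,\ldots,C-1\}\), the helper name `weighted_majority_vote` is hypothetical, and ties are broken toward the lowest class index (a convention of `argmax`, not something specified in the text).

```python
import numpy as np

def weighted_majority_vote(labels, weights, num_classes):
    """rho-weighted majority vote.

    labels:  int array (K, n), labels[k, i] = h(x_i; theta_k)
    weights: array (K,), the pmf rho over the K members
    Returns the voted label for each of the n inputs.
    """
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    K, n = labels.shape
    cols = np.arange(n)
    support = np.zeros((num_classes, n))
    for k in range(K):
        # Add member k's weight to the class it votes for on each input.
        support[labels[k], cols] += weights[k]
    # Ties go to the lowest class index (assumption of this sketch).
    return support.argmax(axis=0)

# Toy data (hypothetical): three members, three inputs, three classes.
labels = np.array([[0, 1, 2],
                   [0, 2, 2],
                   [1, 1, 0]])
weights = np.array([0.5, 0.3, 0.2])
y = np.array([0, 2, 2])

vote = weighted_majority_vote(labels, weights, num_classes=3)
member_loss = (labels != y).astype(float)   # 0/1 loss of each member
ensemble_loss = (vote != y).astype(float)   # 0/1 loss of the vote
print(vote, ensemble_loss)
```

On the second input the vote is wrong because the members predicting the wrong class carry total weight 0.7, which is at least one half, the situation that the proof of Theorem 5.1 exploits for the \(0/1\)-loss.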
Model averaging ensembles are used for probabilistic classifiers, where each model outputs a predictive distribution over the labels, \(h(\mathbf{x};\bm{\theta}) = P(\cdot|\mathbf{x},\bm{\theta})\). The ensemble prediction is the \(\rho\)-weighted average of these probability distributions: \begin{equation} P_\rho(y|\mathbf{x}) = \mathbb{E}_{\rho}[P(y|\mathbf{x},\bm{\theta})], \end{equation} which, in the finite case, becomes \begin{equation} P_\rho(y|\mathbf{x}) = \sum_{k=1}^K w_k\, P(y|\mathbf{x},\bm{\theta}_k). \end{equation} This is the standard approach used in Bayesian model averaging and deep ensembles, where the final predicted probability is the mean of the members’ probabilities. For a specific data point \((\mathbf{x},y)\), the individual and ensemble cross-entropy losses are \begin{equation} \ell_{ce}(\bm{\theta},\mathbf{x},y) = -\log P(y|\mathbf{x},\bm{\theta}), \quad \ell_{ce}(\rho,\mathbf{x},y) = -\log \mathbb{E}_{\rho}[P(y|\mathbf{x},\bm{\theta})]. \end{equation} The ensemble prediction is thus a mixture of individual predictive distributions, penalized by the negative log-likelihood of the true label.
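A small sketch of model averaging under the cross-entropy loss is given below. It assumes a finite ensemble whose members return class-probability vectors; the function name `ce_losses`, the array layout, and the toy numbers are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def ce_losses(probs, weights, y):
    """Individual and ensemble cross-entropy losses.

    probs:   shape (K, n, C), probs[k, i, c] = P(c | x_i, theta_k)
    weights: shape (K,), the pmf rho
    y:       shape (n,), integer labels in {0, ..., C-1}
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    y = np.asarray(y)
    n = probs.shape[1]
    # Probability each member assigns to the true label: shape (K, n).
    p_true = probs[:, np.arange(n), y]
    member = -np.log(p_true)               # -log P(y | x, theta_k)
    ensemble = -np.log(weights @ p_true)   # -log E_rho[P(y | x, theta)]
    return member, ensemble

# Toy data (hypothetical): two members, one input, two classes.
probs = np.array([[[0.9, 0.1]],
                  [[0.5, 0.5]]])
weights = np.array([0.5, 0.5])
y = np.array([0])

member, ensemble = ce_losses(probs, weights, y)
avg_member = weights @ member
print(avg_member, ensemble)
```

By Jensen's inequality the ensemble loss never exceeds the \(\rho\)-weighted average of the individual losses; the gap between the two is what the diversity term of Theorem 5.1 lower-bounds.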
For any of these loss functions, denoted generically as \(\ell(\bm{\theta},\mathbf{x},y)\), the empirical loss of an individual model \(h_{\bm{\theta}}\) over a dataset \(D\) is \begin{equation} \hat{L}(\bm{\theta},D) = \frac{1}{n}\sum_{i=1}^n \ell(\bm{\theta},\mathbf{x}_i,y_i), \end{equation} and its expected population loss is \begin{equation} L(\bm{\theta}) = \mathbb{E}_{\nu}[\ell(\bm{\theta},\mathbf{x},y)], \end{equation} where the expectation \(\mathbb{E}_{\nu}\) is taken over \((\mathbf{x},y) \sim \nu\). Analogously, the expected loss of an ensemble is \begin{equation} L(\rho) = \mathbb{E}_{\nu}[\ell(\rho,\mathbf{x},y)]. \end{equation} Throughout the remainder of this chapter, the quantities \(L(\rho)\), \(L(\bm{\theta})\), and \(\hat{L}(\bm{\theta},D)\) are written without explicit subscripts when the context is clear. Unless otherwise stated, all analyses apply to any of the three ensemble settings above. When an explicit subscript is provided (i.e., \(sq\), \(0/1\), or \(ce\)), it refers to the corresponding ensemble model.
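For concreteness, the empirical quantities above reduce to simple averages once the per-sample losses are available. The sketch below assumes a matrix of per-member, per-sample losses has already been computed with any of the three loss functions; the name `empirical_losses` and the toy matrix are hypothetical.

```python
import numpy as np

def empirical_losses(member_losses, weights):
    """Empirical loss of each member and its rho-weighted average.

    member_losses: shape (K, n), entry [k, i] = loss(theta_k, x_i, y_i)
    weights:       shape (K,), the pmf rho
    """
    member_losses = np.asarray(member_losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    per_member = member_losses.mean(axis=1)   # \hat{L}(theta_k, D)
    rho_avg = float(weights @ per_member)     # E_rho[\hat{L}(theta, D)]
    return per_member, rho_avg

# Toy 0/1 losses (hypothetical) for two members on four samples.
member_losses = np.array([[1, 0, 0, 0],
                          [1, 1, 0, 0]])
per_member, rho_avg = empirical_losses(member_losses, [0.5, 0.5])
print(per_member, rho_avg)
```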
Diversity and Generalization
This section develops a PAC-Bayesian bound for multiclass classification that explicitly includes a novel term quantifying the diversity of the ensemble.
Decomposing the Loss of an Ensemble via an Upper Bound
The following result formalizes the intuition that the generalization ability of an ensemble depends not only on the average accuracy of its members but also on how diverse their predictions are. The bound shows that the expected ensemble loss \(L(\rho)\) can be upper-bounded by the average individual loss \(\mathbb{E}_\rho[L(\bm{\theta})]\) minus a nonnegative diversity term \(\mathbb{D}(\rho)\). In other words, even if the constituent models have similar average performance, the ensemble can achieve a strictly lower overall loss whenever its members make uncorrelated or complementary errors.
The factor \(\alpha\) accounts for the scaling properties of different loss functions, ensuring the bound holds uniformly across regression, probabilistic, and classification settings. For the squared-error loss, the bound becomes exact, confirming that the ensemble’s improvement arises precisely from the variance reduction due to averaging. The diversity measures \(\mathbb{D}_{sq}(\rho)\), \(\mathbb{D}_{ce}(\rho)\), and \(\mathbb{D}_{0/1}(\rho)\) quantify, for each case, the expected variability in predictions across ensemble members under the data distribution \(\nu\). Each can be written as the expected variance (with respect to \(\rho\)) of a function \(f(y, \mathbf{x}; \bm{\theta})\) determined by the loss type, unifying all three formulations under a common framework.
In summary, the theorem provides a principled decomposition of the ensemble’s error into two interpretable components:
the average loss of the individual models, reflecting their collective accuracy, and
a diversity term that captures how much the models disagree.
A higher diversity \(\mathbb{D}(\rho)\) therefore implies a tighter upper bound—and potentially a lower ensemble loss—offering a theoretical justification for the well-known empirical benefit of combining diverse models.
Theorem 5.1. For any distribution \(\rho\) over \(\bm{\Theta}\), and any of the three considered loss functions for ensembles, there exists a function \(\mathbb{D}(\rho)\) such that \begin{equation} L(\rho) \leq \alpha(\mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho)), \end{equation} where \(\alpha\) equals \(1\) if the \(sq\)-loss or the \(ce\)-loss is considered, and \(4\) for the \(0/1\)-loss. Furthermore, for the \(sq\)-loss, this inequality becomes an equality. The expression of the diversity measure for each of these loss functions is: \[\begin{eqnarray} \mathbb{D}_{sq}(\rho) &:=& \mathbb{E}_\nu\Big[\mathbb{V}_\rho(h_R(\mathbf{x};\bm{\theta}))\Big],\\ \mathbb{D}_{ce}(\rho) &:=& \mathbb{E}_\nu\left[\mathbb{V}_\rho\left(\frac{P(y |\mathbf{x},\bm{\theta})}{\sqrt{2} \max_{\bm{\theta}} P(y |\mathbf{x},\bm{\theta})}\right)\right],\\ \mathbb{D}_{0/1}(\rho) &:=& \mathbb{E}_\nu\Big[\mathbb{V}_\rho\Big(\mathbb{I}(h_W(\mathbf{x};\bm{\theta})\neq y)\Big)\Big], \end{eqnarray}\] where \(\mathbb{V}_\rho(\cdot)\) denotes the variance of a function w.r.t. \(\rho\), and \(h_R\) and \(h_W\) denote an individual regressor and an individual classifier of a weighted majority vote ensemble, respectively. For \(\mathbb{D}_{ce}(\rho)\) to be well-defined, it must hold that \(0<\max_{\bm{\theta}\in \bm{\Theta}} P(y | \mathbf{x},\bm{\theta})\leq 1\) for every \((\mathbf{x}, y) \in \operatorname{supp}(\nu)\). Finally, note that all diversity terms described above can be written as \begin{equation} \mathbb{D}(\rho) = \mathbb{E}_\nu \Big[ \mathbb{V}_{\rho} \left( f(y,\mathbf{x};\bm{\theta}) \right)\Big], \end{equation} with a specific function \(f\) for each of the loss functions.
Proof
The proof relies on the following decomposition of the variance of a function \(f\) w.r.t. \(\rho\): \begin{align} \mathbb{V}_{\rho}(f(\bm{\theta})) &= \mathbb{E}_\rho[f(\bm{\theta})^2] - \mathbb{E}_\rho[f(\bm{\theta})]^2 = \mathbb{E}_\rho[f(\bm{\theta})^2] - \mathbb{E}_{\rho^2}[f(\bm{\theta})f(\bm{\theta}')] \\ &= \mathbb{E}_{\rho^2}\Big[f(\bm{\theta})^2 - f(\bm{\theta})f(\bm{\theta}')\Big], \end{align} where \(f\) is determined by the considered loss function: for the \(sq\)-loss, \(f(\bm{\theta})= h_R(\bm{x};\bm{\theta})\); for the \(ce\)-loss, \(f(\bm{\theta})= p(y|\bm{x},\bm{\theta}) / \big(\sqrt{2}\max_{\bm{\theta}\in \bm{\Theta}} p(y|\bm{x},\bm{\theta})\big)\); and for the \(0/1\)-loss, \(f(\bm{\theta}) = \mathbb{I}(h(\bm{x};\bm{\theta})\neq y)\). Hence, the diversity terms \(\mathbb{D}(\rho)\) defined in Theorem 5.1 can be written as \begin{align} \mathbb{D}_{sq}(\rho) & = \mathbb{E}_{\rho^2}\Big[\mathbb{E}_\nu\Big[h_R(\bm{x};\bm{\theta})^2 - h_R(\bm{x};\bm{\theta})h_R(\bm{x};\bm{\theta}')\Big]\Big]\\ \mathbb{D}_{0/1}(\rho) & = \mathbb{E}_{\rho^2}\Big[\mathbb{E}_\nu\Big[\mathbb{I}(h(\bm{x};\bm{\theta})= y)\mathbb{I}(h(\bm{x};\bm{\theta}')\neq y)\Big]\Big]\\ \mathbb{D}_{ce}(\rho) & = \mathbb{E}_{\rho^2}\left[\mathbb{E}_\nu\left[\frac{p( y|\bm{x},\bm{\theta})^2 - p(y\mid \bm{x},\bm{\theta})p(y|\bm{x},\bm{\theta}')}{\displaystyle 2\max_{\bm{\theta}\in \bm{\Theta}} p(y|\bm{x},\bm{\theta})^2}\right]\right] \end{align} where \(\rho^2\) is a shorthand for the product distribution \(\rho \times \rho\) over \(\bm{\Theta} \times \bm{\Theta}\), that is, \(\mathbb{E}_{\rho^2}[f(\bm{\theta},\bm{\theta}')] = \mathbb{E}_{\bm{\theta}\sim\rho, \bm{\theta}'\sim\rho}[f(\bm{\theta},\bm{\theta}')]\). We start with the \(sq\)-loss. Recall the definition of the expected squared error of a regression ensemble parameterized by the distribution \(\rho\), that is, \begin{equation} L_{sq}(\rho) = \mathbb{E}_\nu \left[(y-h_R(\bm{x};\rho))^2\right], \end{equation} where \(h_R(\bm{x}; \rho) = \mathbb{E}_\rho \left[h_R(\bm{x} ; \bm{\theta})\right]\) with \(h_R(\bm{x} ; \bm{\theta})\) an individual regression model. 
The desired result can be obtained by expanding the squares in both terms on the right-hand side of the claimed equality, that is: \begin{align} \mathbb{E}_\rho\left[L_{sq}(\bm{\theta})\right] &= \mathbb{E}_{\rho, \nu} \left[(y - h_R(\bm{x}; \bm{\theta}))^2\right] = \mathbb{E}_{\rho, \nu} \left[y^2 - 2yh_R(\bm{x}; \bm{\theta}) + h_R(\bm{x} ; \bm{\theta})^2\right]\\ &= \mathbb{E}_\nu \left[y^2 -2y h_R(\bm{x};\rho) + \mathbb{E}_\rho [h_R(\bm{x}; \bm{\theta})^2]\right], \end{align} where the fact that \(y\) is constant under \(\mathbb{E}_\rho\) is used. On the other hand, \begin{align} \mathbb{D}_{sq}(\rho) &= \mathbb{E}_{\nu,\rho} \left[ (h_R(\bm{x}; \bm{\theta}) - \mathbb{E}_\rho[h_R(\bm{x};\bm{\theta})])^2\right] = \mathbb{E}_{\nu,\rho} \left[ (h_R(\bm{x}; \bm{\theta}) - h_R(\bm{x};\rho))^2\right]\\ &= \mathbb{E}_{\nu,\rho} \left[ h_R(\bm{x}; \bm{\theta})^2 -2h_R(\bm{x}; \bm{\theta})h_R(\bm{x}; \rho) + h_R(\bm{x};\rho)^2\right]\\ &= \mathbb{E}_\nu \left[ \mathbb{E}_\rho[h_R(\bm{x}; \bm{\theta})^2] -2h_R(\bm{x}; \rho)^2 + h_R(\bm{x};\rho)^2 \right]. \end{align} Finally, subtracting the second expression from the first: \begin{align} \mathbb{E}_\rho[L_{sq}(\bm{\theta})] - \mathbb{D}_{sq}(\rho) &= \mathbb{E}_\nu \left[ y^2 - 2yh_R(\bm{x}; \rho) + h_R(\bm{x}; \rho)^2\right]\\ &= \mathbb{E}_\nu \left[(y - h_R(\bm{x}; \rho))^2\right]\\ &= L_{sq}(\rho). \end{align} Continuing with the cross-entropy loss, Taylor's theorem with a second-order remainder is applied to the logarithm. That is, given \(x > 0\) and a fixed value \(a > 0\), \begin{equation} \log x = \log a + \frac{1}{a}(x - a) - \frac{1}{2 \xi^2}(x - a)^2, \quad \text{for some } \xi \text{ between } x \text{ and } a. 
\end{equation} Applying this to \(p(y|\bm{x}, \bm{\theta})\) centered at \(\mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] > 0\), \begin{align} \log p(y|\bm{x}, \bm{\theta}) &= \log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] + \frac{1}{\mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]}\left( p(y|\bm{x}, \bm{\theta})- \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] \right) \\ &\quad- \frac{1}{2\xi^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\,. \end{align} Taking expectations over \(\rho\) on both sides, the first-order term vanishes and \begin{equation} \mathbb{E}_\rho [\log p(y|\bm{x}, \bm{\theta})] = \log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] - \mathbb{E}_\rho \left[\frac{1}{2\xi^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\right]. \end{equation} Rearranging terms, \begin{equation} -\log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]= -\mathbb{E}_\rho [\log p(y|\bm{x}, \bm{\theta})]- \mathbb{E}_\rho \left[\frac{1}{2\xi^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\right]. \end{equation} The desired inequality arises from the fact that \(\xi\) lies between \(p(y|\bm{x}, \bm{\theta})\) and \(\mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\), and hence is upper bounded by \(\max_{\bm{\theta} \in \bm{\Theta}} p(y|\bm{x}, \bm{\theta})\), so that \(1/(2\xi^2) \geq 1/(2\max_{\bm{\theta}} p(y|\bm{x}, \bm{\theta})^2)\). Additionally, the square in the last term is nonnegative, so the whole term is nonnegative. Using these two properties, \begin{align} -\log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] &\leq -\mathbb{E}_\rho [\log p(y|\bm{x}, \bm{\theta})]\\ &\quad- \mathbb{E}_\rho \left[\frac{1}{2 \max_{\bm{\theta}} p(y|\bm{x}, \bm{\theta})^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\right]. \end{align} Finally, taking expectations w.r.t. \(\nu\) on both sides yields the desired result. Lastly, consider the \(0/1\)-loss. 
In order to prove this result, we use Markov’s inequality for monotonically increasing functions, in this case \(\psi(a) = a^2\). That is, for a given random variable \(X\), \begin{equation} \mathbb{P}(|X| \geq a) \leq \frac{\mathbb{E}[\psi(|X|)]}{\psi(a)} = \frac{\mathbb{E}[|X|^2]}{a^2}, \quad \text{for any } a > 0. \end{equation} Applying this inequality with \(a = 0.5\) to \(\mathbb{E}_\rho\left[ \mathbb{I}(h_W(\bm{x} ; \bm{\theta}) \neq y)\right]\), viewed as a random variable under \(\nu\), gives \begin{equation} \mathbb{P}\big( \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right] \geq 0.5 \big) \leq 4 \mathbb{E}_\nu\left[ \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right]^2\right]. \end{equation} Notice that when the majority vote makes an error, at least half (\(\rho\)-weighted) of the classifiers are wrong, that is, \begin{equation} \mathbb{I}(h_W(\bm{x} ; \rho) \neq y) \leq \mathbb{I}\big[\mathbb{E}_\rho[\mathbb{I}(h_W(\bm{x};\bm{\theta}) \neq y)] \geq 0.5\big]\,, \end{equation} which implies \begin{align} \mathbb{E}_\nu \left[ \mathbb{I} \left(h_W(\bm{x} ; \rho) \neq y\right) \right] &\leq \mathbb{E}_\nu \left[ \mathbb{I} \left( \mathbb{E}_\rho\left[\mathbb{I} \left(h_W(\bm{x};\bm{\theta}) \neq y \right)\right] \geq 0.5\right) \right] \\ & = \mathbb{P}\left( \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right] \geq 0.5 \right)\,. \end{align} Bounding the last term with the inequality derived above, \begin{equation} L_{0/1}(\rho) = \mathbb{E}_\nu \left[ \mathbb{I} \left(h_W(\bm{x} ; \rho) \neq y\right) \right] \leq 4 \mathbb{E}_\nu\left[ \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right]^2\right]. \end{equation} To conclude the proof, it remains to show that the right-hand side of the inequality is the desired upper bound on the \(0/1\) loss. 
Using that \(\operatorname{Var}(X) = \mathbb{E} \left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2\) and applying this identity to the indicator \(X=\mathbb{I}\big(h_W(\bm{x};\bm{\theta}) \neq y\big)\), which satisfies \(X^2 = X\), it follows that \begin{align} \mathbb{D}_{0/1}(\rho) &= \mathbb{E}_\nu\left[\mathbb{E}_\rho\left[(\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y) - \mathbb{E}_\rho\left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right])^2\right]\right]\\ &= \mathbb{E}_\nu\left[\mathbb{E}_\rho\left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right] -\mathbb{E}_\rho \left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]^2 \right]\\ &= \mathbb{E}_\rho\Big[\mathbb{E}_\nu\left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]\Big] -\mathbb{E}_\nu \left[\mathbb{E}_\rho \left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]^2 \right]. \end{align} This implies that \begin{align} 4\left(\mathbb{E}_\rho[L_{0/1}(\bm{\theta})] - \mathbb{D}_{0/1}(\rho)\right) &= 4\Big(\mathbb{E}_\rho\left[ \mathbb{E}_\nu \left[\mathbb{I} (h_W(\bm{x}; \bm{\theta}) \neq y)\right]\right] - \mathbb{D}_{0/1}(\rho)\Big)\\ &= 4\mathbb{E}_\nu \left[\mathbb{E}_\rho \left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]^2 \right]. \end{align} □
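The \(0/1\) argument above can be checked numerically on a finite sample. The following is a minimal sketch, assuming a uniform \(\rho\) over four models and synthetic error indicators drawn at random; the function name `zero_one_terms` and all data are hypothetical and only illustrate the inequality. It uses the half-wrong indicator, which upper-bounds the majority-vote error, in place of the majority-vote error itself.

```python
import random

def zero_one_terms(errors, weights):
    """errors[k][i] is 1 if model k misclassifies sample i, else 0.

    Returns the rho-weighted mean individual error, the 0/1 diversity, and the
    fraction of samples on which at least half of the rho-mass is wrong.
    """
    n = len(errors[0])
    mean_err = diversity = half_wrong = 0.0
    for i in range(n):
        # rho-weighted fraction of ensemble members that err on sample i
        p_wrong = sum(w * e[i] for w, e in zip(weights, errors))
        mean_err += p_wrong / n
        # variance of an indicator under rho: E[I] - E[I]^2
        diversity += (p_wrong - p_wrong ** 2) / n
        half_wrong += (1.0 if p_wrong >= 0.5 else 0.0) / n
    return mean_err, diversity, half_wrong

random.seed(0)
K, n = 4, 1000                      # hypothetical ensemble and sample sizes
weights = [1.0 / K] * K             # uniform rho (an assumption)
errors = [[int(random.random() < 0.3) for _ in range(n)] for _ in range(K)]

mean_err, diversity, half_wrong = zero_one_terms(errors, weights)
bound = 4.0 * (mean_err - diversity)
print(f"half-wrong rate {half_wrong:.3f} <= bound {bound:.3f}")
```

Because the inequality holds sample by sample (whenever at least half of the mass is wrong, four times the squared fraction is at least one), the check passes for any choice of indicators and weights, not only for this synthetic draw.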
The \(sq\)-version of Theorem 5.1 is equivalent to the well-known decomposition of the squared error of a regression ensemble (Krogh & Vedelsby, 1994). The \(0/1\)-version is novel and builds on the analysis of Masegosa et al. (2020), which in turn relies on second-order Markov inequalities. Lastly, the \(ce\)-version is equivalent to the one previously proposed by Masegosa (2020), based on second-order Jensen inequalities (Becker, 2012; Liao & Berg, 2019). More precisely, the following result shows how to obtain a tighter second-order Jensen bound for the cross-entropy error using the Jensen inequality stated by Liao & Berg (2019).
Theorem 5.2. Any distribution \(\rho\) over \(\bm{\Theta}\) satisfies the following inequality, \begin{equation} L_{ce}(\rho)\leq \mathbb{E}_{\rho}[L_{ce}(\bm{\theta})] - \mathbb{V}^T_{ce}(\rho), \end{equation} where \(\mathbb{V}^T_{ce}(\rho)\) is the normalized variance of \(P(y | \mathbf{x}, \bm{\theta})\) w.r.t. \(\rho(\bm{\theta})\), \begin{equation} \mathbb{V}^T_{ce}(\rho) := \mathbb{E}_{\nu}\Big[h(m,\mu)\mathbb{E}_{\rho}\Big[(P(y | \mathbf{x}, \bm{\theta}) - \mu)^2\Big]\Big], \end{equation} where \(\mu:=\mathbb{E}_{\rho}[P(y | \mathbf{x}, \bm{\theta})]\), \(m:= \max_{\bm{\theta}} P(y |\mathbf{x}, \bm{\theta})\) and \(h(m,\mu) := \frac{\ln \mu - \ln m }{(m - \mu)^2} + \frac{1}{\mu(m - \mu)}\).
Proof
Apply the result of Liao & Berg (2019) to the random variable \(p(y | \bm{x}, \bm{\theta})\), following the same strategy used in the proof of Theorem 5.1. □
Theorem 5.2 refines the previous decomposition result by providing a tighter upper bound for the cross-entropy loss based on a second-order version of Jensen’s inequality (Liao & Berg, 2019). While Theorem 5.1 establishes a general first-order relationship between the ensemble loss \(L(\rho)\), the mean individual loss \(\mathbb{E}_\rho[L(\bm{\theta})]\), and a variance-based diversity term \(\mathbb{D}(\rho)\), the present theorem introduces an additional correction factor that accounts for the local curvature of the logarithmic function underlying the cross-entropy loss.
In the earlier bound, the diversity term \(\mathbb{D}_{ce}(\rho)\) captures the average dispersion of the predictive probabilities \(P(y|\mathbf{x},\bm{\theta})\) across the ensemble members. However, this measure depends linearly on the variance and treats all deviations equally, regardless of how sensitive the logarithmic loss is to those deviations. In contrast, Theorem 5.2 introduces the term \(\mathbb{V}^T_{ce}(\rho)\), which incorporates a multiplicative weighting function \(h(m,\mu)\) that depends on both the mean probability \(\mu\) and its maximum value \(m\). This weighting modulates the contribution of the variance according to the curvature of the log function, yielding a tighter and more accurate approximation of the ensemble cross-entropy loss.
Intuitively, while Theorem 5.1 provides a general, interpretable decomposition valid for multiple loss types, Theorem 5.2 specializes this idea to the cross-entropy setting, enhancing precision by taking into account second-order information about how averaging probabilities affects the log-likelihood. The result is a sharper characterization of the ensemble’s improvement, where the benefit of diversity is not only proportional to disagreement among models but also weighted by how much those disagreements matter in terms of predictive confidence.
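To make the comparison between the two cross-entropy bounds concrete, the sketch below evaluates both weights on a single data point. The three predictive probabilities are hypothetical values chosen for illustration, \(\rho\) is assumed uniform, and the function names are not taken from any library. The weight \(h(m,\mu)\) is only defined for \(0 < \mu < m\), so the degenerate case in which all models agree is excluded.

```python
import math

def h(m, mu):
    """Weight h(m, mu) from Theorem 5.2; requires 0 < mu < m."""
    return (math.log(mu) - math.log(m)) / (m - mu) ** 2 + 1.0 / (mu * (m - mu))

def taylor_weight(m):
    """Weight 1 / (2 m^2) used in the ce-diversity term of Theorem 5.1."""
    return 1.0 / (2.0 * m ** 2)

# Hypothetical values of p(y | x, theta_k) for three models on one sample.
probs = [0.2, 0.5, 0.9]
rho = [1.0 / 3.0] * 3               # uniform rho (an assumption)

mu = sum(r * p for r, p in zip(rho, probs))
m = max(probs)
variance = sum(r * (p - mu) ** 2 for r, p in zip(rho, probs))

ensemble_nll = -math.log(mu)                                   # per-sample ensemble loss
mean_nll = sum(-r * math.log(p) for r, p in zip(rho, probs))   # mean individual loss

bound_thm52 = mean_nll - h(m, mu) * variance
bound_thm51 = mean_nll - taylor_weight(m) * variance
print(ensemble_nll, bound_thm52, bound_thm51)
```

On this sample both expressions upper-bound the ensemble loss, and the one weighted by \(h(m,\mu)\) is the smaller of the two, which matches the claim that Theorem 5.2 tightens the \(ce\)-version of Theorem 5.1.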
How to Measure the Diversity of an Ensemble?
In this thesis, the use of the diversity term \(\mathbb{D}(\rho)\) given in Theorem 5.1 is proposed as a diversity measure of an ensemble. This diversity measure satisfies some intuitive properties.
Lemma 5.3. The diversity terms \(\mathbb{D}(\rho)\) defined in Theorem 5.1 satisfy the following properties:
If all the ensemble members provide the same predictions, or if \(\rho\) puts all its probability mass on a single predictor, then \(\mathbb{D}(\rho)\) is null.
\(0\leq \mathbb{D}(\rho) \leq \mathbb{E}_\rho[L(\bm{\theta})]\).
\(\mathbb{D}(\rho)\) is invariant to reparametrizations.
Proof
Notice that if all models make the same prediction, \(h(\bm{x}; \bm{\theta}) = \mathbb{E}_\rho[h(\bm{x} ; \bm{\theta})]\) and \(p(y|\bm{x}, \bm{\theta}) = \mathbb{E}_\rho[p(y|\bm{x}, \bm{\theta})]\), which nullifies all diversity definitions from Theorem 5.1.
The lower bound holds because every diversity term is the expectation of a non-negative (weighted) variance. The upper bound follows from Theorem 5.1: since every considered loss is non-negative, \(0 \leq L(\rho) \leq \alpha\left(\mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho)\right)\) with \(\alpha > 0\), and hence \(\mathbb{D}(\rho) \leq \mathbb{E}_\rho[L(\bm{\theta})]\).
Let \(\phi: \bm{\Omega} \to \bm{\Theta}\) be an injective differentiable function with continuous partial derivatives and non-zero Jacobian at every point. All probability distributions considered here (ensembles) are finite mixtures of delta distributions and are therefore compactly supported, so the change-of-variables theorem can be applied to any continuous function \(f: \bm{\Theta} \to \mathbb{R}\): \begin{align} \mathbb{E}_\rho[f(\bm{\theta})] &= \int_{\bm{\Theta}} \rho (\bm{\theta}) f(\bm{\theta})\, d\bm{\theta} = \int_{\bm{\Omega}} \rho \circ \phi(\bm\omega) \ f \circ \phi(\bm\omega)\ |\det(D\phi)(\bm\omega)|\, d\bm\omega \\ &= \mathbb{E}_{\rho '}[f \circ \phi (\bm\omega)] \end{align} where \begin{equation} \rho '(\bm\omega) = \rho \circ \phi(\bm\omega)\ |\det(D\phi)(\bm\omega)|. \end{equation} The result follows from taking \(f(\bm{\theta}) = (\bm{\theta} - \mathbb{E}_\rho (\bm{\theta}))^2\).
□
Every point in Lemma 5.3 can be discussed from an empirical point of view:
It is clear that if all models make the same prediction or place the same probabilities, the empirical diversity is zero.
The same arguments used for the theoretical diversity apply to the empirical one, using an empirical version of Theorem 5.1 that reduces to taking the expectation over the empirical distribution at each step.
To show that invariance holds, the same procedure can be followed using the empirical expectation. In this case, since the empirical distribution is compactly supported, the change-of-variables theorem applies.
The above properties show that \(\mathbb{D}(\rho)\) can be considered as a measure of the diversity of an ensemble. The first property follows the intuition of diversity as a measure of the difference among ensemble members’ errors, while the second property is something that has been empirically found in the literature: the diversity of an ensemble usually decreases when the predictive error of individual members is reduced (Fort et al., 2019). The last property is a desirable result for any diversity measure.
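The first two properties of Lemma 5.3, together with the exactness of the \(sq\)-version of Theorem 5.1, can be illustrated on synthetic regression data. This is a minimal sketch, assuming the squared loss \((h(\bm{x};\bm{\theta}) - y)^2\), a uniform \(\rho\), and randomly generated targets and predictions; `sq_terms` is a hypothetical helper name.

```python
import random

def sq_terms(preds, targets, weights):
    """preds[k][i] is the prediction of model k on sample i.

    Returns (ensemble loss, rho-weighted mean individual loss, diversity),
    all averaged over the sample.
    """
    n = len(targets)
    ens_loss = mean_loss = diversity = 0.0
    for i, y in enumerate(targets):
        column = [p[i] for p in preds]
        mean_pred = sum(w * v for w, v in zip(weights, column))
        ens_loss += (mean_pred - y) ** 2 / n
        mean_loss += sum(w * (v - y) ** 2 for w, v in zip(weights, column)) / n
        # variance of the predictions under rho at this sample
        diversity += sum(w * (v - mean_pred) ** 2 for w, v in zip(weights, column)) / n
    return ens_loss, mean_loss, diversity

random.seed(1)
n, K = 200, 4                        # hypothetical sample and ensemble sizes
targets = [random.gauss(0.0, 1.0) for _ in range(n)]
preds = [[y + random.gauss(0.0, 0.5) for y in targets] for _ in range(K)]
weights = [1.0 / K] * K

ens_loss, mean_loss, diversity = sq_terms(preds, targets, weights)
# Degenerate ensemble: K copies of the same model.
same_ens, same_mean, same_div = sq_terms([preds[0]] * K, targets, weights)
print(ens_loss, mean_loss - diversity, same_div)
```

With distinct members the ensemble loss coincides with the mean individual loss minus the diversity up to rounding, and with identical members the diversity collapses to zero so that the ensemble offers no gain over a single model.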
Another widely held view is that diversity decreases when the predictors are highly correlated (Berend & Kontorovich, 2016; Brown, 2009; Yu et al., 2011; Zhou, 2012). The following result shows how the proposed diversity measure captures this relationship between diversity and correlation among predictors:
Theorem 5.4. The diversity terms \(\mathbb{D}(\rho)\) defined in Theorem 5.1 can be written as \begin{equation} \mathbb{D} (\rho) = \mathbb{V}_{\nu\times\rho}\big(f(y,\mathbf{x};\bm{\theta})\big)- \mathbb{E}_{\rho\times\rho}\big[\operatorname{Cov}_{\nu}(f(y,\mathbf{x};\bm{\theta}),f(y,\mathbf{x};\bm{\theta}'))\big], \end{equation} where \(\rho\times\rho\) denotes the joint distribution over \(\bm{\Theta} \times\bm{\Theta}\), \(\nu\times\rho\) denotes the joint distribution over \(({\cal X},{\cal Y}) \times \bm{\Theta}\), and \(\operatorname{Cov}_{\nu}(\cdot,\cdot)\) is the covariance between two models with respect to the data-generating distribution \(\nu\).
Proof
The result is shown as follows. Begin with the definition of variance: \begin{equation} \mathbb{D} (\rho) = \mathbb{E}_{\nu}\Big[ \mathbb{V}_{\rho}\big[f(y, \bm{x}; \bm{\theta})\big] \Big] = \mathbb{E}_\nu \Big[ \mathbb{E}_{\rho^2} \big[f(y, \bm{x}; \bm{\theta})^2 - f(y, \bm{x} ;\bm{\theta})f(y, \bm{x} ;\bm{\theta}')\big] \Big]\,. \end{equation} Split the expectation in two: \begin{equation} \mathbb{D} (\rho) = \mathbb{E}_{\nu}\Big[ \mathbb{V}_{\rho}\big[f(y, \bm{x}; \bm{\theta})\big] \Big] = \mathbb{E}_\nu \Big[ \mathbb{E}_{\rho} \big[f(y, \bm{x}; \bm{\theta})^2\big] - \mathbb{E}_{\rho^2} \big[f(y, \bm{x} ;\bm{\theta})f(y, \bm{x} ;\bm{\theta}')\big] \Big]\,. \end{equation} On the one hand, using the definition of variance again, \begin{equation} \mathbb{E}_\nu \Big[ \mathbb{E}_{\rho} \big[f(y, \bm{x}; \bm{\theta})^2\big] \Big] = \mathbb{V}_{\nu \times \rho}\big[ f(y, \bm{x}; \bm{\theta}) \big] + \mathbb{E}_{\nu \times \rho}\big[ f(y, \bm{x}; \bm{\theta}) \big]^2\,. \end{equation} On the other hand, the definition of covariance gives \begin{equation} \mathbb{E}_{\rho\times\rho}\Big[\operatorname{Cov}_{\nu}\big(f(y,\bm{x};\bm{\theta}),f(y,\bm{x};\bm{\theta}')\big)\Big] = \mathbb{E}_{\nu \times \rho^2} \big[f(y, \bm{x} ;\bm{\theta})f(y, \bm{x} ;\bm{\theta}')\big] - \mathbb{E}_{\nu \times \rho}\big[ f(y, \bm{x}; \bm{\theta}) \big]^2\,. \end{equation} Combining both identities completes the proof: \begin{equation} \mathbb{D} (\rho) = \mathbb{E}_{\nu}\Big[ \mathbb{V}_{\rho}\big[f(y, \bm{x}; \bm{\theta})\big] \Big] = \mathbb{V}_{\nu\times\rho}\big[f(y,\bm{x};\bm{\theta})\big] - \mathbb{E}_{\rho\times\rho}\Big[\operatorname{Cov}_{\nu}\big(f(y,\bm{x};\bm{\theta}),f(y,\bm{x};\bm{\theta}')\big)\Big]\,. \end{equation} □
This result states that the proposed diversity measure increases as the correlation among ensemble members decreases. In particular, a much higher diversity is obtained when ensemble members are anti-correlated (i.e., have negative covariance). However, Theorem 5.4 also introduces a novel insight: ensemble diversity is not only about the correlation among individual models. The above decomposition also shows, through the first term (\(\mathbb{V}_{\nu\times\rho}(f(y,\mathbf{x};\bm{\theta}))\)), that ensembles with high diversity should provide different predictions both across individual models and across data samples. Although this is outside the scope of this thesis, it could potentially be used to study why randomization approaches to building ensembles, such as random forests (Breiman, 2001), result in highly diverse ensembles, since they directly maximize this variance, which is positively related to diversity.
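The decomposition of Theorem 5.4 can be verified numerically when \(\nu\) is replaced by the uniform distribution over a finite sample. The sketch below assumes arbitrary synthetic values for \(f\) and a non-uniform \(\rho\); the helper names are hypothetical. Pairs \((\bm{\theta},\bm{\theta}')\) are drawn independently from \(\rho\), so the pairs with \(\bm{\theta}=\bm{\theta}'\) are included in the covariance average.

```python
import random

def mean(values):
    return sum(values) / len(values)

def diversity_direct(f, weights):
    """E_nu[Var_rho(f)] with nu uniform over the sample; f[k][i] is model k on sample i."""
    n = len(f[0])
    total = 0.0
    for i in range(n):
        m1 = sum(w * fk[i] for w, fk in zip(weights, f))
        m2 = sum(w * fk[i] ** 2 for w, fk in zip(weights, f))
        total += (m2 - m1 ** 2) / n
    return total

def diversity_decomposed(f, weights):
    """Returns (V_{nu x rho}(f), E_{rho x rho}[Cov_nu(f, f')])."""
    model_means = [mean(fk) for fk in f]
    grand_mean = sum(w * m for w, m in zip(weights, model_means))
    second_moment = sum(w * mean([v * v for v in fk]) for w, fk in zip(weights, f))
    total_var = second_moment - grand_mean ** 2
    mean_cov = 0.0
    for wk, fk, mk in zip(weights, f, model_means):
        for wl, fl, ml in zip(weights, f, model_means):
            cross = mean([a * b for a, b in zip(fk, fl)])
            mean_cov += wk * wl * (cross - mk * ml)
    return total_var, mean_cov

random.seed(2)
weights = [0.5, 0.3, 0.2]            # hypothetical non-uniform rho
f = [[random.random() for _ in range(500)] for _ in weights]

direct = diversity_direct(f, weights)
total_var, mean_cov = diversity_decomposed(f, weights)
print(direct, total_var - mean_cov)
```

The two printed values agree up to floating-point error, and lowering the pairwise covariances while keeping the total variance fixed raises the diversity, as the theorem states.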
How is Diversity Related to the Performance of an Ensemble?
The role that diversity plays in the performance of an ensemble is described by Theorem 5.1. According to this result, for a fixed average individual loss, the upper bound on the generalization error of an ensemble, \(L(\rho)\), decreases as \(\mathbb{D}(\rho)\) increases, where \(\mathbb{D}(\rho)\) measures the diversity of the ensemble as shown in the previous section. However, Theorem 5.1 provides additional novel insights.
The following result formalizes an empirically observed phenomenon (Dietterich, 2000; Lu et al., 2010) that higher ensemble diversity induces a higher gap between the average loss of the individual models (\(\mathbb{E}_\rho[L(\bm{\theta})]\)) and the expected loss of the ensemble (\(L(\rho)\)). In other words, the higher the diversity, the higher the advantage of combining these models. The proof is omitted as it is a direct consequence of Theorem 5.1.
Corollary 5.5. For any distribution \(\rho\) over \(\bm{\Theta}\), it holds that \begin{equation} \mathbb{D}(\rho) \leq \mathbb{E}_\rho[L(\bm{\theta})] - \tfrac{1}{\alpha}L(\rho), \end{equation} where \(\alpha\) is equal to \(1\) for the \(sq\)-loss or the \(ce\)-loss, and \(4\) for the \(0/1\)-loss. Furthermore, for the \(sq\)-loss, this inequality becomes an equality.
Another open question in the ensemble’s literature is under which situations an ensemble of models outperforms a single model. The following result establishes that this occurs when the ensemble’s diversity is large enough.
Corollary 5.6. For any distribution \(\rho\) over \(\bm{\Theta}\), an ensemble of models weighted according to \(\rho\) performs better than a single model \(\bm{\theta}^\star \in \bm{\Theta}\), that is, \(L(\rho)< L(\bm{\theta}^\star)\), if \begin{equation} \mathbb{E}_\rho[L(\bm{\theta})] - \tfrac{1}{\alpha} L(\bm{\theta}^\star) < \mathbb{D}(\rho), \end{equation} where \(\alpha\) is equal to \(1\) for the \(sq\)-loss or the \(ce\)-loss, and \(4\) for the \(0/1\)-loss. For the \(sq\)-loss, the converse implication also holds.
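The condition of Corollary 5.6 is a one-line test once the three quantities are available. The sketch below uses hypothetical numbers, chosen only to show how the constant \(\alpha\) changes the outcome; the function name is not part of any library.

```python
def ensemble_beats_single(mean_loss, diversity, single_loss, alpha):
    """Sufficient condition of Corollary 5.6: E_rho[L] - L(theta*) / alpha < D(rho)."""
    return mean_loss - single_loss / alpha < diversity

# Hypothetical values of E_rho[L(theta)], D(rho) and L(theta*).
mean_loss, diversity, best_single = 0.30, 0.08, 0.25

with_alpha_1 = ensemble_beats_single(mean_loss, diversity, best_single, alpha=1.0)
with_alpha_4 = ensemble_beats_single(mean_loss, diversity, best_single, alpha=4.0)
print(with_alpha_1, with_alpha_4)
```

For these numbers the condition is met with \(\alpha=1\) but not with \(\alpha=4\): the larger constant shrinks the credit given to the single model's loss, so the diversity required to certify an improvement becomes much larger.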
However, \(\mathbb{D}(\rho)\) is defined in terms of the unknown data-generating distribution \(\nu\). As a result, \(\mathbb{D}(\rho)\) cannot be computed. To address this issue, the use of the empirical version of \(\mathbb{D}(\rho)\), denoted by \(\hat{\mathbb{D}}(\rho,D)\), is proposed. This quantity depends directly on the empirical distribution defined by the data sample \(D\). \(\hat{\mathbb{D}}(\rho,D)\) satisfies the same properties as \(\mathbb{D}(\rho)\); more precisely, it satisfies every property in Lemma 5.3:
It is clear that if all models make the same prediction or place the same probabilities, the empirical diversity is zero.
To show this, the same arguments used for the theoretical diversity are applied to the empirical one, using an empirical version of Theorem 5.1 that reduces to taking the expectation over the empirical distribution at each step.
To show that invariance holds, the same procedure followed in the proof can be used again. In this case, since the empirical distribution is compactly supported, the change-of-variables theorem applies.
Moreover, \(\hat{\mathbb{D}}(\rho,D)\) quantifies the diversity of an ensemble on a given sample \(D\). A central question is how this empirical diversity measure relates to the ensemble’s generalization performance. To address this question, the analysis relies on PAC-Bayesian bounds.
PAC-Bayesian analysis (Langford & Shawe-Taylor, 2002; McAllester, 1998; Seeger, 2002; see also Section 2.4 for a self-contained introduction) provides Probably Approximately Correct (PAC) upper bounds on a model’s generalization error. The resulting PAC-Bayesian bounds depend on empirical quantities. Specifically, consider a family of bounds that upper-bound the generalization error of an ensemble, \(L(\rho)\), in terms of the \(\rho\)-weighted empirical errors of the individual models, \(\mathbb{E}_\rho[\hat{L}(\bm{\theta},D)]\), together with the empirical diversity of their predictions, \(\hat{\mathbb{D}}(\rho,D)\). These PAC-Bayesian bounds hold with high probability over random draws of the training sample.
Theorem 5.7. For any prior distribution \(\pi\) over \(\bm{\theta}\) independent of \(D\), any \(\xi\in (0,1)\), and any \(\lambda>0\), with probability at least \(1-\xi\) over draws of training data \(D\sim \nu^n\), for all distributions \(\rho\) over \(\bm{\theta}\), simultaneously, \begin{equation} L(\rho) \leq \alpha\left(\mathbb{E}_\rho[\hat L(\bm{\theta},D)] - \hat{\mathbb{D}}(\rho,D) + \frac{2\mathrm{KL}(\rho|\pi) + \epsilon(\nu,\pi,\lambda,n,\xi)}{\lambda\, n}\right) \end{equation}
where \(\alpha\) is equal to \(1\) for the \(sq\)-loss or the \(ce\)-loss, and \(4\) for the \(0/1\)-loss. Furthermore, \(\epsilon\) is a positive function that does not depend on \(\rho\) but does depend on the specific loss (the functional forms of these \(\epsilon\) terms can be found in the proof).
Proof
To prove this theorem, the right-hand side is shown to be an upper bound for \(\alpha(\mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho))\), which, together with Theorem 5.1, concludes the proof. First, consider the following tandem losses: \begin{align} L_{sq}(\bm{\theta}, \bm{\theta}') & = L_{sq}(\bm{\theta})-\mathbb{E}_\nu\Big[h_R(\bm{x};\bm{\theta})^2 - h_R(\bm{x};\bm{\theta})h_R(\bm{x};\bm{\theta}')\Big]\,,\\ L_{0/1}(\bm{\theta}, \bm{\theta}') & = L_{0/1}(\bm{\theta}) - \mathbb{E}_\nu\Big[\mathbb{I}(h(\bm{x};\bm{\theta})= y)\mathbb{I}(h(\bm{x};\bm{\theta}')\neq y)\Big]\,,\\ L_{ce}(\bm{\theta}, \bm{\theta}') & = L_{ce}(\bm{\theta})- \mathbb{E}_\nu\left[\frac{p(y|\bm{x},\bm{\theta})^2 - p(y|\bm{x},\bm{\theta})p(y|\bm{x},\bm{\theta}')}{\displaystyle 2\max_{\bm{\theta}\in \bm{\Theta}} p(y|\bm{x},\bm{\theta})^2}\right]\,, \end{align} which satisfy \begin{equation} \mathbb{E}_{\rho^2}[L(\bm{\theta}, \bm{\theta}')] = \mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho). \end{equation} This can be shown using the fact that the variance of the classifiers can be decomposed as \begin{align} \operatorname{Var}_{\rho}\bigl(f(\bm \theta)\bigr) &= \mathbb{E}_{\rho}\left[f(\bm \theta)^2\right] - \mathbb{E}_{\rho}\left[f(\bm \theta)\right]^2 = \mathbb{E}_{\rho}\left[f(\bm \theta)^2\right] - \mathbb{E}_{\rho^2}\left[f(\bm \theta) f(\bm \theta')\right] \\ &= \mathbb{E}_{\rho^2}\left[ f(\bm \theta)^2 - f(\bm \theta) f(\bm \theta') \right]\,, \end{align} where the function \(f\) is determined by the loss under consideration: for the squared loss, \(f(\bm \theta)=h_R(\mathbf x;\bm \theta)\); for the cross-entropy loss, \(f(\bm \theta)=p(y|\mathbf x,\bm \theta)\); and for the \(0/1\)-loss, \(f(\bm \theta)=\mathbb{I}\bigl(h(\mathbf x;\bm \theta)\neq y\bigr)\). For the \(0/1\)-loss, the identity additionally uses \(f^2 = f\) and the symmetry of \(\rho^2\) under exchanging \(\bm{\theta}\) and \(\bm{\theta}'\).
Applying Germain et al. (2016, Theorem 3) to the tandem loss functionals described above, with a product prior \(\pi(\bm{\theta}, \bm{\theta}') = \pi(\bm{\theta})\pi(\bm{\theta}')\) and a product posterior \(\rho(\bm{\theta}, \bm{\theta}') = \rho(\bm{\theta})\rho(\bm{\theta}')\), yields that for any \(\lambda > 0\) and \(\xi \in (0, 1)\), with probability at least \(1 - \xi\): \begin{equation} \mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho) \leq \mathbb{E}_{\rho(\bm{\theta}, \bm{\theta}')}[ \hat{L}(\bm{\theta}, \bm{\theta}', D)] + \frac{1}{\lambda n} \left[ \mathrm{KL}(\rho(\bm{\theta}, \bm{\theta}')|\pi(\bm{\theta}, \bm{\theta}')) + \epsilon(\nu,\pi,\lambda,n,\xi) \right], \end{equation} where \begin{equation} \epsilon(\nu,\pi,\lambda,n,\xi) := \log \mathbb{E}_{\pi(\bm{\theta}, \bm{\theta}')} \left[\mathbb{E}_\nu\left[\exp \left(\lambda\left(L(\bm{\theta}, \bm{\theta}') - \hat{L}(\bm{\theta}, \bm{\theta}', D)\right)\right)\right]\right] + \log \frac{1}{\xi}, \end{equation} and \(\mathrm{KL}(\rho(\bm{\theta}, \bm{\theta}')|\pi(\bm{\theta}, \bm{\theta}')) = 2\mathrm{KL}(\rho|\pi)\) because both distributions factorize. The empirical tandem losses satisfy \(\mathbb{E}_{\rho(\bm{\theta}, \bm{\theta}')}[\hat{L}(\bm{\theta}, \bm{\theta}', D)] = \mathbb{E}_\rho[\hat{L}(\bm{\theta}, D)] - \hat{\mathbb{D}}(\rho, D)\), by the same variance decomposition applied to the empirical distribution. Combining this with Theorem 5.1 gives \begin{equation} L(\rho) \leq \alpha\left( \mathbb{E}_\rho[\hat{L}(\bm{\theta}, D)] - \hat{\mathbb{D}}(\rho, D) + \frac{1}{\lambda n} \Big[ 2\operatorname{KL}(\rho|\pi) + \epsilon(\nu,\pi,\lambda,n,\xi) \Big]\right). \end{equation} □
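The key identity used in the proof, namely that the \(\rho^2\)-average of the tandem loss equals the mean individual loss minus the diversity, can be checked directly for the \(0/1\)-loss on a finite sample. This is a minimal sketch with synthetic error indicators and an arbitrary \(\rho\); the function name and all values are hypothetical.

```python
import random

def tandem_identity_terms(errors, weights):
    """errors[k][i] is 1 if model k misclassifies sample i, else 0.

    Returns (E_{rho^2}[tandem loss], E_rho[L] - D) for the 0/1-loss, with the
    data distribution replaced by the uniform distribution over the sample.
    """
    n = len(errors[0])
    # Left-hand side: average over independent pairs (theta, theta') of
    # L(theta, theta') = L(theta) - E[ I(h = y) I(h' != y) ].
    lhs = 0.0
    for wk, ek in zip(weights, errors):
        loss_k = sum(ek) / n
        for wl, el in zip(weights, errors):
            correction = sum((1 - a) * b for a, b in zip(ek, el)) / n
            lhs += wk * wl * (loss_k - correction)
    # Right-hand side: mean individual error minus the 0/1 diversity.
    mean_err = diversity = 0.0
    for i in range(n):
        p = sum(w * e[i] for w, e in zip(weights, errors))
        mean_err += p / n
        diversity += (p - p * p) / n
    return lhs, mean_err - diversity

random.seed(3)
weights = [0.4, 0.3, 0.2, 0.1]       # hypothetical rho
errors = [[int(random.random() < 0.25) for _ in range(400)] for _ in weights]

lhs, rhs = tandem_identity_terms(errors, weights)
print(lhs, rhs)
```

Since the tandem loss is an ordinary expected loss over pairs of models, a first-order PAC-Bayesian bound applied to it produces the second-order bound of the theorem, which is the design choice the proof relies on.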
The \(sq\) and \(0/1\) versions of the PAC-Bayesian bound of Theorem 5.7 are novel, whereas the \(ce\)-version was previously proposed by Masegosa (2020).
The preceding result provides a clear theoretical explanation for the widely observed phenomenon that accurate and diverse models lead to ensembles with low generalization error (Dietterich, 2000). Highly accurate models imply lower values of \(\mathbb{E}_\rho[\hat{L}(\bm{\theta},D)]\), while highly diverse models imply higher values of \(\hat{\mathbb{D}}(\rho,D)\). Consequently, the combined effect of both, as described by the second-order PAC-Bayesian bounds of Theorem 5.7, induces a lower upper bound on the generalization error of the ensemble.
How to Exploit Diversity to Learn Ensembles?
The PAC-Bayesian bounds of Theorem 5.7 provide a rigorous framework for ensemble learning. Because these bounds hold simultaneously for all distributions \(\rho\), it is possible to select the distribution that minimizes the resulting high-probability upper bounds on the ensemble’s generalization error (Langford & Shawe-Taylor, 2002; McAllester, 1998; Seeger, 2002). Section 5.2.5 details how this approach applies to ensembles of neural networks (i.e., how each distribution \(\rho\) specifies an ensemble of neural networks), following the ideas introduced by Masegosa (2020).
As discussed in Section 5.2.1, a growing number of ensemble-learning methods employ objectives that, in addition to favoring members with low error (small \(\mathbb{E}_{\rho}[\hat{L}(\bm{\theta},D)]\)), explicitly encourage diversity. The present analysis furnishes a theoretical justification for such approaches: promoting diversity can improve the ensemble’s generalization performance. Further elaboration is provided in Section 5.2.5.
The situation differs for ensembles of deep neural networks. In this setting, individual models are typically large and operate in the interpolation regime (Zhang et al., 2017); consequently, the term \(\mathbb{E}_\rho[\hat{L}(\bm{\theta},D)]\) appearing in Theorem 5.7 is nearly zero, if not exactly so. Moreover, by Lemma 5.3, the empirical diversity term \(\hat{\mathbb{D}}(\rho,D)\) also becomes negligible. As a result, algorithms that directly minimize the bound in Theorem 5.7 tend to be ineffective, since the diversity component \(\hat{\mathbb{D}}(\rho,D)\) exerts little influence on the learning dynamics. A formal statement of this issue is given in Section 5.2.5, and the next section provides empirical evidence.
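The squeeze behind this claim can be written out explicitly. Applying the second property of Lemma 5.3 to the empirical distribution, and assuming that the \(\rho\)-weighted training error is at most some small \(\eta \geq 0\) (a symbol introduced here only for illustration), \begin{equation} 0 \leq \hat{\mathbb{D}}(\rho,D) \leq \mathbb{E}_\rho[\hat{L}(\bm{\theta},D)] \leq \eta, \end{equation} so that \begin{equation} 0 \leq \mathbb{E}_\rho[\hat{L}(\bm{\theta},D)] - \hat{\mathbb{D}}(\rho,D) \leq \eta. \end{equation} In the interpolation limit \(\eta \to 0\), the two data-dependent terms in the bound of Theorem 5.7 therefore vanish together, and only the complexity term \(\frac{2\mathrm{KL}(\rho|\pi) + \epsilon(\nu,\pi,\lambda,n,\xi)}{\lambda\, n}\) remains.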
This observation appears contradictory: the analysis indicates that high diversity is crucial for generalization, yet deep ensembles exhibit near-zero empirical diversity alongside strong generalization. The apparent paradox is resolved by noting that low diversity on the training data (i.e., \(\hat{\mathbb{D}}(\rho,D)\approx 0\)) does not imply low diversity on the test distribution (i.e., \(\mathbb{D}(\rho) > 0\)), which is the relevant quantity for generalization according to Theorem 5.1.
Most of the current state-of-the-art deep ensemble learning algorithms follow this general scheme: they independently learn each neural network of the ensemble by minimizing the provided loss function (usually the \(ce\)-loss or the \(sq\)-loss) using some randomization method (e.g. random initialization of the parameters (Lakshminarayanan et al., 2017; Wen et al., 2019), or different hyper-parameters for the gradient descent algorithm (Wenzel et al., 2020)) in order to force the gradient descent algorithm to converge to different local minima of the loss function. Current state-of-the-art deep ensemble learning algorithms exploit the highly multi-modal landscape of the loss function (Fort et al., 2019) to achieve that.
In consequence, when the ensemble is composed of \(K\) models \(\{\bm{\theta}_1,\ldots,\bm{\theta}_K\}\) defining different predictive functions, then the expected diversity \(\mathbb{D}(\rho)\) will be positive, as stated in the following result:
Lemma 5.8. If there exist \(\bm{\theta}_i\neq \bm{\theta}_j\) and an input sample \(\mathbf{x}\in \operatorname{supp}(\nu)\), where \(\operatorname{supp}(\nu)\) denotes the support of the data-generating distribution, such that \(h(\mathbf{x};\bm{\theta}_i)\neq h(\mathbf{x};\bm{\theta}_j)\), then \(\mathbb{D}(\rho)>0\).
This analysis shows that ensembles of deep neural networks promote diversity by learning neural networks that induce different predictive functions. This is achieved, in general, by using randomization strategies that exploit the highly multi-modal landscape of the loss function of deep neural networks. The empirical evidence reported in previous work (Fort et al., 2019) aligns with these conclusions.
Experimental Evaluation
The experiments cover the three kinds of ensembles considered in this thesis. Note that regression ensembles are associated with the \(sq\)-loss, weighted majority vote ensembles with the \(0/1\)-loss, and model averaging ensembles with the \(ce\)-loss. These losses will be used to refer to the different ensembles (for example, the generalization error of a regression ensemble will be denoted by \(L_{sq}(\rho)\)).
The regression ensemble is evaluated on the Wine-Quality (Cortez et al., 2009) data set using a multilayer perceptron with one layer containing 50 hidden units and a dropout layer; this model is denoted as MLP50.
The majority vote and the model averaging ensembles are evaluated on two standard data sets, CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), using two networks: LeNet5 (LeCun et al., 1989) and ResNet20 (He et al., 2016). LeNet5 is chosen for its simplicity, meaning that it does not operate in the interpolation regime (i.e., its empirical error is not close to zero). On the other hand, ResNet20 operates in (or close to) the interpolation regime for the CIFAR-10 and the CIFAR-100 data sets. More complex networks could have been employed, but ResNet20 is powerful enough for the aim of this empirical evaluation.
All ensembles consist of four individual models. This number is not arbitrary: it is the default setting in the widely used neural network ensemble library Uncertainty Baselines.
Two ensemble learning algorithms are considered: ensemble (Lakshminarayanan et al., 2017), in which each randomly initialized model is optimized independently via gradient descent, and P2B-Ensemble (Masegosa, 2020), in which randomly initialized models are learned jointly by minimizing the PAC-Bayesian bound of Theorem 5.7. The former is a representative approach to neural-network ensembles based on randomization (Lakshminarayanan et al., 2017). In both cases, the \(ce\)-loss is used for CIFAR-10 and CIFAR-100, while the \(sq\)-loss is used for Wine-Quality. For majority-vote ensembles, the \(ce\)-loss is optimized in place of the non-differentiable \(0/1\)-loss, as \(ce\) provides a suitable surrogate. In all settings, error and diversity measures are computed post hoc, enabling an analysis of their relationship.
The test set is used to approximate the generalization error of the ensemble, \(L(\rho)\), the error of individual models, \(L(\bm{\theta})\), and the expected diversity of the ensemble, \(\mathbb{D}(\rho)\). For experiments employing the \(ce\)-loss, a tighter diversity measure introduced by Masegosa (2020) (Theorem 5.2) is adopted. Out-of-distribution benchmarks (Snoek et al., 2019) are not considered, as the objective is to empirically assess the theoretical analysis under the assumption that training and test data are drawn from the same distribution.
Each experiment is repeated five times with different random seeds. Reproducibility code is available at https://github.com/PGM-Lab/2022-AISTATS-diversity. The experimental evaluation is conducted on hardware with 8 TPU cores. The set of hyperparameters considered is reported in Table 5.1.
| Hyper-parameter | LeNet5 | ResNet20 | MLP50 |
|---|---|---|---|
| base learning rate | \(0.001\) | \(0.1\) | \(0.001\) |
| epochs | \(200\) | \(250\) | \(250\) |
| \(\ell_2\) regularization | \(2\cdot10^{-4}\) | \(2\cdot10^{-4}\) | \(2\cdot10^{-4}\) |
| learning rate decay | [\(60\),\(120\),\(160\)] | [\(60\),\(120\),\(160\)] | [\(60\),\(120\),\(160\)] |
| per core batch size | \(64\) | \(64\) | \(32\) |
Experimental Evaluation
The evaluation begins by assessing the tightness of the upper bounds provided by Theorem 5.1. Figure 5.1 reports the results. Each point corresponds to an ensemble specified by a distribution \(\rho_\delta\). The distance of a point to the identity line quantifies the slack of the bound in Theorem 5.1. For the \(sq\)-version, the bound is exact, as stated in the theorem. For the \(ce\)-version, the bound is also fairly tight. For the \(0/1\)-version, two variants are considered: the original with \(\alpha=4\), which is noticeably loose, and an illustrative variant with \(\alpha=1\) (not an upper bound) suggesting that bounds with constants closer to \(1\) may be attainable, noting that \(\alpha=4\) is a worst-case factor (Masegosa et al., 2020).
Section 5.2.3.3 argues that diversity is directly linked to the gains obtained by combining models. More precisely, as stated in Corollary 5.5, ensembles with larger diversity \(\mathbb{D}(\rho)\) should exhibit a larger gap between the average performance of individual models \(\mathbb{E}_\rho[L(\bm{\theta})]\) and the ensemble performance \(L(\rho)\). Figure 5.2 shows that higher diversity is consistently associated with a larger performance gap, i.e., \(\mathbb{E}_\rho[L(\bm{\theta})]-L(\rho)\), across all settings. The figure also indicates that P2B-Ensemble, which explicitly promotes diversity, consistently induces higher-diversity ensembles than the standard Ensemble algorithm.
Section 5.2.3.4 explains why the P2B-Ensemble algorithm—obtained by minimizing the PAC-Bayesian bound of Theorem 5.7—yields superior ensembles when the individual networks do not operate in the interpolation regime. The empirical results in Figure 5.3 corroborate this analysis.
The P2B-Ensemble algorithm effectively induces ensembles with higher diversity and better generalization performance than the Ensemble algorithm for MLP50 and LeNet5.
However, P2B-Ensemble on ResNet20 does not induce ensembles with much higher diversity than the Ensemble algorithm. Moreover, it performs similarly to or worse than the Ensemble algorithm.
Figure 5.4 shows that the empirical error and empirical diversity terms of the bound of Theorem 5.7 are much smaller for the ResNet20 ensemble than for the LeNet5 ensemble. This is consistent with ResNet20 operating in, or close to, the interpolation regime on both CIFAR-10 and CIFAR-100.
Experiments with different ensemble sizes
Figure 5.5 extends Figure 5.2 by considering multiple ensemble sizes. Consistent with Corollary 5.5, ensembles exhibiting larger diversity \(\mathbb{D}(\rho)\) also display a larger gap between the average performance of the individual models, \(\mathbb{E}_\rho[L(\bm{\theta})]\), and the performance of the ensemble, \(L(\rho)\). The figure further shows that, for LeNet5 and MLP50, Ensemble does not obtain a substantial increase in diversity when the ensemble size grows, whereas for ResNet20, Ensemble steadily increases diversity as the ensemble size increases. A plausible explanation is that random initialization is less effective at exploring distinct modes for simpler architectures than for more complex ones. By contrast, P2B-Ensemble consistently achieves higher diversity as the ensemble size increases.
Figure 5.6 reports the generalization performance of ensembles learned with Ensemble and P2B-Ensemble for ensemble sizes from two to five. For LeNet5, increasing the number of models degrades ensemble performance, in contrast to ResNet20 and MLP50, where performance improves with larger ensembles. A convincing explanation for this phenomenon is not currently available.
Evaluation of Corollary 5.6
Finally, evidence is provided regarding the usefulness of Corollary 5.6 for determining when an ensemble surpasses a single model. Figure 5.7 reports how often the condition of Corollary 5.6 is satisfied by ensembles trained with Ensemble, the standard ensemble-learning algorithm. By the corollary, ensembles that meet the condition outperform the best individual model. Points below the line correspond to ensembles that satisfy the condition. As shown for the \(ce\)-loss on CIFAR-10 and CIFAR-100, all Ensemble models meet the condition and outperform the best individual model, indicating that random initialization constitutes an effective mechanism for learning high-quality ensembles. By contrast, for the \(sq\)-loss on Wine-Quality with MLP50, Ensemble often fails to produce ensembles that generalize better than the best individual model. As illustrated in Figure 5.2, this shortfall arises because Ensemble yields ensembles with very low diversity.
For the \(0/1\)-loss, the same ensembles trained with the \(ce\)-loss are evaluated (recall that majority-vote ensembles are trained with \(ce\)). In this case, no model satisfies the condition of Corollary 5.6 under the bound with \(\alpha=4\). Nevertheless, the ensemble consistently outperforms the best single model empirically. The worst-case constant \(\alpha=4\) substantially weakens the bound and limits its applicability. Setting \(\alpha=1\), while acknowledging that this choice does not yield an upper bound, resolves the issue in the sense that all models then satisfy the corollary’s condition, suggesting the possibility of tighter constants closer to \(1\).
How to Exploit Diversity to Learn Ensembles
This section explains how to leverage diversity—guided by PAC-Bayesian analysis—to construct and optimize ensemble models.
Working with a Finite Parameter Space
The assumption of a finite parameter space is not restrictive in this setting. Let \(\bm{\Theta}=\{\bm{\theta}_1,\ldots,\bm{\theta}_K\}\) with potentially very large \(K\). In principle, \(\bm{\Theta}\) could contain all finite-precision vectors of dimension \(M\). In any case, the distribution \(\rho\) assigns positive probability only to those models that constitute the ensemble, so that only a sparse subset of \(\bm{\Theta}\) receives nonzero mass.
Working with a Continuous Parameter Space
The only point at which adopting a continuous parameter space affects the argumentation is in Lemma 5.3, specifically in the reparameterization-invariance property of the diversity measures. If \(\bm{\Theta}\) is a continuous, non-compact set such as \(\mathbb{R}^M\) for some \(M\in\mathbb{N}\), the change-of-variables theorem cannot be invoked directly. To circumvent this issue, consider the compact truncation: \begin{equation} \bm{\Theta}=\bigl\{\bm{\theta}\in\mathbb{R}^M:\ \|\bm{\theta}\|_2\le N\bigr\}\subset\mathbb{R}^M, \end{equation} where \(N\) denotes the largest \(\ell_2\) norm attainable under the chosen finite-precision representation. This restriction renders the requisite transformations valid while leaving the set of implementable models unchanged in practice.
Mixture of Multivariate Gaussian Distributions Approximation
As detailed in Section 5.2.3.4, consider a uniformly weighted Gaussian mixture, denoted by \(\rho_\delta\), to represent an ensemble of \(K\) models. The distribution over parameters, given a fixed set of means \((\bm{\theta}_1,\dots,\bm{\theta}_K)\), is \begin{equation} \rho_\delta(\bm{\theta}) = \frac{1}{K}\sum_{k=1}^{K} \mathcal{N}\bigl(\bm{\theta};\, \bm{\theta}_k,\, \epsilon I\bigr). \end{equation} Under this distribution, the expected empirical loss is \begin{align} \mathbb{E}_{\rho_{\delta}}[\hat{L}(\bm{\theta},D)] &= \int \rho_\delta(\bm{\theta})\, \hat{L}(\bm{\theta},D)\, d\bm{\theta} \\ &= \frac{1}{K}\sum_{k=1}^{K} \int \mathcal{N}\bigl(\bm{\theta};\, \bm{\theta}_k,\, \epsilon I\bigr)\, \hat{L}(\bm{\theta},D)\, d\bm{\theta} \\ &= \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}_{\mathcal{N}(\bm{\theta}_k,\epsilon I)}\bigl[\hat{L}(\bm{\theta},D)\bigr]. \end{align} When \(\epsilon\) is sufficiently small and \(\hat{L}\) is smooth, the expectation under the sharply concentrated Gaussian can be approximated by the loss evaluated at the mean. More precisely, for any smooth function \(f\), \begin{equation} \label{eq:approximateGaussian} \int_{\bm{\theta}} \mathcal{N}(\bm{\theta}; \bm{\theta}_k, \epsilon I) f(\bm{\theta})\, d\bm{\theta} \approx f(\bm{\theta}_k) \quad \forall k =1,\dots,K. \end{equation} Using this approximation, the expected value of the loss becomes \begin{equation} \mathbb{E}_{\rho_\delta}[\hat{L}(\bm{\theta},D)] = \frac{1}{K}\sum_{k=1}^{K} \int_{\bm{\theta}} \mathcal{N}(\bm{\theta}; \ \bm{\theta}_k, \epsilon I) \hat{L}(\bm{\theta},D)\, d\bm{\theta} \approx \frac{1}{K}\sum_{k=1}^K\hat{L}(\bm{\theta}_k,D). \end{equation} The same argument applies to the KL regularizer.
Using a non-overlapping approximation for the Gaussian mixture components, \begin{equation} \label{eq:approximateGaussianKL} \sum_{i=1}^{K} \mathcal{N}\bigl(\bm{\theta}_k;\,\bm{\theta}_i,\,\epsilon I\bigr) \;\approx\; \mathcal{N}\bigl(\bm{\theta}_k;\,\bm{\theta}_k,\,\epsilon I\bigr) = \frac{1}{\bigl(2\pi\bigr)^{M/2}\epsilon^{M/2}}, \quad \forall\,k\in\{1,\dots,K\}, \end{equation} where \(M\) is the dimensionality of \(\bm{\theta}\). The rationale is that the means \((\bm{\theta}_1,\dots,\bm{\theta}_K)\) are sufficiently separated so that, for any distinct pair \((\bm{\theta}_i,\bm{\theta}_k)\), evaluating the Gaussian centered at \(\bm{\theta}_i\) with covariance \(\epsilon I\) at \(\bm{\theta}_k\) is negligible. This assumption is mild because \(\epsilon\) can be chosen small; for example, enforcing \(\min_{i\ne k}\|\bm{\theta}_i-\bm{\theta}_k\|_2 \ge c\sqrt{\epsilon}\) with a constant \(c\) (e.g., \(c=3\)) renders component overlap negligible (under the “\(3\sigma\)” heuristic, over \(99.7\%\) of the mass lies within \(3\sqrt{\epsilon}\) along any coordinate).
As a result, the following approximation arises for the regularization term: \begin{align} \operatorname{KL}(\rho_\delta|\pi) &= \int_{\bm{\theta}} \rho_\delta(\bm{\theta}) \log \frac{\rho_\delta(\bm{\theta})}{\pi(\bm{\theta})}\, d\bm{\theta} \\ &= \frac{1}{K}\sum_{k=1}^{K} \int_{\bm{\theta}} \mathcal{N}(\bm{\theta}; \ \bm{\theta}_k, \epsilon I) \left(\log \left(\frac{1}{K}\sum_{i=1}^{K} \mathcal{N}(\bm{\theta}; \ \bm{\theta}_i, \epsilon I)\right) - \log \pi(\bm{\theta})\right) d\bm{\theta} \\ \text{(by Equation \eqref{eq:approximateGaussian})}&\approx \frac{1}{K}\sum_{k=1}^{K}\left(\log \left(\frac{1}{K}\sum_{i=1}^{K} \mathcal{N}(\bm{\theta}_k; \ \bm{\theta}_i, \epsilon I)\right) - \log \pi(\bm{\theta}_k)\right)\\ \text{(by Equation \eqref{eq:approximateGaussianKL})}&\approx \frac{1}{K}\sum_{k=1}^{K}\left(\log \left(\frac{1}{K}\mathcal{N}(\bm{\theta}_k; \ \bm{\theta}_k, \epsilon I)\right) - \log \pi(\bm{\theta}_k)\right)\\ &= -\frac{1}{K} \sum_{k=1}^K \log \pi(\bm{\theta}_k) + \frac{1}{K}\sum_{k=1}^K\log \left(\frac{1}{K} \frac{1}{\sqrt{(2\pi)^M \epsilon^M}}\right)\,. \end{align} The same approximation given by Equation \(\eqref{eq:approximateGaussian}\) can be applied to the general variance formula: \begin{align} \hat{\mathbb{V}}_{\rho_{\delta}}(f(\bm{\theta})) &= \mathbb{E}_{\rho_\delta^2}\Big[f(\bm{\theta})^2 - f(\bm{\theta})f(\bm{\theta}')\Big] = \mathbb{E}_{\rho_\delta}\Big[f(\bm{\theta})^2\Big] - \mathbb{E}_{\rho_\delta^2}\Big[f(\bm{\theta})f(\bm{\theta}')\Big] \\ &\approx \frac{1}{K}\sum_{k=1}^K f(\bm{\theta}_k)^2 - \frac{1}{K^2}\sum_{i=1}^K \sum_{j=1}^K f(\bm{\theta}_i) f(\bm{\theta}_j). \end{align} Given this, each of the diversity terms defined in Theorem 5.1 is easy to approximate, where \(f\) is determined by the loss function under consideration: for the \(sq\)-loss, \(f(\bm{\theta})= h_R(\mathbf{x};\bm{\theta})\); for the \(ce\)-loss, \(f(\bm{\theta})= p(y | \mathbf{x},\bm{\theta})\); and for the \(0/1\)-loss, \(f(\bm{\theta}) = \mathbb{I}(h(\mathbf{x};\bm{\theta})\neq y)\).
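The two point-mass approximations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prior is taken to be an isotropic Gaussian \(\mathcal{N}(0,\sigma^2 I)\), which the text does not specify, and the function names (`log_gaussian_prior`, `approx_kl`, `ensemble_variance`) as well as the numerical values are hypothetical.

```python
import math


def log_gaussian_prior(theta, sigma2):
    """Log-density of N(0, sigma2 * I) at theta (the prior form is an assumption)."""
    m = len(theta)
    sq_norm = sum(t * t for t in theta)
    return -0.5 * m * math.log(2.0 * math.pi * sigma2) - sq_norm / (2.0 * sigma2)


def approx_kl(thetas, eps, sigma2):
    """Approximate KL(rho_delta | pi) for K well-separated means, as derived above."""
    k = len(thetas)
    m = len(thetas[0])
    avg_log_prior = sum(log_gaussian_prior(t, sigma2) for t in thetas) / k
    # log((1/K) * N(theta_k; theta_k, eps * I)) takes the same value for every k.
    log_component = -math.log(k) - 0.5 * m * math.log(2.0 * math.pi * eps)
    return -avg_log_prior + log_component


def ensemble_variance(values):
    """(1/K) sum_k f_k^2 - (1/K^2) sum_i sum_j f_i f_j for member outputs f_k."""
    k = len(values)
    mean_sq = sum(v * v for v in values) / k
    cross = sum(vi * vj for vi in values for vj in values) / (k * k)
    return mean_sq - cross


# Two hypothetical ensemble members in a two-dimensional parameter space.
thetas = [[0.0, 0.0], [1.0, 1.0]]
print(approx_kl(thetas, eps=0.01, sigma2=1.0))
# Outputs f(theta_k) of three hypothetical members at a single input x.
print(ensemble_variance([1.0, 2.0, 3.0]))
```

Under the Gaussian-prior assumption, the only part of `approx_kl` that depends on the means is the average squared norm of the \(\bm{\theta}_k\), so the regularizer acts as an \(\ell_2\) penalty on each member; the remaining terms are constants for fixed \(K\), \(M\) and \(\epsilon\).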
Ensemble Learning Algorithms Which Explicitly Promote Diversity
Negative Correlation Learning (Liu & Yao, 1999).
This method minimizes \begin{equation} \mathbb{E}_\rho[\hat L_{sq}(\bm{\theta},D)] \;+\; \lambda\,\mathrm{NC}(\rho,D), \end{equation} where \(\lambda\in[0,1]\), \(\mathbb{E}_D[\cdot]\) denotes expectation with respect to the empirical data distribution, and \begin{equation} \mathrm{NC}(\rho,D) := \mathbb{E}_D\!\left[\frac{1}{K}\sum_{k=1}^K \bigl(h_R(\mathbf{x};\bm{\theta}_k)-h_R(\mathbf{x};\rho)\bigr)\sum_{j\neq k}\bigl(h_R(\mathbf{x};\bm{\theta}_j)+h_R(\mathbf{x};\rho)\bigr)\right] \end{equation} is the empirical negative correlation term that explicitly promotes diversity. By straightforward algebra, \begin{equation} \mathrm{NC}(\rho,D) \;=\; -\,\hat{\mathbb{D}}_{sq}(\rho,D). \end{equation} Hence, the objective becomes \(\mathbb{E}_\rho[\hat L_{sq}(\bm{\theta},D)] - \lambda\,\hat{\mathbb{D}}_{sq}(\rho,D)\), so that minimizing it rewards larger diversity, and it coincides with the learning objective derived from the PAC-Bayesian analysis for the squared loss. This connection yields a PAC-Bayesian interpretation of the negative correlation ensemble learning algorithm (Liu & Yao, 1999).
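The identity \(\mathrm{NC}(\rho,D) = -\hat{\mathbb{D}}_{sq}(\rho,D)\) can be checked numerically. The sketch below assumes that \(h_R(\mathbf{x};\rho)\) is the uniform average of the member predictions and that \(\hat{\mathbb{D}}_{sq}\) is the variance of those predictions averaged over the data; the function names and the toy predictions are hypothetical.

```python
def negative_correlation(preds):
    """NC term as written above; preds[n][k] is the prediction of member k on point n."""
    total = 0.0
    for row in preds:
        k = len(row)
        mean = sum(row) / k  # ensemble prediction h_R(x; rho), assumed to be the average
        acc = 0.0
        for i, f_i in enumerate(row):
            others = sum(f_j + mean for j, f_j in enumerate(row) if j != i)
            acc += (f_i - mean) * others
        total += acc / k
    return total / len(preds)


def prediction_variance(preds):
    """Variance of member predictions around the ensemble mean, averaged over the data."""
    total = 0.0
    for row in preds:
        k = len(row)
        mean = sum(row) / k
        total += sum((f - mean) ** 2 for f in row) / k
    return total / len(preds)


# Three hypothetical members evaluated on two data points.
preds = [[1.0, 3.0, 5.0], [2.0, 2.0, 8.0]]
print(negative_correlation(preds), -prediction_variance(preds))
```

The two printed values agree because the deviations \(h_R(\mathbf{x};\bm{\theta}_k)-h_R(\mathbf{x};\rho)\) sum to zero over \(k\), so any term of the inner sum that does not depend on \(k\) cancels.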
Generalized Ambiguity Decomposition (Jiang et al., 2017).
This approach extends the decomposition of Krogh & Vedelsby (1994) to more general loss functions, expressing ensemble risk in terms of individual-model errors and an ensemble “diversity” component. Unlike the present PAC-Bayesian treatment, the resulting decomposition is not derived from upper bounds and coincides with the formulation developed here only for the squared loss. Moreover, it does not relate the ensemble’s generalization error to empirical diversity, does not address multiclass classification, and the decomposition for the \(0/1\)-loss does not include a diversity term. Weighted majority votes are likewise not considered.
Generalized Negative Correlation Learning (Buschjäger et al., 2020).
Building on Jiang et al. (2017), this approach proposes a learning objective that incorporates a diversity term, although the specific form of the term differs from those introduced here. The analysis does not treat the 0-1 loss or weighted majority votes. Ultimately, the authors recommend the alternative objective: \begin{equation} \lambda\,\hat{L}(\rho,D) + (1-\lambda)\,\mathbb{E}_\rho\!\left[\hat{L}(\bm{\theta},D)\right], \qquad \lambda\in[0,1], \end{equation} which trades off the empirical ensemble loss against the average empirical loss of the constituent models without explicitly retaining a diversity regularizer.
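For the squared loss, the objective above can be evaluated directly. The sketch below is an illustration, not the authors' implementation: it assumes that \(\hat{L}(\rho,D)\) is the loss of the uniformly averaged prediction, and the function names and toy data are hypothetical. In that special case the ambiguity decomposition of Krogh & Vedelsby (1994) implies that the objective equals the average member loss minus \(\lambda\) times the variance of the predictions, which the code checks numerically.

```python
def gncl_objective(preds, targets, lam):
    """lam * L_hat(rho, D) + (1 - lam) * E_rho[L_hat(theta, D)] for the squared loss."""
    n = len(preds)
    ensemble_loss = 0.0
    average_loss = 0.0
    for row, y in zip(preds, targets):
        k = len(row)
        mean = sum(row) / k  # averaged ensemble prediction (assumption)
        ensemble_loss += (mean - y) ** 2
        average_loss += sum((f - y) ** 2 for f in row) / k
    return lam * ensemble_loss / n + (1.0 - lam) * average_loss / n


def average_loss_minus_diversity(preds, targets, lam):
    """E_rho[L_hat] - lam * variance of predictions, averaged over the data."""
    n = len(preds)
    total = 0.0
    for row, y in zip(preds, targets):
        k = len(row)
        mean = sum(row) / k
        avg_loss = sum((f - y) ** 2 for f in row) / k
        variance = sum((f - mean) ** 2 for f in row) / k
        total += avg_loss - lam * variance
    return total / n


# Two hypothetical members on two data points.
preds = [[1.0, 3.0], [0.0, 2.0]]
targets = [2.0, 0.0]
print(gncl_objective(preds, targets, 0.3))
print(average_loss_minus_diversity(preds, targets, 0.3))
```

For other losses the two expressions need not coincide, which is consistent with the remark that the diversity term used in that line of work differs from the ones introduced here.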
Standard Learning Algorithms Do Not Explicitly Promote Diversity
Standard ensemble learning algorithms can be interpreted as methods that minimize the following objective function: \begin{equation} \mathbb{E}_{\rho_\delta}[\hat{L}(\bm{\theta},D)] + \frac{\operatorname{KL}(\rho_\delta|\pi)}{\lambda n}, \end{equation} where either the \(ce\)-loss or the \(sq\)-loss is employed.
This learning objective does not include the diversity term \(\hat{\mathbb{D}}(\rho,D)\) and therefore does not encourage diversity. In fact, under the approximations discussed in Section 5 and discarding constant terms, this learning objective can be expressed as \begin{equation} \frac{1}{K}\sum_{k=1}^{K} \left(\hat{L}(\bm{\theta}_k,D) - \frac{\ln \pi(\bm{\theta}_k)}{\lambda n}\right), \end{equation} where each \(\bm{\theta}_k\) can be learned independently from the rest due to the absence of the variance-based diversity term \(\hat{\mathbb{V}}(\rho,D)\).
Discussion
Despite the widely recognized importance of diversity for ensemble performance, prior work has not provided a theoretically grounded, broadly applicable account of the relationship between diversity and generalization across a wide range of ensemble models. The present study contributes toward filling this gap.
Theorem 5.1 demonstrates that an upper-bound-based decomposition of ensemble error furnishes a general and practical approach applicable to multiple ensemble formulations. In particular, the analysis clarifies how correlations among predictors influence ensemble diversity. This decomposition also enables a PAC-Bayesian treatment that establishes a direct link between an ensemble’s empirical diversity and its generalization performance, and it sheds light on why ensembles of deep neural networks can effectively promote diversity through randomization strategies.
Although several ingredients of the analysis are known in the literature, the main contribution lies in unifying these elements into a coherent framework for reasoning about diversity and generalization in neural network ensembles.
The theoretical development has limitations that suggest avenues for further research. First, the upper bounds in Theorem 5.1 may admit tighter variants; for example, C-bounds (Germain et al., 2015) are known to improve upon the \(0/1\)-bound (see (Masegosa et al., 2020) for discussion). A similar observation applies to the PAC-Bayesian bounds used in Theorem 5.7, which inherit looseness from the underlying decomposition. Moreover, the analysis focuses on a general form of PAC-Bayesian bounds; more specialized and tighter alternatives could be adopted, as exemplified by Masegosa et al. (2020).
Generalization Error and Chernoff Bounds
In modern machine learning, model classes have such large capacity that optimizers virtually always retrieve a model interpolating the training data; that is, with null training error (Zhang et al., 2017). These models are called interpolators and it is well known that there are many within a large model class (Livni et al., 2014) and that some of them have a remarkably small generalization error (difference between “training error” and “expected or test error”) while others do not (Feldman & Zhang, 2020).
The machine learning community has made a great effort in recent years to understand why modern learning algorithms retrieve, most of the time, interpolators with a small generalization error (Bartlett et al., 2020; Nagarajan & Kolter, 2019b). Most of these attempts focus on providing generalization bounds. These bounds provide an upper limit on the expected error, tying it to quantities related to the training dataset and the model produced by a learning algorithm. Such bounds typically take the following form, \begin{equation} \label{eq:generalbounds} L({\cal A}(D)) \leq \hat{L}({\cal A}(D),D) + {\cal C}({\cal A}(D),D,\delta)\,, \end{equation} where \(D\) represents the training dataset, drawn independently and identically distributed (i.i.d.) from a data-generating distribution \(\nu\); \(\mathcal{A}\) denotes the learning algorithm, which generates a hypothesis (a model) from a given dataset (for example, via the gradient descent method); \(L\) and \(\hat{L}\) indicate the expected and empirical errors of a hypothesis, respectively; \(\mathcal{C}\) denotes a complexity measure; and the inequality holds with high probability, at least \(1-\delta\), over draws of datasets from the distribution \(\nu\). A detailed introduction to classical PAC and PAC–Bayesian bounds of this type is given in Section 2.4.
Examples of bounds falling within this scheme include Vapnik–Chervonenkis (VC) bounds (Bartlett et al., 2019; Vapnik & Chervonenkis, 2015), Rademacher bounds (Mohri et al., 2018), norm- and margin-based bounds (Nagarajan & Kolter, 2017; Neyshabur et al., 2015a), and sharpness-based measures (Keskar et al., 2017; Nagarajan & Kolter, 2019a). Even PAC-Bayes bounds with a fixed prior (McAllester, 1999) can be included here, by considering algorithms that retrieve a distribution over the models.
However, many recent works increasingly suggest that these bounds are probably vacuous in the over-parameterized regime and, in consequence, unable to explain generalization in this setting. Zhang et al. (2017) were the first to provide convincing empirical evidence of this in the context of deep neural networks. Nagarajan & Kolter (2019b) later provided a series of experiments and theoretical results in the same direction. These works also show that the bounds remain vacuous even when the complexity measure \(\mathcal{C}\) depends directly on the algorithm \(\mathcal{A}\) rather than simply on its output \(\mathcal{A}(D)\) (for example, assuming the algorithm always retrieves models whose parameter norm is lower than a given constant). Recently, Wang et al. (2024) showed that any near-interpolator exhibits rapid parameter norm growth, which implies that existing data-dependent parameter-norm-based bounds are necessarily loose, concluding that “explaining the generalization capability of near-interpolators will require new tools”. More generally, Gastpar et al. (2024) showed that, under some conditions resembling over-parameterization, no generalization bound of the form of Equation \(\eqref{eq:generalbounds}\) is tight for all data-generating distributions, and that algorithm-dependent bounds are also provably vacuous. For all these reasons, the emerging conclusion is that bounds that depend solely on the training data are provably vacuous for over-parameterized model classes and are unable to explain generalization.
The aforementioned works (Gastpar et al., 2024; Nagarajan, 2021; Nagarajan & Kolter, 2019b) directly or indirectly advocate exploring alternative bounds that do not depend solely on the training data but also use information from the data-generating distribution. These bounds are usually referred to as distribution-dependent bounds, in which the complexity term \({\cal C}\) explicitly depends on the data-generating distribution \(\nu\). Distribution-dependent bounds have been used before in different contexts (Catoni, 2007; Zhang, 2006); however, to the best of current knowledge, none of these bounds are known to be tight or, at least, non-vacuous for over-parameterized interpolators. Even more importantly, these kinds of bounds have not been used before to analyze the generalization of over-parameterized interpolators.
Contribution
In this thesis, the use of distribution-dependent bounds is studied within a finite hypothesis space. The goal is to understand how modern learning techniques manage to produce models that both interpolate the data and have a small generalization error. More precisely, Theorem 5.17 introduces a distribution-dependent PAC-Chernoff bound that is perfectly tight for any algorithm retrieving models interpolating the training data, even when the model class is over-parameterized. This bound has the following standard form: \begin{equation} L({\cal A}(D)) \leq \hat{L}({\cal A}(D),D) + {\cal C}({\cal A}(D),n,\nu,\delta)\,, \end{equation} where, now, the complexity term \({\cal C}\) also depends on the data-generating distribution \(\nu\). Proposition 5.20 shows that with high probability \(1 - \delta\), \begin{equation} {\cal C}({\cal A}(D),n,\nu,\delta) \leq L({\cal A}(D)) \leq \hat{L}({\cal A}(D),D) + {\cal C}({\cal A}(D),n,\nu,\delta)\,, \end{equation} which ensures that the proposed bound is perfectly tight for interpolators, that is to say, for models whose empirical loss is null or small enough to be considered negligible or within an acceptable error margin, denoted as \(\hat{L}({\cal A}(D),D)\leq \epsilon\).
As shown in this thesis, these bounds are directly connected to Large Deviation Theory (LDT) (Ellis, 2012) because their complexity measure \({\cal C}({\cal A}(D),n,\nu,\delta)\) directly depends on the so-called rate function (also known as the Cramér-Chernoff function), which is the central element of LDT. The rate function is used to present a new characterization of the smoothness of a model using distribution-dependent measures. According to Theorem 5.23, this approach enables a precise characterization of which interpolators better generalize, addressing an outstanding open question in machine learning.
Despite being based on oracle quantities, the theoretical framework built around this rate-function-based complexity measure provides a unified explanation of a wide range of learning techniques used in modern machine learning, namely \(\ell_2\)-norm, distance from initialization, and input-gradient regularization (Section 5.3.6), invariant architectures and data augmentation (Section 5.3.7), and over-parameterization (Section 5.3.8). Under this framework, this thesis shows why each of these learning techniques produces interpolators that generalize well and why and when they complement each other (for example, why \(\ell_2\)-norm regularization is also effective in combination with the use of invariant architectures).
The thesis does not claim that the proposed theoretical analyses of each of these learning techniques are better than existing ones at the individual level. For example, it is not claimed that this explanation of why \(\ell_2\)-norm regularization promotes generalization is better than existing ones, such as those developed in the context of linear regression (Bartlett et al., 2020). However, it is claimed that using the rate function and PAC-Chernoff bounds, which are distribution-dependent quantities, it is possible to jointly explain, up to some degree, the generalization of interpolators, the role of over-parameterization, and why many widely used modern learning techniques find interpolators that generalize. To the best of current knowledge, this is the first theoretical framework that yields insights covering such a wide range of learning techniques. At the same time, this work shows that distribution-dependent bounds are a promising research direction for understanding the generalization of modern machine learning methods.
Related Work
In the context of uniform convergence bounds, many efforts have been made towards obtaining tighter generalization bounds (Bartlett et al., 2017; Golowich et al., 2018; Kawaguchi et al., 2022; Liang et al., 2019; Neyshabur et al., 2017a), by, for example, only considering the models effectively visited by the optimization algorithm. However, it is clear that these bounds are not actually meaningful (see Nagarajan & Kolter (2019b) for further references and discussions). The PAC-Bayes framework has also been adapted to this new paradigm by exploiting properties of the models induced by the training dataset (such as low spectral norm (Neyshabur et al., 2017b), noise stability (Arora et al., 2018b), de-randomization (Negrea et al., 2020), and compression (Arora et al., 2018b; Zhou et al., 2019)), but none of these bounds have been shown to be empirically tight for over-parameterized model classes and, for example, they are unable to describe the interplay between invariant architectures and over-parameterization. In fact, very recently, Gastpar et al. (2024) proved that, for over-parameterized model classes, none of these bounds are tight for all data-generating distributions.
This thesis mainly differs from these related works in that an oracle complexity measure is used; that is, PAC-Chernoff bounds and the rate function assume access to the data-generating distribution. Consequently, it is possible to tightly bound the generalization error of an interpolator, as elaborated in Section 5.3.4. This capability stems from results such as Theorem 5.23, which hold even under unbounded losses. Furthermore, Theorem 5.23 and the PAC-Chernoff bound of Theorem 5.17 make it possible to leverage architectural invariances with respect to transformations inherent in the data-generating distribution.
Many other recent theoretical studies try to understand specific learning techniques in deep learning in an isolated and independent way to simplify the problem (e.g., (Arora et al., 2016; Arora et al., 2018a; Bubeck & Sellke, 2023; Chen et al., 2020; Gilbert et al., 2017; Gunasekar et al., 2018; Hardt & Ma, 2017; Patel et al., 2016; Poole et al., 2016; Schoenholz et al., 2017; Soudry et al., 2018; Tishby & Zaslavsky, 2015; Vidal et al., 2020)). As a result of such an isolated approach, they are not fully able, most of the time, to establish links among different learning techniques, as done in this work. Furthermore, relying on the insights provided by the field of geometric deep learning (Bronstein et al., 2021), this thesis provides a theoretical explanation, from a statistical inference point of view, of why such invariances induce smoother model classes and better generalization.
Preliminaries
To fix notation and assumptions, let \(D=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^n\) denote a training dataset of size \(n\ge 1\) drawn i.i.d. from an unknown distribution \(\nu(\mathbf{y},\mathbf{x})\). Consider a model class parameterized by \(\bm{\theta}\in\bm{\Theta}\). For any \(\bm{\theta}\in\bm{\Theta}\), let \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) denote the loss, the expected loss (risk) be \(L(\bm{\theta})=\mathbb{E}_\nu[\ell(\mathbf{y},\mathbf{x},\bm{\theta})]\), and the empirical loss be \(\hat{L}(\bm{\theta},D)=\frac{1}{n}\sum_{i=1}^n \ell(\mathbf{y}_i,\mathbf{x}_i,\bm{\theta})\). Define the subset of zero-variance models as \(\bm{\Theta}_0 := \{\bm{\theta}\in\bm{\Theta}:\,\mathbb{V}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta}))=0\}\).
Computations are carried out on a finite-precision machine. If the model is represented by \(p\) parameters and each parameter has numerical precision \(\log_2(k)\) bits, the parameter space can be viewed as a finite grid in \(\mathbb{R}^p\) with at most \(k^{p}\) distinct models (e.g., \(k=2^{32}\) for single precision and \(k=2^{64}\) for double precision). This discretization resides in \(\mathbb{R}^{p}\) and does not restrict the definition of the loss, which may be specified on a continuous domain. Section 5.3.10 discusses how the results may extend to infinite model classes using recently proposed PAC-Bayes–Chernoff bounds (Casado et al., 2024). The specific assumptions adopted in this work are as follows:
Assumption 5.9. The loss is lower-bounded; more precisely, the essential infimum of the loss is finite and non-negative. That is, \(\forall \bm{\theta} \in \bm{\Theta}\), \begin{equation} \begin{aligned} m_{\bm{\theta}} &:=\operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y})\, \in\, \mathrm{supp}(\nu)} \ell(\mathbf{y},\mathbf{x},\bm{\theta}) \\ &= \sup \Big\{ a \in \mathbb{R} \;:\; \nu\big(\{(\mathbf{x}, \mathbf{y}) \in \mathcal{X} \times \mathcal{Y} \mid \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}) < a \}\big) = 0 \Big\} \geq 0\,. \end{aligned} \end{equation} In words, the greatest lower bound on \(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\) that holds almost everywhere under the data distribution \(\nu\) is finite and non-negative. Furthermore, it is assumed that the expected loss is always finite, \(\forall \bm{\theta}\in\bm{\Theta}\,, L(\bm{\theta})< \infty\).
This assumption ensures that the loss function is lower-bounded and the generalization error is finite (although the loss need not be upper-bounded). The first part of the assumption is naturally satisfied in many standard problems; for example, in multiclass classification problems with softmax activation and the cross-entropy loss and in regression problems with mean squared error, where \(m_{\bm{\theta}} = 0\). The second part of the assumption (\(L(\bm{\theta})< \infty\)) is mainly introduced for the sake of simplicity in the mathematical exposition. In fact, it could be relaxed in many of the theoretical results of this work, under some considerations. However, within the scope of this thesis, that is, studying the generalization error of interpolators, assuming a finite expected error, and thus a finite generalization error, is a reasonable simplification for the exposition of the theoretical results.
Some extra considerations regarding this assumption arise in the case of a Gaussian likelihood and log-loss, which is another common setting in machine learning. In this setup, the density has to be lower than one for any data sample and the variance of the Gaussian distribution cannot be null in order to satisfy Assumption 5.9. However, this is not restrictive: Gaussian densities below one are common in high dimensions and can ultimately be imposed with a restriction on the variances. On the other hand, non-zero variances are usually desirable to ensure the stability of machine learning models. In fact, variances are typically restricted to be positive or computed as the exponential of a logarithm-scaled variable, ensuring their positiveness.
As an example of a case not covered by Assumption 5.9, the loss cannot be Gaussian-distributed, as \(m_{\bm{\theta}} = - \infty\) in this case. By contrast, an exponentially distributed loss does satisfy Assumption 5.9.
The Rate Function
The rate function, denoted by \(\mathcal{I}_{\bm{\theta}}(a)\), plays a central role in Large Deviation Theory (LDT) (Ellis, 2012), a branch of probability theory tightly connected to statistical mechanics, that deals with understanding the behavior of rare events or large fluctuations in random systems. The rate function is normally used to understand the underlying structure of these events and how their probabilities change as they move away from the typical or average behavior.
Mathematically, the rate function is defined as the Legendre transform of the cumulant-generating function, denoted by \(J_{\bm{\theta}}(\lambda)\), which is the natural logarithm of the moment-generating function of a random variable, in this case \(L(\bm{\theta}) - \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\) with \(\mathbf{y},\mathbf{x}\sim\nu(\mathbf{y},\mathbf{x})\).
Definition 5.10 Rate Function. For any model \(\bm{\theta} \in \bm{\Theta}\), its rate function is a real-valued function \(\mathcal{I}_{\bm{\theta}}: [0,L(\bm{\theta})-m_{\bm{\theta}}) \to \mathbb{R}^{+}_{0}\), defined as \begin{equation} \label{eq:ratefunction} \mathcal{I}_{\bm{\theta}}(a)=\sup_{\lambda>0}\ \lambda a - J_{\bm{\theta}}(\lambda) \quad \forall a \in [0,L(\bm{\theta})-m_{\bm{\theta}})\,, \end{equation} where the cumulant-generating function is a real-valued function \(J_{\bm{\theta}}: \mathbb{R}^{+}_{0} \to \mathbb{R}^{+}_{0}\), defined as \begin{equation} J_{\bm{\theta}}(\lambda) = \log \mathbb{E}_{\nu}\Big[e^{\lambda (L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))}\Big] \quad \forall \lambda \geq 0\,. \end{equation}
The inverse of the rate function, denoted \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\), will also play a relevant role in this thesis.
Definition 5.11 Inverse Rate Function. For any model \(\bm{\theta} \in \bm{\Theta}\), the inverse rate function is a real-valued function \(\mathcal{I}_{\bm{\theta}}^{-1}: \mathbb{R}^{+}_{0}\to [0,L(\bm{\theta})-m_{\bm{\theta}}]\), defined as \begin{equation} \label{eq:inverseratefunction} \mathcal{I}^{-1}_{\bm{\theta}}(s)=\inf_{\lambda>0} \frac{J_{\bm{\theta}}(\lambda) + s}{\lambda}\quad \forall s \geq 0\,. \end{equation}
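As a worked illustration of Definitions 5.10 and 5.11, not taken from the original analysis, consider a \(0/1\)-loss for which the model errs with probability \(p\), so that \(L(\bm{\theta})=p\) and \(m_{\bm{\theta}}=0\). In this case the rate function has the known closed form \(\mathcal{I}_{\bm{\theta}}(a)=\mathrm{KL}\bigl(\mathrm{Ber}(p-a)\,\|\,\mathrm{Ber}(p)\bigr)\). The sketch below approximates both definitions by a brute-force search over a grid of \(\lambda\) values; the function names, the grid, and the values \(p=0.3\), \(a=0.1\) are illustrative assumptions.

```python
import math


def cumulant(lam, p):
    """J(lam) = log E[exp(lam * (L - loss))] for a 0/1 loss with P(loss = 1) = p, L = p."""
    return lam * p + math.log((1.0 - p) + p * math.exp(-lam))


def make_grid(lo=1e-4, hi=60.0, size=20000):
    """Log-spaced grid of lambda > 0 for brute-force search (an illustrative choice)."""
    step = math.log(hi / lo) / (size - 1)
    return [lo * math.exp(i * step) for i in range(size)]


def rate(a, p, grid):
    """I(a) = sup_{lam > 0} lam * a - J(lam), approximated on the grid."""
    return max(lam * a - cumulant(lam, p) for lam in grid)


def inverse_rate(s, p, grid):
    """I^{-1}(s) = inf_{lam > 0} (J(lam) + s) / lam, approximated on the grid."""
    return min((cumulant(lam, p) + s) / lam for lam in grid)


def bernoulli_kl(q, p):
    """KL(Bernoulli(q) || Bernoulli(p)), the closed form of I(a) at q = p - a."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


p, a = 0.3, 0.1
grid = make_grid()
i_a = rate(a, p, grid)
print(i_a, bernoulli_kl(p - a, p))   # grid value next to the closed form
print(inverse_rate(i_a, p, grid))    # should recover a, up to grid error
```

Applying `inverse_rate` to `rate(a, ...)` returns approximately `a`, which is the sense in which the second definition inverts the first on \([0, L(\bm{\theta})-m_{\bm{\theta}})\).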
Note that, when \(\mathbb{P}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})\neq 0\), \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) is constantly equal to \(L(\bm{\theta}) - m_{\bm{\theta}}\) for all \(s \geq -\log \mathbb{P}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})\) and, in consequence, \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) is a generalized inverse of \(\mathcal{I}_{\bm{\theta}}(a)\) (Rockafellar, 1970). For the sake of simplicity, it is assumed throughout the rest of this thesis that \(\mathcal{I}_{\bm{\theta}}(a)=\infty\) for \(a> L(\bm{\theta})-m_{\bm{\theta}}\). Furthermore, arithmetic operations and comparisons involving this infinite value are understood in the usual extended-real sense. According to the following proposition, both the rate function and its inverse are well-defined for models satisfying Assumption 5.9.
Proposition 5.12. Under Assumption 5.9, \(\forall\bm{\theta}\in\bm{\Theta}\), \(\mathcal{I}_{\bm{\theta}}(\cdot)\) and \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\), are well defined. That is, \(\forall a\in[0,L(\bm{\theta})-m_{\bm{\theta}})\), \(\mathcal{I}_{\bm{\theta}}(a)<\infty\) and \(\forall s\in\mathbb{R}^{+}_{0}\), \(\mathcal{I}^{-1}_{\bm{\theta}}(s)<\infty\).
Proof
First, by Assumption 5.9, \(m_{\bm{\theta}}\geq 0\), and hence \(\ell(\mathbf{y},\mathbf{x},\bm{\theta}) \geq 0\) for \(\nu\)-almost every \((\mathbf{x}, \mathbf{y}) \in \mathcal{X}\times\mathcal{Y}\). Then, for any \(\lambda\geq 0\), \(\lambda(L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))\leq \lambda L(\bm{\theta})\). Taking exponentials and expectations, \begin{equation} \mathbb{E}_{\nu}\left[e^{\lambda (L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))}\right] \leq \mathbb{E}_{\nu}\left[e^{\lambda L(\bm{\theta})}\right]\,. \end{equation} Lastly, the expectation on the r.h.s. is that of a constant, leading to \begin{equation} J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu}\left[e^{\lambda (L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))}\right] \leq \ln \mathbb{E}_{\nu}\left[e^{\lambda L(\bm{\theta})}\right] = \lambda L(\bm{\theta})\,. \end{equation} Since, by Assumption 5.9, \(L(\bm{\theta})<\infty\) for all \(\bm{\theta}\in\bm{\Theta}\), the function \(J_{\bm{\theta}}(\lambda)\) is well-defined for \(\lambda>0\). It remains to show that the supremum over \(\lambda\) in the definition of the rate function is attained, that is, that it is actually a maximum. Firstly, \begin{equation} \frac{\partial}{\partial \lambda} (\lambda a - J_{\bm{\theta}}(\lambda)) = a - \frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda)\,, \end{equation} and the second derivative of \(\lambda a - J_{\bm{\theta}}(\lambda)\) is non-positive because \(\frac{\partial^2}{\partial \lambda ^2} J_{\bm{\theta}}(\lambda) \geq 0\) (the cumulant-generating function is convex). As a result, the maximum is reached where the previous derivative is zero, that is, at a \(\lambda^\star\) such that \(a=\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda^\star)\). It is therefore necessary to show that such a \(\lambda^\star \in \mathbb{R}^+\) exists for every \(a \in (0, L(\bm{\theta}) - m_{\bm{\theta}})\).
The derivative \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda)\) is a continuous function, as it is a combination of continuous functions: \begin{equation} \frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda) = \frac{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \ln p(\mathbf{y} | \mathbf{x}, \bm{\theta})]}{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda]} - \mathbb{E}_{\nu}[\ln p(\mathbf{y}|\mathbf{x}, \bm{\theta})]\,. \end{equation} By standard properties of the cumulant generating function, \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(0) = 0\). On the other hand, also by standard properties of the cumulant (Herdegen, 2008, Lemma 1), \begin{equation} \label{eq:limitGradientJ} \lim_{\lambda\rightarrow \infty } \frac{\partial}{\partial \lambda}J_{\bm{\theta}}(\lambda) = \operatorname*{ess\,sup}_{(\mathbf{x}, \mathbf{y})}\ \big(L(\bm{\theta}) - \ell(\mathbf{y},\mathbf{x},\bm{\theta})\big) = L(\bm{\theta}) - m_{\bm{\theta}}\,. \end{equation} Thus, since \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda)\) is continuous, \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(0) = 0\) and \(\lim_{\lambda\rightarrow \infty } \frac{\partial}{\partial \lambda}J_{\bm{\theta}}(\lambda)=L(\bm{\theta}) - m_{\bm{\theta}}\), the intermediate value theorem guarantees that \(\forall a \in [0, L(\bm{\theta}) - m_{\bm{\theta}})\) there always exists a \(\lambda^\star \in \mathbb{R}^+\) such that \(a=\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda^\star)\). Thus, for such values of \(a\) the rate function is finite and well defined.
For \(a\geq L(\bm{\theta})- m_{\bm{\theta}}\), the supremum is reached when \(\lambda\to \infty\), because \(J_{\bm{\theta}}(\lambda)\) is monotonically increasing. Every \(a\geq L(\bm{\theta})- m_{\bm{\theta}}\) can be written as \(a=L(\bm{\theta})-b\), where \(b\leq m_{\bm{\theta}}\). Then the limit when \(\lambda\to \infty\) for any \(a\geq L(\bm{\theta})- m_{\bm{\theta}}\) can be written as follows, \begin{align} \lim_{\lambda\rightarrow\infty} \lambda (L(\bm{\theta})- b) - J_{\bm{\theta}}(\lambda) &= \lim_{\lambda\rightarrow\infty} -\lambda b -\ln \mathbb{E}_\nu\left[p(\mathbf{y}|\mathbf{x},\bm{\theta})^\lambda\right] = \lim_{\lambda\rightarrow\infty} -\ln \mathbb{E}_\nu\left[\left(\frac{p(\mathbf{y}|\mathbf{x},\bm{\theta})}{e^{-b}}\right)^\lambda\right]\\ &= -\ln \mathbb{E}_\nu\left[\lim_{\lambda\rightarrow\infty}\left(\frac{p(\mathbf{y}|\mathbf{x},\bm{\theta})}{e^{-b}}\right)^\lambda\right] = -\ln \mathbb{E}_\nu\left[\mathbb{I}(p(\mathbf{y}|\mathbf{x},\bm{\theta})=e^{-b})\right]\\ &= -\ln \mathbb{E}_\nu[\mathbb{I}(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=b)] = -\ln \mathbb{P}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=b)\,. \end{align} If \(b< m_{\bm{\theta}}\) or, equivalently, \(a>L(\bm{\theta})-m_{\bm{\theta}}\), then \(\mathcal{I}_{\bm{\theta}}(a)=\infty\). Moreover, if \(a=L(\bm{\theta})-m_{\bm{\theta}}\), then \(\mathcal{I}_{\bm{\theta}}(a) = -\ln \mathbb{P}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})\), which may be finite or equal to \(\infty\).
By definition of the inverse rate function, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s)=\inf_{\lambda\geq 0}\frac{s+J_{\bm{\theta}}(\lambda)}{\lambda}\leq \lim_{\lambda\rightarrow\infty}\frac{s+J_{\bm{\theta}}(\lambda)}{\lambda} =\lim_{\lambda\rightarrow\infty}\frac{J_{\bm{\theta}}(\lambda)}{\lambda}\,. \end{equation} Then, by L’Hôpital’s rule and Equation \(\eqref{eq:limitGradientJ}\), \begin{equation} \lim_{\lambda\rightarrow\infty}\frac{J_{\bm{\theta}}(\lambda)}{\lambda} =\lim_{\lambda\rightarrow\infty}\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda) =L(\bm{\theta}) - m_{\bm{\theta}}\,. \end{equation} Since, by Assumption 5.9, \(L(\bm{\theta})<\infty\) and \(m_{\bm{\theta}}\geq 0\), it follows that \(\mathcal{I}^{-1}_{\bm{\theta}}(s)<\infty\) for all \(s\geq 0\). □
The following result states some properties of the rate and inverse rate functions, shedding some light on their monotonicity, curvature, and shape. In combination with Figure 5.8, this result should provide some intuition on how these two functions behave in general.
Proposition 5.13 (Rockafellar, 1970). For any \(\bm{\theta}\in\bm{\Theta}\), the rate function \(\mathcal{I}_{\bm{\theta}}(\cdot)\) and the inverse rate function \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\) satisfy the following properties,
\(\mathcal{I}_{\bm{\theta}}(\cdot)\) is convex and \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\) is concave; both monotonically increasing.
Their derivatives at the origin are characterized as: \begin{equation} \lim_{a\to 0} \frac{\partial}{\partial a} \mathcal{I}_{\bm{\theta}}(a) = 0 \quad \text{and} \quad \lim_{s \to 0}\frac{\partial}{\partial s} \mathcal{I}^{-1}_{\bm{\theta}}(s) = +\infty\,. \end{equation}
If \(-\log \mathbb{P}(\ell(\mathbf{y},\mathbf{x},\bm{\theta}) = m_{\bm{\theta}}) < +\infty\), then \begin{align} &\lim_{a\rightarrow (L(\bm{\theta}) - m_{\bm{\theta}})^-} \mathcal{I}_{\bm{\theta}}(a) = -\log \mathbb{P}(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})\,, \\ &\lim_{s\to -\log \mathbb{P}(\ell(\mathbf{y},\mathbf{x},\bm{\theta}) = m_{\bm{\theta}})} \mathcal{I}^{-1}_{\bm{\theta}}(s)=L(\bm{\theta}) - m_{\bm{\theta}}\,. \end{align} Otherwise, the same limits hold treating the negative logarithm as infinite.
\(\mathcal{I}_{\bm{\theta}}(\cdot)\) and \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\) are invariant to reparameterizations.
Note that the rate function of a Gaussian random variable, like many other standard random variables, does not have a vertical asymptote. However, if the loss follows a Gaussian distribution, it will not satisfy the first condition in Assumption 5.9, as the essential infimum is not finite.
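To make the first remark concrete, consider a short worked example under the assumption, introduced here only for illustration, that the centered loss \(L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is Gaussian with variance \(\sigma^2\). Its cumulant is then \(J_{\bm{\theta}}(\lambda)=\tfrac{1}{2}\sigma^2\lambda^2\), and therefore \begin{align} \mathcal{I}_{\bm{\theta}}(a) &= \sup_{\lambda>0}\ \Big(\lambda a - \tfrac{1}{2}\sigma^2\lambda^2\Big) = \frac{a^2}{2\sigma^2}\,, \quad \text{attained at } \lambda^\star = a/\sigma^2\,,\\ \mathcal{I}^{-1}_{\bm{\theta}}(s) &= \inf_{\lambda>0}\ \frac{s+\tfrac{1}{2}\sigma^2\lambda^2}{\lambda} = \sqrt{2\sigma^2 s}\,, \quad \text{attained at } \lambda^\star = \sqrt{2s}/\sigma\,. \end{align} In this case the rate function is finite for every \(a\geq 0\), so no vertical asymptote appears, whereas for losses with a finite essential infimum the rate function becomes infinite beyond \(a = L(\bm{\theta})-m_{\bm{\theta}}\).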
The relevance of the rate function is a consequence of the following results; firstly, the classic Chernoff bound defines how likely it is to observe, over different i.i.d. data sets, an empirical loss \(\hat{L}(D, \bm{\theta})\) that deviates from the expected loss \(L(\bm{\theta})\) by a positive quantity \(a > 0\).
Theorem 5.14 (Chernoff, 1952). For any fixed \(\bm{\theta}\in\bm{\Theta}\) and \(a>0\), it satisfies \begin{equation} \mathbb{P}_{D\sim \nu^n}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq a\Big)\leq e^{-n \mathcal{I}_{\bm{\theta}}(a)}\,. \end{equation}
Proof
Cramér-Chernoff’s bound states that, for any random variable \(X\), \(P(X \geq s) \leq \inf_{t > 0} \mathbb{E}[e^{t X}]e^{-t s}\). Applying this result to the random variable \(L(\bm{\theta}) - \hat{L}(D,\bm{\theta})\), whose randomness comes from the data set \(D\), for a fixed \(\bm{\theta}\in\bm{\Theta}\), leads to \begin{equation} P( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq s) \leq \inf_{t > 0} \mathbb{E}\left[e^{t ( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) ) }\right]e^{- t s}\,. \end{equation} The expectation in the r.h.s. can be transformed as \begin{align} \mathbb{E}\left[e^{t ( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) )}\right] &= \mathbb{E}\left[e^{t ( \tfrac{1}{n}\ln p(D|\bm{\theta})- \mathbb{E}_\nu[\ln p(\mathbf{y}| \mathbf{x}, \bm{\theta})])}\right] = \mathbb{E}\left[e^{\tfrac{t}{n}\ln p(D|\bm{\theta})}\right] e^{-t\mathbb{E}_\nu[\ln p(\mathbf{y}| \mathbf{x}, \bm{\theta})]}\,. \end{align} Moreover, since the samples in \(D\) are i.i.d., the first expectation in this last term is \begin{align} \mathbb{E}\left[e^{\tfrac{t}{n}\ln p(D|\bm{\theta})}\right] &= \mathbb{E}_{\nu^n}\left[p(D|\bm{\theta})^{\tfrac{t}{n}}\right] = \mathbb{E}_{\nu}\left[p(\mathbf{y}|\mathbf{x},\bm{\theta})^{\tfrac{t}{n}}\right]^n =\mathbb{E}_{\nu}\left[e^{\tfrac{t}{n}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta})}\right]^n\,. \end{align} Using this, and parameterizing \(t\) as \(\lambda n\), with \(\lambda > 0\), \begin{equation} P(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq s) \leq \inf_{\lambda > 0} \mathbb{E}\left[e^{\lambda ( L(\bm{\theta}) - \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) }\right]^n e^{- \lambda n s}\,. 
\end{equation} Taking exponential and logarithm on the r.h.s., and using the definitions of the cumulant function \(J_{\bm{\theta}}(\lambda)\) and the rate function \(\mathcal{I}_{\bm{\theta}}(a)\): \begin{align} P(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq s) &\leq \inf_{\lambda > 0} e^{ n \ln \mathbb{E}\left[e^{\lambda (L(\bm{\theta}) - \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))}\right] - \lambda ns} = \inf_{\lambda > 0} e^{nJ_{\bm{\theta}}(\lambda) - \lambda n s} = e^{-n \mathcal{I}_{\bm{\theta}}(s)}\,. \end{align} □
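As a sanity check of Theorem 5.14, the following minimal Python sketch compares the exact tail probability with the Chernoff bound. The two-valued toy loss, the grid of \(\lambda\) values and all function names are assumptions made here for illustration and are not part of the framework above. Since the grid search can only underestimate the supremum defining \(\mathcal{I}_{\bm{\theta}}(a)\), the computed bound is slightly looser than the exact one but remains valid.

```python
import math

# Hypothetical toy loss: 0 with probability 0.8 and 2 with probability 0.2,
# so that L = 0.4 and the minimum loss is m = 0.
P_HIGH, HIGH = 0.2, 2.0
L = P_HIGH * HIGH

def cumulant(lam):
    """J(lam) = ln E[exp(lam * (L - loss))] for the toy loss."""
    return math.log((1 - P_HIGH) * math.exp(lam * L)
                    + P_HIGH * math.exp(lam * (L - HIGH)))

def rate(a, lambdas):
    """Grid-search lower estimate of I(a) = sup_lam (lam * a - J(lam))."""
    return max(lam * a - cumulant(lam) for lam in lambdas)

def exact_tail(n, a):
    """Exact P(L - L_hat >= a), with L_hat the mean of n i.i.d. losses."""
    # L_hat = HIGH * K / n with K ~ Binomial(n, P_HIGH), so the event
    # L - L_hat >= a is equivalent to K <= n * (L - a) / HIGH.
    k_max = math.floor(n * (L - a) / HIGH + 1e-9)
    return sum(math.comb(n, k) * P_HIGH**k * (1 - P_HIGH)**(n - k)
               for k in range(k_max + 1))

lambdas = [0.01 * i for i in range(1, 2001)]  # lambda in (0, 20]
a = 0.2
for n in (10, 50, 200):
    bound = math.exp(-n * rate(a, lambdas))
    print(f"n={n:4d}  exact={exact_tail(n, a):.3e}  chernoff={bound:.3e}")
```

In this toy setting the exact tail stays below the bound for every \(n\), and both decay exponentially as \(n\) grows.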
The above bound is relevant in this thesis because of two properties that characterize its behavior w.r.t. the data set size \(n\) and the generalization gap \(a\). On the one hand, Chernoff’s bound is known to be quite loose near the mean of the variable (when \(a \approx 0\)) but tight on the tail (when \(a \approx L(\bm{\theta}) - m_{\bm{\theta}}\)). As a result, the bound is especially useful for interpolators, where the value of \(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})\) is close to its maximum possible value (\(a = L(\bm{\theta}) - m_{\bm{\theta}}\)), as stated in the following result.
Proposition 5.15. For any fixed \(\bm{\theta}\in\bm{\Theta}\) and \(n>0\), it satisfies \begin{equation} \lim_{a\rightarrow L(\bm{\theta}) - m_{\bm{\theta}}}\mathbb{P}_{D\sim \nu^n}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq a\Big) = \lim_{a\rightarrow L(\bm{\theta}) - m_{\bm{\theta}}} e^{-n \mathcal{I}_{\bm{\theta}}(a)}\,. \end{equation}
Proof
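As \(a\rightarrow L(\bm{\theta}) - m_{\bm{\theta}}\), the event on the l.h.s. reduces to \(\hat{L}(D, \bm{\theta})=m_{\bm{\theta}}\), which requires each of the \(n\) i.i.d. samples to attain the minimum loss, while, by Proposition 5.13, \(\mathcal{I}_{\bm{\theta}}(a)\rightarrow -\ln \mathbb{P}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})\) on the r.h.s. Hence, the limit on both sides is \begin{equation} \mathbb{P}_{D \sim \nu^n}(\hat{L}(D, \bm{\theta})=m_{\bm{\theta}}) = \mathbb{P}_{(\mathbf y, \mathbf x) \sim \nu}(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})^n\,. \end{equation} □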
On the other hand, Cramér’s Theorem (Cramér, 1938) states that Chernoff’s bound is exponentially tight for large \(n\). Formally, this statement is written as follows,
Theorem 5.16 (Cramér, 1938; Ellis, 2012). For any fixed \(\bm{\theta}\in \bm{\Theta}\) and any \(a>0\), it satisfies \begin{equation} \lim_{n\rightarrow \infty} -\frac{1}{n}\log \mathbb{P}_{D \sim \nu^n}\Big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq a\Big) = \mathcal{I}_{\bm{\theta}}(a)\,. \end{equation}
Proof
The result follows by directly applying Cramér’s Theorem (Cramér, 1938; Ellis, 2012) to the random variable \(X=L(\bm{\theta}) - \hat{L}(D,\bm{\theta})\), for a fixed \(\bm{\theta}\), where the randomness comes from \(D\sim\nu^n\), in the same way Chernoff’s bound is applied in Theorem 5.14. □
In LDT, the above asymptotic result is intuitively interpreted using the following equality, \begin{equation} \label{eq:asympoticEquality} \mathbb{P}_{D \sim \nu^n}\Big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq a\Big)= e^{-n\mathcal{I}_{\bm{\theta}}(a) + o(n, a)}\,, \end{equation} which shows that the exact expression of \(\mathbb{P}_{D\sim \nu^n}(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq a)\) is determined by the rate function up to a sub-exponential term that is negligible when \(n\) is large, because \(\lim_{n\rightarrow\infty} \tfrac{o(n,a)}{n}=0\). Then, according to LDT, when \(n\) is large, the rate function is the key quantity describing the generalization error of a model, that is, the statistical behavior of the difference between the expected loss \(L(\bm{\theta})\) and the empirical loss \(\hat{L}(D, \bm{\theta})\).
Summarizing the preceding analysis, the Chernoff bound, governed by the rate function \({\cal I}_{\bm{\theta}}(\cdot)\), is tighter for larger datasets (when \(n\rightarrow\infty\)) and for models interpolating the data (\(a \approx L(\bm{\theta}) - m_{\bm{\theta}}\)), which are the settings of modern machine learning.
The Chernoff bound asserts that a model’s generalization error is governed by its rate function. This rate function is an oracle, distribution-dependent quantity because it depends on the data-generating distribution \(\nu\). Although \(\nu\) is unknown in typical machine-learning settings, this dependence is informative: it helps explain why data augmentation is effective, why over-parameterized model families are beneficial, and why invariant architectures (e.g., convolutional neural networks) are desirable. Moreover, the rate function can be estimated from an independent dataset serving as a proxy for \(\nu\). This aligns with standard practice that uses a separate validation dataset to estimate \(L(\bm{\theta})\) while avoiding data snooping. As shown in Section 5, the rate function \(\mathcal{I}_{\bm{\theta}}(\cdot)\) can be estimated by evaluating the model once on the validation set and applying log-sum-exp operations to compute \(J_{\bm{\theta}}(\lambda)\), followed by a simple grid search to optimize \(\lambda\) in Definition 5.10. The practical consequence—that the cumulant and rate functions can be readily estimated and plotted (see Figure 5.9)—is noteworthy, as it enables empirical illustration and validation of the framework’s theoretical predictions.
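The estimation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for Figure 5.9: the per-sample losses are synthetic (exponential draws standing in for the losses of a model on a validation set), and the function names and the grid of \(\lambda\) values are assumptions made here.

```python
import math
import random

def log_mean_exp(values):
    """Stable ln(mean(exp(v))) computed with the log-sum-exp trick."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def cumulant(losses, lam):
    """Plug-in estimate of J(lam) = ln E[exp(lam * (L - loss))]."""
    mean_loss = sum(losses) / len(losses)
    return log_mean_exp([lam * (mean_loss - loss) for loss in losses])

def rate(losses, a, lambdas):
    """Grid-search estimate of I(a) = sup_lam (lam * a - J(lam))."""
    return max(lam * a - cumulant(losses, lam) for lam in lambdas)

# Synthetic stand-in for per-sample validation losses (one model evaluation).
random.seed(0)
val_losses = [random.expovariate(2.0) for _ in range(1000)]
lambdas = [0.1 * i for i in range(0, 101)]  # lambda in [0, 10]

for a in (0.0, 0.1, 0.2, 0.3):
    print(f"a={a:.1f}  estimated I(a) = {rate(val_losses, a, lambdas):.4f}")
```

One caveat of this plug-in estimate is that, for large \(\lambda\), the empirical cumulant is dominated by the smallest observed losses, so the estimate is most reliable for moderate values of \(a\).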
| Inception | Crop | L2 | Train Acc. | Test Acc. | Test NLL | \(\ell_2\)-norm |
|---|---|---|---|---|---|---|
| Standard | no | no | \(99.99\%\) | \(84.36 \%\) | \(0.65\) | \(304\) |
| Crop | yes | no | \(99.94\%\) | \(86.89\%\) | \(0.58\) | \(309\) |
| L2 | no | yes | \(100.0\%\) | \(86.60 \%\) | \(0.49\) | \(200\) |
| L2-Crop | yes | yes | \(99.98\%\) | \(88.45 \%\) | \(0.42\) | \(130\) |
| Random | no | no | \(100.0\%\) | \(10.13 \%\) | \(5.52\) | \(311\) |
| Initial | - | - | \(10.00\%\) | \(10.00 \%\) | \(2.30\) | \(593\) |
Let us briefly illustrate why the rate function is useful to understand the generalization of interpolators. Figure 5.9 displays an estimation of the rate function \(\mathcal{I}_{\bm{\theta}}(\cdot)\) for some neural network models used in (Zhang et al., 2017). This work demonstrated that, within the same model class, interpolators that merely memorize the training data (as represented by “Random” in the figure, which has been learned using a random-labeled training data set) can co-exist with others that generalize exceptionally well, as the ones learned using weight-norm regularization (labeled as “L2”) and/or data-augmentation techniques (random-cropping labeled as “Crop”). According to Chernoff’s bound, models with a larger rate function are less likely to have significant disparities between their expected and empirical losses; in other words, the empirical loss \(\hat{L}(D, \bm{\theta})\) is more concentrated around its mean \(L(\bm{\theta})\). Figure 5.10 illustrates this fact for three of these models by plotting histograms that showcase the distribution of \(\hat{L}(D, \bm{\theta})\) across various data sets of size \(n=50\) (retrieved from the test set). From the histograms, it is clear that the concentration of \(\hat{L}(D, \bm{\theta})\) varies among the models; the initial model, defined by Kaiming or He initialization (Goodfellow et al., 2016), has a prominent rate function. For this model, \(\hat{L}(D, \bm{\theta})\) is tightly concentrated around its mean, \(L(\bm{\theta})=\log 10\), observable through the minute scale of its x-axis. Comparatively, the Standard model, trained using SGD and characterized by a smaller rate function, has a more dispersed distribution around its mean \(L(\bm{\theta})=0.65\). This dispersion is notably wider than that of the L2-Crop model. The latter, also trained with SGD but incorporating both \(\ell_2\) regularization and data-augmentation, has a larger rate function than the Standard model.
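[Figure 5.10: histograms of \(\hat{L}(D, \bm{\theta})\) over data sets of size \(n=50\). Panels, left to right: Standard, L2-Crop, Initial Model.]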
The main purpose of Figures 5.9 and 5.10 is to illustrate that \(\hat{L}(D,\bm{\theta})\) is a random variable whose concentration around its mean \(L(\bm{\theta})\) can vary substantially, a property captured by the model’s rate function. When a model interpolates the training data, the observed value of \(\hat{L}(D,\bm{\theta})\) should be interpreted as one realization of this random variable for the particular dataset \(D\). Consequently, when comparing two models that both interpolate, preference should be given to the one whose empirical loss is more tightly concentrated around its mean (i.e., exhibits a larger rate function), as it is more likely to achieve a smaller expected loss on new data. For example, the left and center panels of Figure 5.10 depict the distributions of \(\hat{L}(D,\bm{\theta})\) for the Standard and L2-Crop models, respectively; since both interpolate the training set but L2-Crop exhibits a more concentrated empirical-loss distribution, L2-Crop is preferable.
Generalization of Interpolators
This section presents the main results, Theorems 5.17 and 5.23, which demonstrate that the (inverse) rate function characterizes the generalization performance of interpolating models. In addition, a new notion of model smoothness is introduced that aligns with the principle that, under suitable conditions, a sufficiently smoother interpolator generalizes better.
Distribution-Dependent Bounds for Over-parameterized Interpolators
As discussed in the introduction, current high-probability generalization bounds in the form of Equation \(\eqref{eq:generalbounds}\) have been unable so far to explain the generalization of over-parameterized interpolators. Different works (Gastpar et al., 2024; Nagarajan, 2021; Nagarajan & Kolter, 2019b; Wang et al., 2024) have also shown that the problem lies in the impossibility of having tight generalization bounds solely depending on the training data for over-parameterized model classes. Here, a tight bound refers to one whose value closely matches the true generalization error, leaving little or no unexplained gap between the theoretical performance and the bound itself. (Gastpar et al., 2024) even concludes that “a bound without explicit distributional assumptions is likely to be not tight”. Thus, an open question arises:
Open Question 5.1. Are there tight distribution-dependent bounds for over-parameterized models?
The answer to this question is not straightforward, given that (Nagarajan & Kolter, 2019b) showed examples of distribution-dependent bounds (see Definition 3.3 in that work) which are not tight. However, the following result provides a uniform-convergence bound that holds simultaneously over the whole model class,
Theorem 5.17. [PAC-Chernoff Bound] With h.p. \(1 - \delta\) over \(D \sim \nu^n\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} L(\bm{\theta}) \leq \hat{L}(D, \bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta})\,. \end{equation}
Proof
By Chernoff’s Theorem 5.14, for a given \(\bm{\theta}\), it verifies that \(\mathbb{P}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq a\Big)\leq e^{-n \mathcal{I}_{\bm{\theta}}(a)}\). Naming \(\delta' = e^{-n \mathcal{I}_{\bm{\theta}}(a)}\) and re-arranging terms, \(a = \mathcal{I}^{-1}_{\bm{\theta}}(-\frac{1}{n}\ln \delta')\). As a result: \begin{equation} \mathbb{P}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\leq \delta'\,. \end{equation} Taking the complementary probability and using the union bound over the set of models, \begin{equation} \mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\leq \sum_{\bm{\theta}\in\bm{\Theta}} \mathbb{P}\Big( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\,. \end{equation} Given that the model space considers \(k^p\) different models, the r.h.s. can be rewritten as \begin{equation} \mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\leq k^p \delta' \,. \end{equation} By reparameterizing the above inequality with \(\delta'=\delta k^{-p}\): \begin{equation} \mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \Big)\leq \delta \,. \end{equation} Which verifies, \begin{equation} 1-\mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \Big)\geq 1-\delta\,. 
\end{equation} Which is equivalent to, \begin{equation} \mathbb{P}\Big( \bigcap_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \Big)\geq 1-\delta \,. \end{equation} □
In this bound, \(\mathcal{I}^{-1}_{\bm{\theta}}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta})\) defines a complexity measure for the model \(\bm{\theta}\) that depends on the size of the model class (through its \(p\) parameters), on the number \(n\) of training samples and, by the definition of \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\) (see Definition 5.11), on the data-generating distribution \(\nu\). By Proposition 5.13, \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\) is monotonically increasing, so this complexity measure grows with the size of the model class and decreases as either \(\delta\) (the admissible failure probability) or the size of the training data increases.
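The dependence of the complexity measure on \(n\), \(p\) and \(\delta\) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the two-valued toy loss, the values chosen for \(n\), \(p\), \(k\) and \(\delta\), the grid of \(\lambda\) values and the function names are all hypothetical. The grid search can only overestimate the infimum, so the printed values are upper estimates of \(\mathcal{I}^{-1}_{\bm{\theta}}\).

```python
import math

# Hypothetical toy loss: 0 with probability 0.8 and 2 with probability 0.2.
P_HIGH, HIGH = 0.2, 2.0
L = P_HIGH * HIGH

def cumulant(lam):
    """J(lam) = ln E[exp(lam * (L - loss))] for the toy loss."""
    return math.log((1 - P_HIGH) * math.exp(lam * L)
                    + P_HIGH * math.exp(lam * (L - HIGH)))

def inverse_rate(s, lambdas):
    """Grid-search upper estimate of I^{-1}(s) = inf_{lam>0} (s + J(lam)) / lam."""
    return min((s + cumulant(lam)) / lam for lam in lambdas)

def complexity(n, p, k, delta, lambdas):
    """I^{-1}((1/n) * ln(k^p / delta)), with ln(k^p) computed as p * ln(k)."""
    s = (p * math.log(k) - math.log(delta)) / n
    return inverse_rate(s, lambdas)

lambdas = [0.01 * i for i in range(1, 5001)]  # lambda in (0, 50]
base = complexity(n=1000, p=10, k=2, delta=0.05, lambdas=lambdas)
print("baseline        :", round(base, 4))
print("more data       :", round(complexity(10000, 10, 2, 0.05, lambdas), 4))
print("more parameters :", round(complexity(1000, 100, 2, 0.05, lambdas), 4))
print("larger delta    :", round(complexity(1000, 10, 2, 0.5, lambdas), 4))
```

Computing \(\log k^p\) as \(p\log k\) avoids forming \(k^p\) explicitly, which would overflow for realistic parameter counts.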
It is common to find models, within the same model class, that define the same loss function, \(\ell(\cdot, \cdot, \bm{\theta}_1) = \ell(\cdot, \cdot, \bm{\theta}_2)\). For instance, in a Multi-Layer Perceptron (MLP), weights can be permuted in specific ways without altering model predictions, and models with zeros in the last layer produce the same predictions regardless of the weights in other layers. Consequently, when calculating the size of the model class, models that define the same empirical loss across different datasets can be effectively “excluded”. Let \(\bar{\bm{\Theta}} \subseteq \bm{\Theta}\) be a subset of the model class such that, for any two distinct \(\bm{\theta}, \bm{\theta}' \in \bar{\bm{\Theta}}\), there exists a dataset \(D \sim \nu^n\) with \(\hat{L}(D, \bm{\theta}) \neq \hat{L}(D, \bm{\theta}')\). The following result provides a refined bound, based on the size of \(\bar{\bm{\Theta}}\).
Corollary 5.18. With h.p. \(1 - \delta\) over \(D \sim \nu^n\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} L(\bm{\theta}) \leq \hat{L}(D, \bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{|\bar{\bm{\Theta}}|}{\delta})\,. \end{equation}
Proof
The proof follows the same approach as Theorem 5.17, with the union bound applied to the models in \(\bar{\bm{\Theta}}\) rather than directly to the models in \(\bm{\Theta}\). This adjustment is valid because, when applying the union bound, only the total number of distinct random variables needs to be considered; in this case, \(\hat{L}(D,\bm{\theta})\) with \(D\sim\nu^n\). Models in \(\bm{\Theta}\) that are not in \(\bar{\bm{\Theta}}\) define random variables that are duplicated by definition and therefore do not need to be accounted for in the application of the union bound. □
For the remainder of this work, \(k^p\) is used in all the presented results. However, in each case, \(k^p\) may be replaced with the typically smaller term \(|\bar{\bm{\Theta}}|\) by applying the result above. The following result shows that, when \(n\) goes to infinity, the proposed complexity measure converges with rate \(1/\sqrt{n}\) to the (scaled) standard deviation of the loss function.
Theorem 5.19. For any \(\delta \in (0, 1)\) and any \(\bm{\theta}\in \bm{\Theta}\), it verifies that \begin{equation} \lim_{n\rightarrow\infty} \ \sqrt{n}\, \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) = \sqrt{2\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\log \tfrac{k^p}{\delta}}\,. \end{equation}
Proof
Let \(\bar{Z}_n := \sqrt{n} (L(\bm{\theta}) - \hat{L}(D,\bm{\theta}))\) and let \(J_{\bar{Z}_n}\), \({\cal I}_{\bar{Z}_n}\) and \({\cal I}^{-1}_{\bar{Z}_n}\) denote its cumulant, rate and inverse rate functions. Then, by properties of the cumulant, \(J_{\bar{Z}_n}(\lambda) = nJ_{\bm{\theta}}(\lambda/\sqrt{n})\). Using the definition of the rate function: \begin{equation} {\cal I}_{\bar{Z}_n}(a) =\sup_{\lambda} \ \Big(a\lambda - n J_{\bm{\theta}}(\lambda/\sqrt{n})\Big)=n\sup_{\lambda} \ \Big(\frac{\lambda}{\sqrt{n}}\frac{a}{\sqrt{n}} - J_{\bm{\theta}}(\lambda/\sqrt{n})\Big)=n\mathcal{I}_{\bm{\theta}}(a/\sqrt{n})\,. \end{equation} Using a second order Taylor expansion of \(\mathcal{I}_{\bm{\theta}}(a)\) around \(a=0\), where \(\mathcal{I}_{\bm{\theta}}(0) = 0\) and \(\frac{\partial}{\partial a}\mathcal{I}_{\bm{\theta}}(0) = 0\): \begin{equation} \mathcal{I}_{\bm{\theta}}(a/\sqrt{n}) = \frac{1}{2} \frac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) \left(\frac{a}{\sqrt{n}}\right)^2 + \left(\frac{a}{ \sqrt{n}}\right)^2 h(a/ \sqrt{n})\,, \end{equation} where \(\lim_{t\to 0} h(t) = 0\). Furthermore, it holds that (Ellis, 2012) \begin{equation} \frac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) = \left( \frac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(0)\right)^{-1}\,. 
\end{equation} From there: \begin{align} \frac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(\lambda)&= - \frac{\partial}{\partial \lambda} \frac{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})} \ell(\mathbf{y},\mathbf{x},\bm{\theta})]}{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})}]}\\ &= \frac{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})} (\ell(\mathbf{y},\mathbf{x},\bm{\theta}))^2]}{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})}]} - \frac{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})} \ell(\mathbf{y},\mathbf{x},\bm{\theta})]^2}{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})}]^2}\,, \end{align} which, evaluated at \(\lambda=0\), gives \begin{equation} \frac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(0)= \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\,. \end{equation} Let this last quantity be written as \(\sigma^2 := \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\). Then, \begin{align} \lim_{n\rightarrow\infty} {\cal I}_{\bar{Z}_n}(a) &= \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}}(a/\sqrt{n}) = \lim_{n\rightarrow\infty} n \frac{1}{2} \sigma^{-2} \left(\frac{a}{ \sqrt{n}}\right)^2 + n \left(\frac{a}{ \sqrt{n}}\right)^2 h(a/ \sqrt{n})\\ &=\frac{1}{2} \sigma^{-2} a^2\,. \end{align} Then, as both the rate and its inverse are continuous functions, \begin{equation} \lim_{n\rightarrow\infty} {\cal I}^{-1}_{\bar{Z}_n}(s) = \left(\lim_{n\rightarrow\infty} {\cal I}_{\bar{Z}_n}\right)^{-1}(s) = \sqrt{2\sigma^2s}\,. \end{equation} By definition of the inverse rate: \begin{align} {\cal I}^{-1}_{\bar{Z}_n}(s) &= \inf_{\lambda} \frac{s+J_{\bar{Z}_n}(\lambda)}{\lambda}=\inf_{\lambda} \frac{s+n J_{\bm{\theta}}(\lambda/\sqrt{n})}{\lambda}\\ &=\sqrt{n}\inf_{\lambda} \frac{\frac{s}{n}+ J_{\bm{\theta}}(\lambda/\sqrt{n})}{\frac{\lambda}{\sqrt{n}}}=\sqrt{n}\mathcal{I}^{-1}_{\bm{\theta}}(s/n)\,. 
\end{align} As a result, \begin{equation} \lim_{n\rightarrow\infty} \sqrt{n}\mathcal{I}^{-1}_{\bm{\theta}}(s/n) = \lim_{n\rightarrow\infty} {\cal I}^{-1}_{\bar{Z}_n}(s) = \sqrt{2\sigma^2s}\,. \end{equation} □
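The limit in Theorem 5.19 can be checked numerically. The following is a minimal sketch under stated assumptions: the two-valued toy loss (with variance \(0.64\)), the value \(s=3\) standing in for \(\log\tfrac{k^p}{\delta}\), the geometric grid of \(\lambda\) values and the function names are all hypothetical choices made here for illustration.

```python
import math

# Hypothetical toy loss: 0 with probability 0.8 and 2 with probability 0.2.
P_HIGH, HIGH = 0.2, 2.0
L = P_HIGH * HIGH
VARIANCE = P_HIGH * (1 - P_HIGH) * HIGH**2  # variance of the toy loss

def cumulant(lam):
    """J(lam) = ln E[exp(lam * (L - loss))] for the toy loss."""
    return math.log((1 - P_HIGH) * math.exp(lam * L)
                    + P_HIGH * math.exp(lam * (L - HIGH)))

def inverse_rate(s, lambdas):
    """Grid-search upper estimate of I^{-1}(s) = inf_{lam>0} (s + J(lam)) / lam."""
    return min((s + cumulant(lam)) / lam for lam in lambdas)

# Geometric grid, so that the small optimal lambda for large n is resolved.
lambdas = [1e-5 * 1.01**i for i in range(1500)]
s = 3.0  # plays the role of ln(k^p / delta); arbitrary illustrative value
limit = math.sqrt(2 * VARIANCE * s)

errors = []
for n in (10**2, 10**4, 10**6):
    scaled = math.sqrt(n) * inverse_rate(s / n, lambdas)
    errors.append(abs(scaled - limit))
    print(f"n={n:>8d}  sqrt(n) * I^-1(s/n) = {scaled:.4f}  (limit {limit:.4f})")
```

The gap to the limit shrinks as \(n\) grows, reflecting that only the second-order (variance) term of the cumulant survives the \(\sqrt{n}\) scaling.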
The relevant property of this novel complexity measure is that it is a perfectly tight proxy of the expected loss \(L(\bm{\theta})\) for models that interpolate the training data, \(\hat{L}(D, \bm{\theta})\leq \epsilon\), even if the model class is over-parameterized,
Proposition 5.20. With h.p. \(1 - \delta\) over \(D \sim \nu^n\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} \text{if } \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad\text{then}\quad 0 \leq L(\bm{\theta}) - \mathcal{I}^{-1}_{\bm{\theta}}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \epsilon\,. \end{equation}
Proof
By Theorem 5.17, \(L(\bm{\theta}) \leq \hat{L}(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\), and, by Proposition 5.12, \(\mathcal{I}_{\bm{\theta}}(a)\) is well defined \(\forall a\in[0,L(\bm{\theta})-m_{\bm{\theta}})\); and, \(\forall b > 0\), \(\mathcal{I}^{-1}_{\bm{\theta}}(b)\in [0,L(\bm{\theta})-m_{\bm{\theta}})\). In consequence: \begin{equation} L(\bm{\theta}) \leq \hat{L}(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \hat{L}(D,\bm{\theta}) + L(\bm{\theta}) -m_{\bm{\theta}}\,. \end{equation} The result follows from \(m_{\bm{\theta}}\geq0\), \(\hat{L}(D,\bm{\theta})\leq \epsilon\), and rearranging terms. □
Using those results, there exists an algorithm-distribution-dependent bound that is perfectly tight for any data distribution and any learning algorithm, even for over-parameterized model classes. However, for this bound to be tight, it is imperative that the model interpolates the training data. Formally, fix any \(\epsilon>0\) and let \(2^{\mathcal{X} \times \mathcal{Y}}\) denote the power set of all possible data sets of any size. Let \(\mathcal{A}_\epsilon: 2^{\mathcal{X} \times \mathcal{Y}} \to \bm{\Theta}\) be an algorithm that takes any data set and returns a model \(\bm{\theta} \in \bm{\Theta}\) whose training loss is at most \(\epsilon\). That is, for any training data set \(D \in 2^{\mathcal{X} \times \mathcal{Y}}\), \(\mathcal{A}_\epsilon\) satisfies \(\hat{L}(D, {\cal A}_{\epsilon}(D))\leq \epsilon\). This kind of algorithm is quite common in machine learning where the model space \(\bm{\Theta}\) is big enough that there exist models capable of memorizing random labels (Zhang et al., 2017, Theorem 1). Then, for any \({\cal A}_{\epsilon}\) algorithm, with h.p. \(1 - \delta\) over \(D \sim \nu^n\), \begin{equation} L({\cal A}_{\epsilon}(D)) \leq \hat{L}(D, {\cal A}_{\epsilon}(D)) + {\cal I}^{-1}_{{\cal A}_{\epsilon}(D)}\big(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}\big)\leq L({\cal A}_{\epsilon}(D)) + \epsilon\,. \end{equation} This provides an answer to Open Question 5.1 using this newly presented bound: PAC-Chernoff bounds are perfectly tight for (over-parameterized) interpolators.
The question now is whether this distribution-dependent bound is useful to understand the generalization of algorithms retrieving (over-parameterized) interpolators. A first insight from the above bound is that the (inverse) rate function of the retrieved interpolator defines its generalization error. This is the first (distribution-dependent) complexity measure characterizing the generalization error of an interpolator even in the context of an over-parameterized model class.
Smoother Interpolators Generalize Better
As mentioned in the introduction, an open question in machine learning is the following:
Open Question 5.2. Given two models \(\bm{\theta} \in \bm{\Theta}\) and \(\bm{\theta}' \in \bm{\Theta}'\), both successfully interpolating the training data, which of them generalizes better?
In this section, a formal result is introduced which provides the following answer to the above question: the smoother interpolator is the one that achieves better generalization, provided that it is sufficiently smoother. Smoothness is defined in terms of the rate function of the models. A model \(\bm{\theta}\) is smoother than a model \(\bm{\theta}'\) if, for any \(a>0\), the probability of observing a deviation larger than \(a\) between the expected (test) loss and the empirical (train) loss is consistently smaller for \(\bm{\theta}\) than for \(\bm{\theta}'\). This is formalized as follows.
Definition 5.21. Given a data-generating distribution \(\nu\) and a loss function \(\ell\), a model \(\bm{\theta}\in\bm{\Theta}\) is \(\beta\)-smoother than a model \(\bm{\theta}'\in\bm{\Theta}'\) if \begin{equation} \forall a\in(0,\beta]\quad \mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a)\,. \end{equation}
Notice that if \(\bm{\theta}\) is \(\beta\)-smoother than \(\bm{\theta}'\), it is \(\beta'\)-smoother for any \(\beta' \in (0, \beta]\). Furthermore, if \(\bm{\theta}\) is \(\beta\)-smoother than \(\bm{\theta}'\), it verifies that \(\bm{\theta}'\) cannot be \(\beta'\)-smoother than \(\bm{\theta}\) for any \(\beta' > 0\). This property can be used to reasonably compare degrees of smoothness between models in the same or different model spaces. It is important to notice that the concept of smoothness is defined in the context of a given data-generating distribution and loss function.
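Definition 5.21 can be checked numerically on a grid. The sketch below is a rough illustration only: the two "models" are hypothetical and described solely by the distribution of their loss, the rate functions are grid-search estimates, and the condition is tested on finitely many values of \(a\), so a positive answer is evidence rather than a proof of \(\beta\)-smoothness.

```python
import math

def cumulant(model, lam):
    """J(lam) for a discrete loss given as (values, probabilities)."""
    values, probs = model
    mean = sum(p * v for v, p in zip(values, probs))
    return math.log(sum(p * math.exp(lam * (mean - v))
                        for v, p in zip(values, probs)))

def rate(model, a, lambdas):
    """Grid-search estimate of I(a) = sup_lam (lam * a - J(lam))."""
    return max(lam * a - cumulant(model, lam) for lam in lambdas)

def variance(model):
    """Variance of the loss under the given discrete distribution."""
    values, probs = model
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

def is_beta_smoother(model, other, beta, lambdas, num_points=50):
    """Check I_model(a) >= I_other(a) on a grid of a in (0, beta]."""
    grid = [beta * (i + 1) / num_points for i in range(num_points)]
    return all(rate(model, a, lambdas) >= rate(other, a, lambdas) for a in grid)

# Two hypothetical models, described only by the distribution of their loss.
model_a = ((0.0, 1.0), (0.9, 0.1))  # loss variance 0.09
model_b = ((0.0, 2.0), (0.8, 0.2))  # loss variance 0.64
lambdas = [0.02 * i for i in range(0, 1001)]  # lambda in [0, 20]

print("a smoother than b:", is_beta_smoother(model_a, model_b, 0.1, lambdas))
print("b smoother than a:", is_beta_smoother(model_b, model_a, 0.1, lambdas))
print("variances:", variance(model_a), variance(model_b))
```

In this toy comparison the model with the smaller loss variance is the one whose estimated rate function dominates near the origin.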
According to the above definition, the higher the rate function, the smoother the model when comparing it to others. Furthermore, according to Chernoff’s bound (see Theorem 5.14 and Equation \(\eqref{eq:asympoticEquality}\)), smoother models will have an empirical loss \(\hat{L}(D, \bm{\theta})\) more concentrated around its expected value \(L(\bm{\theta})\). Consequently, large differences between \(\hat{L}(D, \bm{\theta})\) and \(L(\bm{\theta})\) are less likely to be observed for smoother models. In fact, the following result states that there is a correspondence between the variance of the model’s loss and the newly introduced notion of smoothness.
Proposition 5.22. For any \(\bm{\theta} \in \bm{\Theta}\) and \(\bm{\theta}'\in\bm{\Theta}'\), \begin{equation} \exists \beta > 0 \text{ s.t. } \bm{\theta} \text{ is } \beta\text{-smoother than }\bm{\theta}' \iff \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}' ))\,. \end{equation}
Proof
Let \(\bar{Z}_n := \sqrt{n} (L(\bm{\theta}) - \hat{L}(D,\bm{\theta}))\) and let \(J_{\bar{Z}_n}\), \({\cal I}_{\bar{Z}_n}\) and \({\cal I}^{-1}_{\bar{Z}_n}\) denote its cumulant, rate, and inverse rate functions. Then, by properties of the cumulant, it verifies that \(J_{\bar{Z}_n}(\lambda) = nJ_{\bm{\theta}}\big(\tfrac{\lambda}{\sqrt{n}}\big)\). Using the definition of the rate function: \begin{equation} {\cal I}_{\bar{Z}_n}(a) =\sup_{\lambda} \ a\lambda-n J_{\bm{\theta}}\big(\tfrac{\lambda}{\sqrt{n}}\big)=n\sup_{\lambda} \ \tfrac{\lambda}{\sqrt{n}}\tfrac{a}{\sqrt{n}}- J_{\bm{\theta}}\big(\tfrac{\lambda}{\sqrt{n}}\big)=n\mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}})\,. \end{equation} Using a second order Taylor expansion over \(\mathcal{I}_{\bm{\theta}}(a)\) around \(a=0\), where \(\mathcal{I}_{\bm{\theta}}(0) = 0\) and \(\frac{\partial}{\partial a}\mathcal{I}_{\bm{\theta}}(0) = 0\), it verifies that \begin{equation} \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) = \tfrac{1}{2} \tfrac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) \big(\tfrac{a}{ \sqrt{n}}\big)^2 + \big(\tfrac{a}{ \sqrt{n}}\big)^2 h\big(\tfrac{a}{ \sqrt{n}}\big)\,, \end{equation} where \(\lim_{t\to 0} h(t) = 0\). Furthermore, it verifies that \begin{equation} \tfrac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) = \left( \tfrac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(0)\right)^{-1}= \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-1} =: \sigma^{-2}\,. \end{equation} Then, \begin{equation} \label{eq:limitrate} \lim_{n\rightarrow\infty} {\cal I}_{\bar{Z}_n}(a) = \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) = \lim_{n\rightarrow\infty} \Big[ n \frac{1}{2} \sigma^{-2} \big(\tfrac{a}{ \sqrt{n}}\big)^2 + n \big(\tfrac{a}{ \sqrt{n}}\big)^2 h\big(\tfrac{a}{ \sqrt{n}}\big)\Big]=\frac{1}{2} \sigma^{-2} a^2\,. \end{equation}
Let us assume that \(\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}' ))\). Then, the aim is to show that there exists \(\beta > 0\) such that \(\mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a) \quad \forall a \leq \beta\). For a fixed value of \(a\), by the limit in Equation \(\eqref{eq:limitrate}\), it verifies that for any \(\epsilon > 0\), there exist \(n_0(a), n_0'(a) > 0\) such that \begin{equation} \big| n \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) - \tfrac{1}{2}\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-1}a^2\big| < \epsilon \quad \forall n > n_0(a)\,, \end{equation} and \begin{equation} \big| n \mathcal{I}_{\bm{\theta}'}(\tfrac{a}{\sqrt{n}}) - \tfrac{1}{2}\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))^{-1}a^2\big| < \epsilon \quad \forall n > n_0'(a)\,. \end{equation} Let \(\epsilon = \tfrac{1}{4}\big(\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-1} - \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))^{-1}\big)a^2 \geq 0\) and \(n_0^\star(a) = \max\{n_0(a), n_0'(a)\}\). Then, \begin{equation} \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) \geq \mathcal{I}_{\bm{\theta}'}(\tfrac{a}{\sqrt{n}}) \quad \forall n > n_0^\star(a) \,. \end{equation} This implies that \(\mathcal{I}_{\bm{\theta}}(c) \geq \mathcal{I}_{\bm{\theta}'}(c)\), for any \(c \leq a/\sqrt{n_0^\star(a)}\), as there exists \(n = (a/c)^2\) verifying \(c = a/\sqrt{n}\). Consider then \(\beta := \sup_{a> 0 } \{a/\sqrt{n_0^\star(a)}\} > 0\), which might be infinite. It verifies that \begin{equation} \mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a) \quad \forall a \leq \beta\,. \end{equation} Let us now assume that there exists \(\beta > 0\) such that \(\bm{\theta}\) is \(\beta\)-smoother than \(\bm{\theta}'\), or, equivalently, \(\mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a) \quad \forall a \leq \beta\).
Then, for any fixed \(a > 0\), since \(a/\sqrt{n} \leq \beta\) for \(n\) large enough, \begin{equation} \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) \geq \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}'}(\tfrac{a}{\sqrt{n}}) \,. \end{equation} Using the limit in Equation \(\eqref{eq:limitrate}\), \begin{equation} \tfrac{1}{2}\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-1}a^2 \geq \tfrac{1}{2}\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))^{-1}a^2\,, \end{equation} leading to \begin{equation} \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))\,. \end{equation} □
This result shows the relation between the smoothness and the variance of the loss function between a pair of models \(\bm{\theta}\) and \(\bm{\theta}'\). As a result, this notion of smoothness can be intuitively understood as a generalization of the variance of the loss of a model.
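The limit used in the proof of Proposition 5.22 can be checked numerically in a case where the rate function is available in closed form. The sketch below assumes a 0-1 loss with a hypothetical error rate \(q\), for which the Cramér-Chernoff rate function of the deviation \(L(\bm{\theta})-\hat{L}(D,\bm{\theta})\) is the Kullback-Leibler divergence between Bernoulli distributions with parameters \(q-a\) and \(q\), and the variance of the loss is \(q(1-q)\). The values of `q` and `a` are illustrative.

```python
import math

def rate_01(q, a):
    """Rate I(a) of a 0-1 loss with error rate q, valid for 0 <= a < q."""
    p = q - a
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def scaled_rate(q, a, n):
    """n * I(a / sqrt(n)), the quantity whose limit is taken in the proof."""
    return n * rate_01(q, a / math.sqrt(n))

q, a = 0.2, 0.1                              # hypothetical error rate and deviation
limit = a * a / (2.0 * q * (1.0 - q))        # a^2 / (2 * variance of the loss)
for n in (10**2, 10**4, 10**6):
    print(n, scaled_rate(q, a, n), limit)
```

As \(n\) grows, the scaled rate approaches the quadratic term driven by the inverse variance, which is the mechanism that links smoothness to the variance of the loss.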
Using this definition of smoothness and the results presented in Section 5.3.4.1, the following result provides an answer to which interpolator generalizes better.
Theorem 5.23. For any \(\epsilon\geq 0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta} \in \bm{\Theta} \subset \mathbb{R}^p\) and \(\bm{\theta}'\in\bm{\Theta}'\), simultaneously, \begin{equation} \text{if $\hat{L}(D, \bm{\theta})\leq \epsilon$ and $\bm{\theta}$ is \ \(\mathcal{I}^{-1}_{\bm{\theta}} \big(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}\big)\)-smoother than $\bm{\theta}'$, then, $L(\bm{\theta})\leq L(\bm{\theta}')+\epsilon$}\,. \end{equation}
Proof
If \(\bm{\theta}\) is \(\beta\)-smoother than \(\bm{\theta}'\), by Definition 5.21, \(\forall a\in(0,\beta] \quad \mathcal{I}_{\bm{\theta}}(a)\geq \mathcal{I}_{\bm{\theta}'}(a)\), where \(\beta = \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Then, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} As the rate function \(\mathcal{I}_{\bm{\theta}'}(a)\) is invertible and its image lies in \([0,L(\bm{\theta}')-m_{\bm{\theta}'})\), where \(m_{\bm{\theta}'}\geq 0\), due to Assumption 5.9, it verifies that \(\mathcal{I}^{-1}_{\bm{\theta}'}(s)\in[0,L(\bm{\theta}')-m_{\bm{\theta}'})\). In consequence, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq L(\bm{\theta}')\,. \end{equation} As \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\), it verifies \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq L(\bm{\theta}')\). By the PAC-Chernoff bound of Theorem 5.17 and because \(\hat{L}(D,\bm{\theta})\leq \epsilon\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), \begin{equation} L(\bm{\theta})\leq \hat{L}(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} Combining the last two inequalities, \begin{equation} L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \leq \epsilon + L(\bm{\theta}')\,. \end{equation} The statement of the theorem directly derives from the above inequality. □
Note that, due to Proposition 5.22, the smoothness condition given by “\(\mathcal{I}^{-1}_{\bm{\theta}} \big(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}\big)\)-smoother” in Theorem 5.23 can be understood as an inequality between the variances of the loss function under the hypothesis that \(n\) is sufficiently large.
In fact, if \(\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}' ))\), there exists \(\beta > 0\) such that \(\bm{\theta}\) is \(\beta\)-smoother than \(\bm{\theta}'\). Then, if \(n\) is large enough to verify that \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \beta\), it verifies that, with h.p., if \(\hat{L}(D, \bm{\theta})\leq \epsilon\), then, \(L(\bm{\theta})\leq L(\bm{\theta}')+\epsilon\). In short, Theorem 5.23 is a generalization of the following result: if \(n\) is large enough, with h.p., \begin{equation} \text{ if $\hat{L}(D, \bm{\theta})\leq \epsilon$ and } \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}' )) \text{, then, } L(\bm{\theta})\leq L(\bm{\theta}')+\epsilon\,. \end{equation}
Theorem 5.23 states that an interpolator generalizes better than another (with h.p.) if it is sufficiently smoother in terms of its rate function. Figure 5.11 illustrates the premise of this theorem. Notice that the above result holds even for the log-loss, which is the default loss used for training, and for over-parameterized model classes. This result is especially useful when \(\epsilon\) is very small or null, as it states, with h.p., that smooth interpolators generalize better, up to an \(\epsilon\), than other less smooth models, independently of whether these interpolate the data or not. Furthermore, the above result shows that the higher the probability \(1- \delta\) or the number of parameters \(p\), the stronger the smoothness condition needs to be, and the opposite for larger \(n\). In that sense, there might exist interpolators which are \(\beta\)-smoother than others but have worse generalization performance, because they are not smooth enough for Theorem 5.23 to apply. In summary, interpolators with a larger rate function \({\cal I}_{\bm{\theta}}(\cdot)\) or, equivalently, smoother interpolators, are the ones that generalize better.
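The premise of Theorem 5.23 and the first steps of its proof can be traced numerically in a toy setting. The sketch below again assumes a 0-1 loss, so that each model is summarized by a hypothetical error rate and its rate function has the closed Kullback-Leibler form. The complexity level `s`, standing for \(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\), is an arbitrary illustrative value, and the inverse rate is obtained by bisection.

```python
import math

def rate_01(q, a):
    """Rate I(a) of a 0-1 loss with error rate q, for 0 <= a <= q."""
    p = q - a
    if p <= 0.0:
        return -math.log(1.0 - q)  # limiting value of the rate at a = q
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def inverse_rate_01(q, s, iters=80):
    """Smallest a with I(a) >= s, found by bisection and capped at L = q."""
    if s >= -math.log(1.0 - q):
        return q
    lo, hi = 0.0, q
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rate_01(q, mid) < s:
            lo = mid
        else:
            hi = mid
    return hi

def premise_holds(q, q_other, s, points=200):
    """Grid check that model q is I^{-1}_q(s)-smoother than model q_other."""
    beta = inverse_rate_01(q, s)
    grid = [beta * j / points for j in range(1, points + 1)]
    return all(rate_01(q, a) >= rate_01(q_other, a) for a in grid)

s = 0.01  # hypothetical value of (1/n) * log(k^p / delta)
print(premise_holds(0.1, 0.3, s))
print(inverse_rate_01(0.1, s), inverse_rate_01(0.3, s))
```

When the premise holds, the inverse rate of the smoother model is no larger than that of the other model, which in turn is capped by the expected loss of the latter. This is the chain of inequalities the proof combines with the PAC-Chernoff bound.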
Figure 5.12 shows a synthetic example to highlight how the notion of smoothness depends on the specific data-generating distribution, and why one model is smoother than another only relative to such a distribution. The example considers two different data-generating distributions \(\nu_1\) (first row, which adds random noise to a linear function) and \(\nu_2\) (second row, which adds random noise to a complex sinusoidal function). The considered loss is the mean squared error. The introduced notion of smoothness using the rate function shows how \(\bm{\theta}_1\) (a linear model) is smoother than \(\bm{\theta}_2\) (a more complex model) under \(\nu_1\). In fact, under \(\nu_1\), the distribution of \(\hat{L}(D, \bm{\theta}_1)\) is more concentrated around a smaller mean value. However, the second row of this example also shows how the more complex model \(\bm{\theta}_2\) can be smoother than a linear model \(\bm{\theta}_1\) under a different data-generating distribution \(\nu_2\) for exactly the same reasons. This example highlights that a good notion of smoothness (such as the one presented here) should consider the data on which the models are being evaluated, rather than just the complexity of the function they induce.
The results from this section clearly indicate that the generalization error of an interpolator is defined by its level of smoothness and its rate function. The question is whether this theoretical characterization, which relies on distribution-dependent quantities, is useful for understanding the inner workings of current learning techniques and complex phenomena appearing in deep learning. In Section 5.3.5, it is shown how the PAC-Chernoff bound and the novel smoothness criteria are powerful enough to analyze the so-called double-descent phenomenon (Belkin et al., 2019). In Section 5.3.6, the relationship between the smoothness of a model and widely used regularization techniques—such as the parameter norm of a model, distance from initialization, input-gradient norm, and Lipschitz constant—is examined. Many of these quantities have also been previously used as measures of smoothness or complexity of a model (Neyshabur et al., 2017a). In Section 5.3.7, the fact that invariant architectures and data augmentation induce smoother interpolators is studied. Finally, in Section 5.3.8, over-parameterization is revisited as a necessary condition for having smoother interpolators.
Understanding Double-Descent with PAC-Chernoff Bounds
The existing literature consistently demonstrates that interpolators with a larger number of parameters tend to perform better, a finding that defies traditional beliefs from classical statistical learning theory. Traditionally, it was assumed that increasing the number of parameters in a model would lead to more overfitting and, consequently, poor generalization performance. However, this perspective has been challenged by the phenomenon of the double-descent curve (Belkin et al., 2019), which exemplifies a shift in dynamics as models enter the interpolation regime. In this regime, an increase in parameters paradoxically leads to improved performance. This counter-intuitive behavior suggests that once models begin to interpolate, their generalization performance improves as they grow larger.
| Size | Train Acc. | Test Acc. | Test NLL | Bound |
|---|---|---|---|---|
| \(4\) k | \(43.82\%\) | \(42.74\%\) | \(1.56\) | \(2.93\) |
| \(32\) k | \(70.84\%\) | \(59.86\%\) | \(1.17\) | \(2.01\) |
| \(85\) k | \(86.69\%\) | \(62.73\%\) | \(1.35\) | \(1.77\) |
| \(163\) k | \(97.65\%\) | \(63.97\%\) | \(1.82\) | \(1.94\) |
| \(266\) k | \(99.84\%\) | \(65.19\%\) | \(2.15\) | \(2.19\) |
| \(395\) k | \(100.0\%\) | \(65.74\%\) | \(2.28\) | \(2.30\) |
| \(548\) k | \(100.0\%\) | \(67.27\%\) | \(2.10\) | \(2.11\) |
| \(727\) k | \(100.0\%\) | \(69.88\%\) | \(1.84\) | \(1.86\) |
| \(931\) k | \(100.0\%\) | \(69.11\%\) | \(1.83\) | \(1.84\) |
| \(1161\) k | \(100.0\%\) | \(69.82\%\) | \(1.72\) | \(1.74\) |
Figure 5.13 illustrates this phenomenon by showing the training and the test loss of a sequence of convolutional networks with a growing number of parameters (implemented by adding more channels to the layers of a simple convolutional model). More precisely, the number of parameters ranges from \(4k\) to \(1\,161k\). All models were found by running stochastic gradient descent on CIFAR10’s training data until the training loss reached \(0.01\) or until it did not improve for two consecutive epochs of training.
Figure 5.13 also displays the evolution of the PAC-Chernoff bound for the whole range of models. As theoretically shown, this bound is perfectly tight for interpolators. In the classical regime, the bound also presents the so-called double-descent phenomenon, even though its tightness is not guaranteed. This highlights how distribution-dependent bounds are powerful enough to capture the complex dynamics emerging in the interpolation regime.
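The tabulated values above can be inspected directly. The following sketch, which assumes that the table reports the same sequence of models as Figure 5.13, copies the size, train accuracy, test NLL and bound columns and computes the gap between the bound and the test NLL, the first model reaching \(100\%\) train accuracy, and the model with the highest test NLL.

```python
# Columns copied from the table above (sizes in thousands of parameters).
sizes_k = [4, 32, 85, 163, 266, 395, 548, 727, 931, 1161]
train_acc = [43.82, 70.84, 86.69, 97.65, 99.84, 100.0, 100.0, 100.0, 100.0, 100.0]
test_nll = [1.56, 1.17, 1.35, 1.82, 2.15, 2.28, 2.10, 1.84, 1.83, 1.72]
bound = [2.93, 2.01, 1.77, 1.94, 2.19, 2.30, 2.11, 1.86, 1.84, 1.74]

# Gap between the bound and the test NLL for every model.
gaps = [round(b - t, 2) for b, t in zip(bound, test_nll)]

# First model with 100% train accuracy and model with the highest test NLL.
first_interp = next(i for i, acc in enumerate(train_acc) if acc >= 100.0)
peak = test_nll.index(max(test_nll))

for size, gap in zip(sizes_k, gaps):
    print(f"{size:>5}k  gap = {gap:.2f}")
print("first interpolator:", sizes_k[first_interp], "k")
print("test NLL peak:", sizes_k[peak], "k")
```

On these numbers, the gap shrinks steadily as the models approach interpolation and stays within a few hundredths for the interpolating models, while both the test NLL and the bound decrease again after the peak.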
Open Question 5.3. Why does the generalization performance of interpolators improve with an increasing number of parameters?
The complexity term of the PAC-Chernoff bound \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta})\) monotonically increases with the size of the model class, which, in principle, would contradict the double-descent phenomenon observed on the bound and the fact that the generalization error is reduced as the number of parameters increases. The following result, viewed as an extension of Theorem 5.23, offers a partial explanation for this puzzling phenomenon. More precisely, it demonstrates that if the interpolators become progressively smoother (higher rate function), their generalization error will be reduced, despite being part of a model class characterized by a greater number of parameters.
Theorem 5.24. Let \(\bm{\Theta}\subset \bm{\Theta}'\) be two nested model classes with \(p < p'\) parameters respectively. For any \(\epsilon>0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), for any \(\bm{\theta}'\in\bm{\Theta}'\), \(\bm{\theta}\in\bm{\Theta}\), simultaneously \begin{equation} \begin{aligned} \text{ if } \hat{L}(D, \bm{\theta}')\leq\epsilon \,,\, \hat{L}(D, \bm{\theta}) \leq \epsilon \, \text{ and }\, \bm{\theta}' \, &\text{ is } \,\mathcal{I}^{-1}_{\bm{\theta}'}\big(\tfrac{1}{n}\log \tfrac{k^{p'}}{\delta}\big)\text{-smoother than } \, \bm{\theta}\\ &\Downarrow\\ L(\bm{\theta}')\leq &\, L(\bm{\theta})+\epsilon\,. \end{aligned} \end{equation}
Proof
If \(\bm{\theta}'\) is \(\beta\)-smoother than \(\bm{\theta}\), by Definition 5.21, \(\forall a\in(0,\beta] \quad \mathcal{I}_{\bm{\theta}'}(a)\geq \mathcal{I}_{\bm{\theta}}(a)\), where \(\beta = \mathcal{I}^{-1}_{\bm{\theta}'}(\frac{1}{n}\ln\frac{k^{p'}}{\delta})\). Then, we have that \(\mathcal{I}^{-1}_{\bm{\theta}'}(s)\leq \mathcal{I}^{-1}_{\bm{\theta}}(s)\) for \(s=\mathcal{I}_{\bm{\theta}'}(\beta) = \mathcal{I}_{\bm{\theta}'}\Big(\mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\Big) = \tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta}\). Thus, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq L(\bm{\theta})\,, \end{equation} where we used that the inverse rate of \(\bm{\theta}\) is upper bounded by \(L(\bm{\theta})\). By Theorem 5.17 applied to \(\bm{\theta}'\in\bm{\Theta}'\), and because \(\hat{L}(D,\bm{\theta}')\leq \epsilon\), we have with h.p. \(1-\delta\) over \(D\sim\nu^n\), \begin{equation} L(\bm{\theta}')\leq \hat{L}(D,\bm{\theta}') + \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\,. \end{equation} By combining the last two inequalities, we have \begin{equation} L(\bm{\theta}')\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta}) \leq \epsilon + L(\bm{\theta})\,. \end{equation} □
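The trade-off behind Theorem 5.24, a larger complexity level against a smoother model, can be illustrated with a deliberately simplified surrogate. The sketch below assumes a Gaussian-type cumulant \(J(\lambda)=\sigma^2\lambda^2/2\), for which \(\mathcal{I}(a)=a^2/(2\sigma^2)\) and \(\mathcal{I}^{-1}(s)=\sigma\sqrt{2s}\). This is an illustrative assumption, not the cumulant of the networks in the experiments, and it ignores the cap of the inverse rate at the expected loss. All numeric values (`n`, `k`, `delta`, parameter counts and standard deviations) are hypothetical.

```python
import math

def complexity_level(n, p, k, delta):
    """(1/n) * ln(k^p / delta), evaluated in log space to avoid overflow."""
    return (p * math.log(k) - math.log(delta)) / n

def gaussian_inverse_rate(sigma, s):
    """Inverse rate when J(lam) = sigma^2 * lam^2 / 2, i.e. I(a) = a^2 / (2 sigma^2)."""
    return sigma * math.sqrt(2.0 * s)

n, k, delta = 50_000, 2, 0.05                    # hypothetical values
s_small = complexity_level(n, 1_000, k, delta)   # smaller model class
s_large = complexity_level(n, 4_000, k, delta)   # larger, nested model class

small_model = gaussian_inverse_rate(0.5, s_small)
same_smoothness = gaussian_inverse_rate(0.5, s_large)   # larger class, same loss spread
smoother_large = gaussian_inverse_rate(0.2, s_large)    # larger class, smaller loss spread

print(s_small, s_large)
print(small_model, same_smoothness, smoother_large)
```

With equal smoothness, the larger class pays a larger complexity term. A sufficiently smoother model in the larger class can nonetheless end up with a smaller one, which is the situation the theorem formalizes.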
Figure 5.14 (right) illustrates the rate functions of the interpolators from Figure 5.13, highlighting how larger interpolators become increasingly smoother, that is, their rate functions are progressively higher. This explains how it is possible to obtain larger interpolators with smaller generalization error. In fact, by reversing the implication of Theorem 5.24, if a larger interpolator exhibits better generalization performance, then it cannot be less smooth than a smaller interpolator.
Corollary 5.25. Let \(\bm{\Theta}\subset \bm{\Theta}'\) be two nested model classes with \(p < p'\) parameters respectively. For any \(\epsilon>0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), for any \(\bm{\theta}'\in\bm{\Theta}'\), \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \begin{aligned} \hat{L}(D, \bm{\theta}')\leq\epsilon, \ \hat{L}(D, \bm{\theta})\leq&\,\epsilon \text{ and } L(\bm{\theta}') + \epsilon < L(\bm{\theta})\\ &\Downarrow\\ \bm{\theta} \text{ is not } \mathcal{I}_{\bm{\theta}}^{-1}\big(\tfrac{1}{n}\log&\tfrac{k^{p'}}{\delta}\big)\text{-smoother than } \bm{\theta}'\,. \end{aligned} \end{equation}
Proof
We can apply Theorem 5.17 to \(\bm{\theta}\) as an element of \(\bm{\Theta}'\), because \(\bm{\Theta}\subset \bm{\Theta}'\). In consequence, with h.p., we have \begin{equation} L(\bm{\theta}) \leq \hat{L}(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\,. \end{equation} Using the fact that \(\hat{L}(D,\bm{\theta})\leq\epsilon\), and that the inverse rate is bounded by the expected loss, we arrive at the following inequality \begin{equation} L(\bm{\theta}) \leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \epsilon + L(\bm{\theta})\,. \end{equation} The same reasoning applies to \(\bm{\theta}'\): \begin{equation} L(\bm{\theta}') \leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \epsilon + L(\bm{\theta}')\,. \end{equation} These two h.p. upper bounds can be chained at no extra cost, as both apply to the same larger model class \(\bm{\Theta}'\). That is, it is the same upper bound, which holds simultaneously for all models within \(\bm{\Theta}'\), used on two different models \(\bm{\theta}\) and \(\bm{\theta}'\). Using the corollary’s premise, \(L(\bm{\theta}') + \epsilon< L(\bm{\theta})\), we obtain \(\mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta}) \leq L(\bm{\theta}') < L(\bm{\theta}) - \epsilon \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\), that is, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})< \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\,. \end{equation} Naming \(a=\mathcal{I}^{-1}_{\bm{\theta}}(\frac{1}{n}\ln\frac{k^{p'}}{\delta})\), such that \(\frac{1}{n}\ln\frac{k^{p'}}{\delta}=\mathcal{I}_{\bm{\theta}}(a)\), we get \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\mathcal{I}_{\bm{\theta}}(a))< \mathcal{I}^{-1}_{\bm{\theta}}(\mathcal{I}_{\bm{\theta}}(a))= a\,. \end{equation} Applying \(\mathcal{I}_{\bm{\theta}'}(\cdot)\) to both sides: \begin{equation} \mathcal{I}_{\bm{\theta}'}(\mathcal{I}^{-1}_{\bm{\theta}'}(\mathcal{I}_{\bm{\theta}}(a))) = \mathcal{I}_{\bm{\theta}}(a) < \mathcal{I}_{\bm{\theta}'}(a)\,. \end{equation} This is the negation of the smoothness condition at \(a\), that is, \(\neg\Big[ \mathcal{I}_{\bm{\theta}}(a)\geq \mathcal{I}_{\bm{\theta}'}(a)\Big]\). Hence, \(\bm{\theta}\) is not \(\beta\)-smoother than \(\bm{\theta}'\) for any \(\beta\geq a\), which is equivalent to saying that \(\bm{\theta}\) is not \(\beta\)-smoother than \(\bm{\theta}'\) for any \(\beta\) such that \(\mathcal{I}_{\bm{\theta}}(\beta)\geq \frac{1}{n}\ln\frac{k^{p'}}{\delta}\). □
Consequently, Open Question 5.3 can be reformulated in terms of model smoothness rather than generalization performance.
Open Question 5.4. Why does the smoothness of interpolators improve with an increasing number of parameters?
Based on Theorem 5.24 and Corollary 5.25, addressing Open Question 5.4 is equivalent to answering Open Question 5.3. In other words, by answering why the rate function of the interpolators becomes increasingly higher as the number of parameters in nested model classes increases, Theorem 5.24 effectively explains the double-descent phenomenon and, in particular, why the generalization performance of interpolators improves as the number of parameters grows. However, to the best of current knowledge, there is no direct answer to this question without introducing additional hypotheses.
The proposed rationale for explaining this phenomenon posits that, by increasing the number of parameters, larger neural networks are better at capturing invariances in the data, a notion that has been supported both theoretically (Bronstein et al., 2021) and empirically (Goodfellow et al., 2009) by the machine learning community. Additionally, as will be shown in Section 5.3.7, this capability of capturing invariances leads to smoother models with higher rate functions. Combining these two factors, Theorem 5.24 can help to elucidate why the generalization performance of interpolators improves with an increase in parameters. The experimental setup presented in Figures 5.13 and 5.14 supports this understanding.
Finally, it is also interesting to look at the models to the left of the interpolation threshold in Figure 5.13. Surprisingly, the sequence of rate functions, displayed in Figure 5.14 (left), gets increasingly lower as the size of the model class increases, just the opposite of what happens in Figure 5.14 (right). In the classical regime, increasing the number of parameters leads to less smooth models with higher generalization error. Explaining this behavior in the non-interpolation regime is out of the scope of this work, as the presented PAC-Chernoff bound is not tight in that regime and no theoretical guarantees can be derived in such a framework. Nevertheless, it is worth noticing that, even without tightness guarantees, the proposed bound is close enough to showcase the double-descent phenomenon itself.
In the following section, the proposed oracle bound is connected with many existing regularization techniques, which makes it possible to explain their effectiveness in terms of generalization within a unified framework.
Explicit Regularization
Structural risk minimization (Shawe-Taylor et al., 1998) is a learning principle based on regularizers that penalize the complexity of a model when minimizing the training loss. Let \(r(\bm{\theta})\) denote a regularizing function; the learning objective is then given by \begin{equation} \label{eq:structuralriskminimization} \min_{\bm{\theta}\in\bm{\Theta}} \ \hat{L}(D, \bm{\theta}) + r(\bm{\theta})\,. \end{equation} High-probability bounds of the form \(L(\bm{\theta})\leq \hat{L}(D, \bm{\theta}) + {\cal C}(\bm{\theta},D,\delta)\) have been traditionally used to justify the use of regularizers and to derive novel ones. According to these bounds, an obvious choice for regularizers is the complexity term, \(r(\bm{\theta}) = {\cal C}(\bm{\theta},D,\delta)\). Then, the structural risk minimization problem given in Equation \(\eqref{eq:structuralriskminimization}\) would correspond to minimizing a high-probability upper bound over the expected loss \(L(\bm{\theta})\). Many regularizing methods have been studied from this perspective. One of the most common ones is the \(\ell^2\)-norm regularization, which recurrently shows up within the complexity term of many generalization bounds (Bartlett et al., 2017). However, as discussed in the introduction, these bounds are known to be loose in the context of over-parameterized model classes interpolating the data. In consequence, the rationale for using their associated complexity measures \({\cal C}(\bm{\theta},D,\delta)\) as regularizers is no longer as strong as it used to be in learning setups where the model class is not over-parameterized.
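As a concrete instance of the objective in Equation \(\eqref{eq:structuralriskminimization}\), the sketch below takes a one-parameter linear model with squared loss and the \(\ell^2\)-norm regularizer \(r(\theta)=\texttt{reg}\cdot\theta^2\), for which the minimizer is available in closed form. The data, the model and the regularization strengths are toy assumptions chosen purely for illustration.

```python
def ridge_fit_1d(xs, ys, reg):
    """Minimise (1/n) * sum (y - theta * x)^2 + reg * theta^2 over a scalar theta."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    # Setting the derivative to zero gives theta * (sxx + reg) = sxy.
    return sxy / (sxx + reg)

def objective(theta, xs, ys, reg):
    """Training loss plus regulariser, i.e. the structural risk minimisation objective."""
    n = len(xs)
    train_loss = sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / n
    return train_loss + reg * theta ** 2

xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]   # toy data, purely illustrative
for reg in (0.0, 0.1, 1.0):
    theta = ridge_fit_1d(xs, ys, reg)
    print(reg, theta, objective(theta, xs, ys, reg))
```

Increasing `reg` shrinks the fitted parameter towards zero, trading a higher training loss for a smaller value of the regularizer.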
Open Question 5.5. Is there a regularizer \(r(\bm{\theta})\) that ensures near-optimal performance for over-parameterized model classes interpolating the training data?
The following result provides an answer to Open Question 5.5. It shows that the complexity measure of the PAC-Chernoff bound given in Theorem 5.17 can be used as a regularizer. That is, using the inverse rate function \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta})\), an optimal regularizer for over-parameterized interpolators is obtained. More precisely, for any \(\epsilon>0\), let \(\bm{\theta}^\star_\epsilon\) and \(\bm{\theta}^{\times}_\epsilon\) denote the interpolator with the best generalization performance and the interpolator with the smallest inverse rate function, respectively, \begin{equation} \label{eq:optimalinterpolators} \bm{\theta}^\star_\epsilon = \argmin_{\bm{\theta}\,:\,\hat{L}(D, \bm{\theta})\, \leq \, \epsilon}\, L(\bm{\theta})\,, \quad\quad \bm{\theta}^{\times}_\epsilon = \argmin_{\bm{\theta}\,:\,\hat{L}(D, \bm{\theta})\, \leq \, \epsilon} \, \hat{L}(D, \bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta})\,. \end{equation} With h.p., the expected loss of the interpolator with the smallest inverse rate function \(\bm{\theta}^{\times}_\epsilon\) is very close to the expected loss of the optimal interpolator, \(\bm{\theta}^\star_\epsilon\).
Theorem 5.26. For any \(\epsilon>0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), \(|L(\bm{\theta}^\star_\epsilon) - L(\bm{\theta}^{\times}_\epsilon)|\leq \epsilon\).
Proof
From the definitions of \(\bm{\theta}^{\times}_\epsilon\) and \(\bm{\theta}^\star_\epsilon\), given by \begin{equation} \bm{\theta}^\star_\epsilon = \argmin_{\bm{\theta}\,:\,\hat{L}(D,\bm{\theta})\, \leq \, \epsilon}\, L(\bm{\theta})\,, \quad\quad \bm{\theta}^{\times}_\epsilon = \argmin_{\bm{\theta}\,:\,\hat{L}(D,\bm{\theta})\, \leq \, \epsilon} \, \hat{L}(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,, \end{equation} it is clear that \(L(\bm{\theta}^\star_\epsilon) \leq L(\bm{\theta}^{\times}_\epsilon)\). On the other hand, because for any \(s\geq 0\), \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\in[0,L(\bm{\theta})-m_{\bm{\theta}})\) (Proposition 5.13), it verifies that \(\forall\bm{\theta}\in\bm{\Theta}\): \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D,\bm{\theta}) \leq L(\bm{\theta}) +\hat{L}(D,\bm{\theta})\,. \end{equation} By definition of \(\bm{\theta}^{\times}_\epsilon\) and \(\bm{\theta}^{\star}_\epsilon\), \begin{equation} \mathcal{I}_{\bm{\theta}^{\times}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\times}_\epsilon) \leq \mathcal{I}_{\bm{\theta}^{\star}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\star}_\epsilon)\,, \end{equation} which gives \begin{equation} \mathcal{I}_{\bm{\theta}^{\times}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\times}_\epsilon) \leq L(\bm{\theta}^{\star}_\epsilon) + \hat{L}(D, \bm{\theta}^{\star}_\epsilon) \leq L(\bm{\theta}^{\star}_\epsilon) + \epsilon\,. \end{equation} This, in combination with the PAC-Chernoff bound of Theorem 5.17, gives \begin{equation} L(\bm{\theta}^{\times}_\epsilon) \leq \mathcal{I}_{\bm{\theta}^{\times}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\times}_\epsilon) \leq L(\bm{\theta}^{\star}_\epsilon) + \epsilon\,. \end{equation} From this, \(L(\bm{\theta}^{\times}_\epsilon)\leq L(\bm{\theta}^\star_\epsilon) +\epsilon\).
Thus, \(L(\bm{\theta}^\star_\epsilon) \leq L(\bm{\theta}^{\times}_\epsilon)\) and \(L(\bm{\theta}^{\times}_\epsilon)\leq L(\bm{\theta}^\star_\epsilon) +\epsilon\), finishing the proof. □
In simpler terms, favoring models with a small inverse rate function is the same as favoring models with a larger rate function, which essentially means choosing smoother models. This finding highlights that the smoothest interpolator not only fits the data but also performs nearly optimally when it comes to generalization. The inverse rate function \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta})\) is an optimal regularizer for over-parameterized interpolators.
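The selection rule defining \(\bm{\theta}^{\times}_\epsilon\) can be sketched in code, with one important caveat: the inverse rate is an oracle quantity that depends on the unknown distribution \(\nu\). The snippet below replaces it with a plug-in estimate computed from per-example losses on held-out data, which is an assumption of this sketch and not part of the theoretical result. The candidate format, the names `A`, `B`, `C`, and all numeric values are hypothetical.

```python
import math

def empirical_cumulant(losses, lam):
    """Plug-in estimate of J(lam) = ln E[exp(lam * (L - loss))] from loss samples."""
    mean = sum(losses) / len(losses)
    return math.log(sum(math.exp(lam * (mean - loss)) for loss in losses) / len(losses))

def empirical_inverse_rate(losses, s, lam_max=50.0, steps=2000):
    """Grid approximation of inf_{lam > 0} (s + J(lam)) / lam."""
    return min(
        (s + empirical_cumulant(losses, lam_max * i / steps)) / (lam_max * i / steps)
        for i in range(1, steps + 1)
    )

def select_smoothest(candidates, s, eps):
    """Among candidates with train loss <= eps, minimise train loss + inverse rate.

    candidates: list of (name, train_loss, held_out_losses) tuples (hypothetical format).
    """
    feasible = [c for c in candidates if c[1] <= eps]
    return min(feasible, key=lambda c: c[1] + empirical_inverse_rate(c[2], s))[0]

candidates = [
    ("A", 0.01, [0.1, 0.1, 0.1, 0.1]),   # constant held-out loss: zero variance
    ("B", 0.01, [0.0, 0.0, 0.0, 0.4]),   # same mean loss, higher variance
    ("C", 0.50, [0.1, 0.1, 0.1, 0.1]),   # train loss above eps, filtered out
]
print(select_smoothest(candidates, s=0.05, eps=0.01))
```

Between the two feasible candidates, which share the same train loss and the same mean held-out loss, the rule prefers the one whose losses are more concentrated, in line with the interpretation of the inverse rate as a smoothness-based regularizer.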
Connecting the Inverse Rate with Existing Regularization Techniques
As shown in the previous sections, the inverse rate function, a distribution-dependent and oracle element, is an optimal regularizer for interpolators. This section discusses how a wide range of existing regularizers are tightly connected to the inverse rate and, in consequence, to the proposed definition of smoothness. The aim of this thesis is not to fully explore these connections, nor to claim that the newly derived bounds are better than existing ones. The main focus is to show how the (inverse) rate function of a model can be used to unify many existing regularization techniques which were previously assumed to be unrelated.
\(\ell_2\)-Norm Regularization
This regularizer, also known as weight decay, is widely used by the machine learning community and it is known to improve generalization even in over-parameterized model classes interpolating the training data (Goodfellow et al., 2016). The following result defines a connection between the inverse rate of a model and its norm, under the widely used assumption that the loss function \(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\) is Lipschitz with respect to the parameters of the model \(\bm{\theta}\) (Li & Orabona, 2019).
Proposition 5.27. If the loss function \(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\) is Lipschitz w.r.t. \(\bm{\theta}\) with constant \(M > 0\), then, for any \(\bm{\theta}_0 \in \bm{\Theta}_0 = \{\bm{\theta} \in \bm{\Theta} \ | \ \mathbb{V}_{\nu}(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) = 0\}\), it verifies that \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \sqrt{2Ma}\ \|\bm{\theta} - \bm{\theta}_0\|_2\,, \end{equation} where \(a = \min\big(1 , \tfrac{1}{n}\log\tfrac{k^p}{\delta} \big)\).
Proof
If the loss is Lipschitz continuous with constant \(M\), \(\forall \mathbf{y},\mathbf{x},\bm{\theta}\quad \|\nabla_{\bm{\theta}} \ell(\mathbf{y},\mathbf{x},\bm{\theta})\|^2_2\leq M\). Then, \(J_{\bm{\theta}}(\lambda)\) satisfies \begin{align} \|\nabla_{\bm{\theta}}J_{\bm{\theta}}(\lambda)\|^2_2&=\|-\lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] + \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}) \right]\|_2^2\\ &\leq \lambda^2 \mathbb{E}_{\nu p^\lambda}\left[\|\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\|_2^2\right] + \lambda^2\mathbb{E}_\nu\left[\|\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}) \|_2^2\right]\\ &\leq 2M \lambda^2\,, \end{align} where \(\mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] = \frac{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})]}{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda]}\). With this, \begin{equation} |J_{\bm{\theta}}(\lambda) - J_{\bm{\theta}_0}(\lambda)|\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2 \implies J_{\bm{\theta}}(\lambda)\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2\,. \end{equation} Then, for any \(a \geq 0\), it holds that \(\frac{a+J_{\bm{\theta}}(\lambda)}{\lambda}\leq \frac{a+2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2}{\lambda}\); where by definition of the inverse rate, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(a) \leq \frac{a+J_{\bm{\theta}}(\lambda)}{\lambda}\leq \frac{a+2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2}{\lambda}\,. \end{equation} As the inequality holds for any \(\lambda > 0\), take the one that minimizes the r.h.s., leading to \(\mathcal{I}^{-1}_{\bm{\theta}}(a)\leq \sqrt{2M a}\|\bm{\theta}-\bm{\theta}_0\|_2\).
On the other hand, \(\mathcal{I}^{-1}_{\bm{\theta}}(a)\) is Lipschitz with constant \(M\) as \begin{equation} \| \nabla_{\bm{\theta}} \mathcal{I}^{-1}_{\bm{\theta}}(a)\|^2_2 = \Big\|\frac{\nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda^\star_a)}{\lambda^\star_a}\Big\|^2_2\leq \frac{\|\nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda^\star_a)\|^2_2}{(\lambda^{\star}_a)^2}\leq 2 M\,. \end{equation} In consequence, \(\textstyle (\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})-\mathcal{I}^{-1}_{\bm{\theta}_0}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}))^2 \leq 2M \|\bm{\theta} - \bm{\theta}_0\|_2^2\). Since \(J_{\bm{\theta}_0}(\lambda)=0\) for \(\bm{\theta}_0\in\bm{\Theta}_0\), the inverse rate of \(\bm{\theta}_0\) is zero, which implies that \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \leq \sqrt{2M} \|\bm{\theta} - \bm{\theta}_0\|_2\). Thus, it simultaneously holds that \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \leq \sqrt{2M} \|\bm{\theta} - \bm{\theta}_0\|_2 \quad \text{and} \quad \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \sqrt{2M \tfrac{1}{n}\ln\tfrac{k^p}{\delta}}\|\bm{\theta}-\bm{\theta}_0\|_2\,. \end{equation} Taking the smaller of the two bounds gives the statement with \(a = \min\big(1 , \tfrac{1}{n}\ln\tfrac{k^p}{\delta} \big)\). □
In many common machine learning models, the null vector is a model with null variance, that is, \(\mathbf{0} \in\bm{\Theta}_0\). For example, in supervised classification problems with \(K\) labels, a neural network with null weights has a constant loss equal to \(\log K\). In these cases, the above result shows that by promoting models with small parameter norm, models with small inverse rate are achieved. In consequence, the \(\ell_2\)-norm is a proxy to minimize the inverse rate and works as a regularizer. Figure 5.9 illustrates an example of how the use of \(\ell_2\)-norm regularization leads to models with small parameter norm and larger rate function. This aligns with the existing evidence in the literature about the use of \(\ell_2\)-norm for successfully regularizing interpolators.
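The claim that the null vector has zero loss variance can be checked numerically. The following is a minimal sketch, assuming a linear softmax classifier and synthetic data; the helper `cross_entropy` and all sizes are illustrative choices, not the models used in the experiments of this chapter.

```python
import numpy as np

def cross_entropy(W, b, X, y):
    """Per-sample log-loss of a linear softmax classifier (hypothetical helper)."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y]

rng = np.random.default_rng(0)
K, d, n = 10, 5, 1000                 # labels, input dimension, samples (assumed)
X = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)

# With null weights every logit is zero, so the prediction is uniform over K labels.
losses = cross_entropy(np.zeros((d, K)), np.zeros(K), X, y)
print(losses.mean(), np.log(K), losses.var())
```

Every sample incurs the same loss, \(\log K\), so the variance of the loss vanishes up to floating-point error regardless of the data.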
However, in the context of over-parameterized models interpolating the training data, it is well known that the \(\ell_2\)-norm of the parameters of the model does not correlate well with the generalization error (Jiang et al., 2020). Consider that the \(\ell_2\)-norm just defines an upper bound over the inverse rate, which, as shown in Section 5.3.4.1 is an optimal proxy for the generalization of interpolators. The gap of this upper bound is the reason behind the failure of \(\ell_2\)-norm as a proxy for the generalization error of an interpolator. In fact, an open question in machine learning is:
Open Question 5.6. When does the norm of an interpolator correlate with its generalization error?
The following result provides a characterization of a learning setup where these two elements are correlated.
Corollary 5.28. Considering the log-loss, if it belongs to the exponential family with a constant base measure, that is, there exist \(s:\mathcal{X}\times \mathcal{Y} \to \mathbb{R}^p\), \(a:\bm{\Theta} \to \mathbb{R}\) and \(k \in \mathbb{R}\) such that \begin{equation} \ell(\mathbf{y},\mathbf{x},\bm{\theta}) := -\log p(\mathbf{y}|\mathbf{x}, \bm{\theta}) = \bm{\theta}^T s(\mathbf{x}, \mathbf{y}) - a(\bm{\theta}) + k\quad \forall (\mathbf{x}, \mathbf{y}) \sim \nu\,. \end{equation} Then, \(\forall \epsilon > 0\) exists \(n_0 > 0\) such that \(\forall n > n_0\): \begin{equation} \left|\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) - \sqrt{2\tfrac{1}{n}\log\tfrac{k^p}{\delta}} \sqrt{\bm{\theta}^T \text{Cov}_\nu(s(\mathbf{y}, \mathbf{x})) \bm{\theta}}\right|\leq \epsilon\,, \end{equation} where \(\text{Cov}_{\nu}(\cdot)\) is the covariance w.r.t. \(\nu\) of the sufficient statistics of each \((\mathbf{x},\mathbf{y})\) sample.
Proof
Using Theorem 5.19, we have that \(\lim_{n\rightarrow\infty} \ \sqrt{n}\, \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) = \sqrt{2\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\ln \tfrac{k^p}{\delta}}\). The proof concludes from the fact that, for the exponential family, \begin{equation} \mathbb{V}_\nu\big(\ell(\mathbf{y},\mathbf{x},\bm{\theta}) \big) = \mathbb{V}_\nu\big(\bm{\theta}^T s(\mathbf{y},\mathbf{x}) - a(\bm{\theta}) + k\big) = \bm{\theta}^T\text{Cov}_{\nu} \big(s(\mathbf{y},\mathbf{x})\big) \bm{\theta}\,. \end{equation} □
This result shows that under the exponential family and a large enough number of data samples, the inverse rate is close to the scaled Riemannian norm of the model parameters under the positive definite matrix \(\text{Cov}_\nu(s(\mathbf{y}, \mathbf{x}))\) that defines the Riemannian metric. Note that, in this case, the Riemannian norm will be an optimal regularizer as Theorem 5.26 applies and the inverse rate is a perfectly tight proxy for the generalization error of interpolators as shown in Section 5.3.4.1. The \(\ell_2\)-norm is obtained when the covariance matrix is a diagonal matrix with constant entries, a condition that requires the statistical independence of the components of the sufficient statistics \(s(\mathbf{y}, \mathbf{x})\) with respect to the data-generating distribution.
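The gap between the Riemannian norm and the \(\ell_2\)-norm can be illustrated numerically. The sketch below assumes a two-dimensional exponential-family loss \(\bm{\theta}^T s\) with correlated, bounded sufficient statistics, estimates the cumulant by Monte Carlo, and approximates the infimum over \(\lambda\) on a grid. The helpers `cumulant` and `inverse_rate`, the mixing matrix, and the value standing in for \(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\) are all hypothetical.

```python
import numpy as np

def cumulant(losses, lam):
    """Empirical J(lam) = lam * E[loss] + log E[exp(-lam * loss)]."""
    z = -lam * losses
    log_mgf = z.max() + np.log(np.mean(np.exp(z - z.max())))
    return lam * losses.mean() + log_mgf

def inverse_rate(losses, s, lambdas):
    """Grid approximation of inf_{lam > 0} (s + J(lam)) / lam."""
    return min((s + cumulant(losses, lam)) / lam for lam in lambdas)

rng = np.random.default_rng(0)
# Correlated sufficient statistics: mix @ u, with u uniform on [-1, 1]^2.
mix = np.array([[1.0, 0.0], [0.9, np.sqrt(1.0 - 0.9**2)]])
stats = rng.uniform(-1.0, 1.0, size=(100_000, 2)) @ mix.T
cov = mix @ mix.T / 3.0            # exact covariance, since Var(u_i) = 1/3

s_val = 1e-3                       # stands in for (1/n) log(k^p / delta)
lambdas = np.geomspace(1e-3, 10.0, 200)
thetas = {"aligned": np.array([1.0, 1.0]) / np.sqrt(2.0),
          "opposed": np.array([1.0, -1.0]) / np.sqrt(2.0)}

results = {}
for name, theta in thetas.items():
    losses = stats @ theta         # additive constants do not change J
    numeric = inverse_rate(losses, s_val, lambdas)
    riemannian = np.sqrt(2.0 * s_val * theta @ cov @ theta)
    results[name] = (numeric, riemannian)
    print(name, numeric, riemannian)
```

Both parameter vectors have unit \(\ell_2\)-norm, yet their inverse rates differ markedly because the covariance of the sufficient statistics is far from a multiple of the identity; in each case the numerical value tracks the Riemannian expression.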
In short, Proposition 5.27 shows that the \(\ell_2\)-norm of the parameters of the model upper bounds its inverse rate, which explains why biasing the optimizer towards models with smaller \(\ell_2\)-norm tends to reduce the generalization error of interpolators. However, Corollary 5.28 provides a general setting in which to understand the gap in the inequality of Proposition 5.27, and shows how this gap can manifest if \(\text{Cov}_\nu(s(\mathbf{y}, \mathbf{x}))\), a distribution-dependent quantity, is very different from a diagonal matrix.
Distance From Initialization
The distance from initialization is known to be related to the generalization error of an interpolator (Nagarajan & Kolter, 2017). In fact, it has been successfully used as a regularizer in (Hu et al., 2020). In these works, an implicit assumption was that the initial models are produced through one of the well-established initialization schemes used in neural networks, such as Xavier or He initialization (Goodfellow et al., 2016). These parameter initialization schemes provide an initial set of weights which guarantee that the variance of the activations across the different layers remains constant. In consequence, the randomly initialized model makes almost constant predictions. If \(\bm{\theta}_0\) denotes the initial parameter, then, \(\mathbb{V}_{\nu}(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}_0))\) should be very small. Figure 5.10 (right) shows how this is the case for a randomly initialized Inception model using He initialization.
Under the assumption that the initial model obtained by common initializers belongs to \(\bm{\Theta}_0\), Proposition 5.27 also shows a link between distance from initialization and the inverse rate function and, in consequence, with the generalization error of a model. The same reasoning used with \(\ell_2\)-norm applies here: by promoting models with small distance from initialization, models with small generalization error are promoted.
Furthermore, as happens with the \(\ell_2\)-norm, in the context of over-parameterized interpolators, distance from initialization is not a reliable proxy for the generalization error (Jiang et al., 2020). Again, this is easy to explain considering that distance from initialization is just an upper bound over the inverse rate. In consequence, an open question in machine learning is:
Open Question 5.7. When does the distance from initialization of an interpolator perfectly correlate with its generalization error?
When the cumulant generating function of the model \(J_{\bm{\theta}}(\lambda)\) is a quadratic function or, equivalently, its second-order Taylor approximation around \(\bm{\theta}_0\in\bm{\Theta}_0\) is exact, as shown in the proof of Proposition 5.29, the cumulant generating function can be expressed as \begin{equation} \label{eq:secondorder} J_{\bm{\theta}}(\lambda)=\frac{1}{2}\lambda^2\big(\bm{\theta} - \bm{\theta}_0\big)^T \text{Cov}_{\nu} \big(\nabla_{\bm{\theta}} \log p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)\big)\big(\bm{\theta} - \bm{\theta}_0\big)\,, \end{equation} where \(\text{Cov}_{\nu}(\cdot)\) is the covariance w.r.t. \(\nu\) of the gradient of the log-likelihood of each \((\mathbf{x},\mathbf{y})\) sample. Under this hypothesis, the inverse rate can be expressed as stated next.
Proposition 5.29. For any \(\bm{\theta} \in \bm{\Theta}\), if the equality of Equation \(\eqref{eq:secondorder}\) holds, then \begin{equation} \textstyle \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) = \sqrt{2\tfrac{1}{n}\log\tfrac{k^p}{\delta}}\sqrt{\big(\bm{\theta} - \bm{\theta}_0\big)^T \text{Cov}_{\nu} \big(\nabla_{\bm{\theta}} \log p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)\big)\big(\bm{\theta} - \bm{\theta}_0\big)}\,. \end{equation}
Proof
A second-order Taylor expansion of \(J_{\bm{\theta}}(\lambda)\) w.r.t. \(\bm{\theta}\) centered at \(\bm{\theta}_0\) is \begin{equation} J_{\bm{\theta}_0}(\lambda) + \nabla_{\bm{\theta}} J_{\bm{\theta}_0}(\lambda)^T(\bm{\theta}-\bm{\theta}_0) + \frac{1}{2} (\bm{\theta}-\bm{\theta}_0)^T\nabla_{\bm{\theta} \bm{\theta}}J_{\bm{\theta}_0}(\lambda)(\bm{\theta}-\bm{\theta}_0)\,. \end{equation} Since the loss at \(\bm{\theta}_0\) has zero variance, the cumulant generating function satisfies \(J_{\bm{\theta}_0}(\lambda)=0\). The gradient \(\nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda)\) can be expressed as \begin{equation} \nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda) = \lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta})\right] - \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta}) \right]\,, \end{equation} where \(\nu p^{\lambda}\) denotes \(\nu p^{\lambda}(\mathbf{y},\mathbf{x}) = \frac{\nu(\mathbf{y},\mathbf{x})p(\mathbf{y}|\mathbf{x},\bm{\theta})^{\lambda}}{\mathbb{E}_\nu[p(\mathbf{y}|\mathbf{x},\bm{\theta})^{\lambda}]}\). At \(\bm{\theta}_0\), it holds that \(\nu p^{\lambda} = \nu\), because, by definition, \(p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)\) is constant for any \((\mathbf{y},\mathbf{x})\). In consequence, the gradient at \(\bm{\theta}_0\) simplifies to \begin{equation} \nabla_{\bm{\theta}} J_{\bm{\theta}_0}(\lambda) = \lambda \mathbb{E}_{\nu}\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)\right] - \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0) \right] = 0\,. \end{equation}
The Hessian of the cumulant \(J_{\bm{\theta}}(\lambda)\) w.r.t. \(\bm{\theta}\) can be written as follows: \begin{align} \nabla_{\bm{\theta} \bm{\theta}}J_{\bm{\theta}}(\lambda) &= \lambda^2 \text{Cov}_{\nu p^\lambda} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}) ) + \lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta} \bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta})\right] \\ &\quad- \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta} \bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta}) \right]\,. \end{align} Again, at \(\bm{\theta}_0\), it holds that \(\nu p^{\lambda} = \nu\), so the two expectation terms cancel and the Hessian of the cumulant at \(\bm{\theta}_0\) simplifies to \(\nabla_{\bm{\theta} \bm{\theta}}J_{\bm{\theta}_0}(\lambda) = \lambda^2 \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0) )\). With this, since the zeroth- and first-order terms vanish, the second-order Taylor expansion of \(J_{\bm{\theta}}(\lambda)\) around \(\bm{\theta}_0\) is \(\frac{\lambda^2 }{2}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)\).
On the other hand, replacing \(J_{\bm{\theta}}(\lambda)\) by the above Taylor expansion in the definition of \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) given in Equation \(\eqref{eq:inverseratefunction}\) leads to \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) = \inf_{\lambda>0} \ \frac{s}{\lambda} + \frac{\lambda}{2}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)\,. \end{equation} The optimal value of \(\lambda\) is obtained by setting the derivative of the above expression w.r.t. \(\lambda\) to zero, giving \begin{equation} \frac{-s}{\lambda^2} + \frac{1}{2}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0) = 0\,. \end{equation} The second derivative, \(2s/\lambda^3\), is positive for \(\lambda>0\), so this stationary point is a minimum. From this, the optimal value of \(\lambda\) is \(\lambda^\star = \left(\frac{2s}{(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)}\right)^{\frac{1}{2}}\). Using this expression, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) = 2s \sqrt{\frac{1}{2s}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)}\,. \end{equation} Evaluating the expression at \(s = \tfrac{1}{n}\ln\tfrac{k^p}{\delta}\) concludes the proof. □
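The closed form obtained in the proof can be checked against a direct numerical minimization. The following sketch assumes a hypothetical positive definite matrix in place of \(\text{Cov}_{\nu}(\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0))\) and arbitrary values for \(\bm{\theta}\), \(\bm{\theta}_0\) and \(s\); the helper name is illustrative.

```python
import numpy as np

def inverse_rate_quadratic(s, q, lambdas):
    """Grid minimization of s / lam + lam * q / 2 over lam > 0."""
    return np.min(s / lambdas + 0.5 * lambdas * q)

theta0 = np.zeros(3)
theta = np.array([0.3, -0.2, 0.5])
# Hypothetical gradient covariance at theta0 (symmetric, diagonally dominant).
cov = np.array([[2.0, 0.3, 0.0],
                [0.3, 1.0, 0.1],
                [0.0, 0.1, 0.5]])
q = (theta - theta0) @ cov @ (theta - theta0)   # squared Riemannian distance

s = 0.01                                        # stands in for (1/n) log(k^p / delta)
lambdas = np.geomspace(1e-4, 1e2, 20_001)
numeric = inverse_rate_quadratic(s, q, lambdas)
closed_form = np.sqrt(2.0 * s * q)              # value stated in Proposition 5.29
lam_star = np.sqrt(2.0 * s / q)                 # minimizer derived in the proof
print(numeric, closed_form, lam_star)
```

The grid minimum agrees with \(\sqrt{2sq}\) up to the resolution of the grid, and the first-order condition \(-s/\lambda^2 + q/2 = 0\) holds at `lam_star`.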
When training highly over-parameterized models using gradient descent, it is well known that parameters remain close to their initial values. This phenomenon is called lazy training and is a standard assumption in many theoretical analyses of deep neural networks (Jacot et al., 2018). When lazy training takes place, the second-order Taylor approximation of the cumulant could become highly accurate. The above proposition then suggests that the Riemannian distance, induced by the covariance matrix \(\text{Cov}_{\nu} \big(\nabla_{\bm{\theta}} \log p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)\big)\), would be an optimal proxy for the generalization error of an interpolator in the lazy training regime. This reasoning also helps to understand the limitations of the Euclidean distance in this regard.
Input-Gradient and Lipschitz Regularization.
Input-gradient regularization is a technique used in deep learning to improve the robustness of neural network models (Gouk et al., 2021). This approach focuses on the gradients of the model’s output with respect to its input, which are indicative of how sensitive the model’s predictions are to changes in the input data.
The following result shows how the inverse rate function and the expected norm of the input-gradient of a model are also related. This result is based on the log-Sobolev inequalities (Chafaï, 2004) and relies on the assumption that \(\nu(\mathbf{y}|\mathbf{x})\) is deterministic, to simplify gradients with respect to the target variable, and that \(\nu(\mathbf{x}, \mathbf{y})\) is a uniformly strictly log-concave density, an assumption satisfied by a wide range of distributions.
Proposition 5.30. If \(\nu(\mathbf{x}, \mathbf{y})\) is a uniformly strictly log-concave density and \(\nu(\mathbf{y}|\mathbf{x})\) is deterministic, then \(\exists M > 0\), such that, for any \(\bm{\theta} \in \bm{\Theta}\), \begin{equation} \textstyle \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \sqrt{\tfrac{1}{n}\log\tfrac{k^p}{\delta}}\sqrt{M\mathbb{E}_\nu\Big[\big\Vert\nabla_{\mathbf{x}}\ell(\mathbf{y},\mathbf{x},\bm{\theta})\big\Vert_{2}^{2}\Big]}\,. \end{equation}
Proof
Using Corollary 2.1 of (Chafaï, 2004) with \(\phi(f)=- \ln f\) and \(f_{\bm{\theta}}(\mathbf{y},\mathbf{x})=e^{-\lambda \ell(\mathbf{y}, \mathbf{x},\bm{\theta})}\), it holds that \begin{equation} J_{\bm{\theta}}(\lambda) \leq M\lambda^2\ \mathbb{E}_\nu\Big[\big\Vert\nabla_{\mathbf{x}}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})\big\Vert_{2}^{2}\Big]\,. \end{equation} Then, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) = \inf_{\lambda > 0} \frac{s + J_{\bm{\theta}}(\lambda)}{\lambda} \leq \inf_{\lambda > 0} \frac{s + M\lambda^2\mathbb{E}_\nu[\Vert\nabla_{\mathbf{x}}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})\Vert_2^2]}{\lambda}\,. \end{equation} Differentiating w.r.t. \(\lambda\) to compute the optimal value gives \(\lambda_{inf} = \sqrt{\frac{s}{M \mathbb{E}_\nu[\Vert\nabla_{\mathbf{x}}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})\Vert_2^2]}}\). Using this, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) \leq 2\sqrt{s}\sqrt{M \mathbb{E}_\nu[\Vert\nabla_{\mathbf{x}}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})\Vert_2^2]}\,. \end{equation} Since the statement only requires the existence of some constant \(M>0\), the factor \(2\) can be absorbed by replacing \(M\) with \(4M\), which gives the claimed bound. □
The above result explains why models with a small input-gradient norm tend to have a small generalization error, and why a small generalization error may coexist with a large input-gradient norm if the gap in the above inequality is large.
From the above result, it is straightforward to derive a connection between the generalization of a model and its Lipschitz constant. If a model \(\bm{\theta}\) is Lipschitz with respect to the input data \(\mathbf{x}\) with Lipschitz constant denoted \(\text{Lip}(\bm{\theta})\), then, by definition, the following inequality is verified: \begin{equation} \label{eq:Lip} \forall (\mathbf{x},\mathbf{y})\in\mathrm{supp}(\nu)\quad \|\nabla_{\mathbf{x}} \ell(\mathbf{y},\mathbf{x},\bm{\theta})\|_2^2\leq \text{Lip}(\bm{\theta})\,. \end{equation} The above inequality can be combined with Proposition 5.30 to derive a new upper bound on the inverse of the rate function and on the generalization error of a model: \begin{equation} \textstyle \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \sqrt{\tfrac{1}{n}\log\tfrac{k^p}{\delta}}\sqrt{M\text{Lip}(\bm{\theta})}\,. \end{equation} In consequence, models with a small Lipschitz constant will tend to have a small generalization error. This aligns fully with the existing literature (Neyshabur et al., 2017b) and, also, explains the rationale behind regularization techniques that prevent overfitting by controlling the Lipschitz constant (Gouk et al., 2021).
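The relation between the expected input-gradient norm and the Lipschitz bound can be made concrete on a model where both are available in closed form. The sketch below assumes a binary logistic model with labels in \(\{-1,+1\}\), for which \(\nabla_{\mathbf{x}}\ell = -y\,\sigma(-y\,\mathbf{w}^T\mathbf{x})\,\mathbf{w}\), so that the squared input-gradient norm never exceeds \(\|\mathbf{w}\|_2^2\). The model, data and function names are illustrative assumptions, not the setting analyzed in this chapter.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Per-sample loss log(1 + exp(-y * <w, x>)) with labels y in {-1, +1}."""
    return np.logaddexp(0.0, -y * (X @ w))

def input_gradient(w, X, y):
    """Analytic gradient of the loss w.r.t. each input x."""
    margin = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margin))      # sigmoid(-margin)
    return (-y * sigma)[:, None] * w[None, :]

rng = np.random.default_rng(0)
w = np.array([1.5, -0.5, 2.0])                # hypothetical model parameters
X = rng.normal(size=(5000, 3))
y = np.where(rng.random(5000) < 0.5, -1.0, 1.0)

grads = input_gradient(w, X, y)
sq_norms = np.sum(grads**2, axis=1)
expected_sq_norm = sq_norms.mean()            # estimate of E[||grad_x loss||^2]
lip = w @ w                                   # bound on the squared gradient norm
print(expected_sq_norm, lip)
```

Here `lip` plays the role of \(\text{Lip}(\bm{\theta})\) in Equation \(\eqref{eq:Lip}\): it bounds every per-sample squared gradient norm, and therefore also their expectation, which is the quantity appearing in Proposition 5.30.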
Summary.
In this section, an optimal (distribution-dependent) regularizer for interpolators has been theoretically characterized, even for over-parameterized model classes. This theoretical characterization can be used to understand why a wide range of commonly used regularization techniques promotes interpolators with a smaller generalization error.
In some specific but not less relevant situations, it has been analyzed when some of these common regularization methods could be optimal for interpolators; even in over-parameterized model classes. In this regard, this thesis complements other related theoretical results, like the ones proposed by (Bartlett et al., 2020), where the Euclidean norm was shown to be an optimal regularizer for linear regression interpolators under some specific assumptions.
This section draws connections between the proposed definition of smoothness, as outlined in Section 5.3.4, and other existing definitions in the literature. The smoothest interpolator is shown to have near-optimal performance in Proposition 5.26. The clearest connection with existing measures of smoothness arises from the characterization of the Lipschitz constant, as demonstrated by combining Equation \(\eqref{eq:Lip}\) and Proposition 5.30. The Lipschitz constant is frequently cited in the literature as a measure of a model’s smoothness (Gouk et al., 2021). In this thesis, it has been shown how models with a small Lipschitz constant have a small inverse rate and, in consequence, are smoother according to the proposed definition.
Invariances
In many machine learning settings, the observed inputs \(\mathbf{x}\) are transformed versions of underlying signals, typically as a consequence of the measurement process. For instance, sensor readings may be corrupted by random noise from imperfect hardware, and images may be rotated, blurred, or otherwise distorted during acquisition (Bronstein et al., 2021). This section interprets two widely used strategies—data augmentation (Shorten & Khoshgoftaar, 2019) and invariant architectures (Bronstein et al., 2021)—through the lens of PAC–Chernoff bounds and the rate function, viewing them as complementary mechanisms for improving learning under transformed inputs.
Let \(G\) denote a set of input-data transformations such that any \(g \in G\) denotes a function, \(g:{\cal X}\rightarrow{\cal X}\), defining a specific transformation. Assume there exists an unknown distribution \(h\) over \(G\). In image classification, \(G\) could be the set of all possible rotations, translations and/or reflections.
The central hypothesis in this section is that the generative process of the data follows the framework defined in the previous paragraphs.
Assumption 5.31. The data-generating distribution has the following structure: \begin{equation} \begin{aligned} (i) \,\, \mathbf{x}_0 \sim \nu_0(\mathbf{x}_0), \quad (ii) \,\,\mathbf{y} \sim \nu(\mathbf{y}|\mathbf{x}_0), \quad (iii) \,\,g \sim h(g), \quad (iv)\,\, \mathbf{x} = g(\mathbf{x}_0)\,, \end{aligned} \end{equation} where \(\mathbf{x}_0\) denotes the untransformed and unobserved input, which is distributed according to \(\nu_0\), and \(\mathbf{x}\) denotes the observed (transformed) input. See Figure 5.15 (left) for a graph representation.
Under this assumption, the target value of an input is sampled before its transformation, that is, transforming an input does not affect its label. In statistical terms, under Assumption 5.31, the target random variable \(\mathbf{y}\) is conditionally independent of \(\mathbf{x}\) given \(\mathbf{x}_0\). This fact aligns with many common settings, where, for example, rotating, reflecting or cropping an image does not alter the label of the object being represented in it (Bronstein et al., 2021). This assumption is weaker than the usual ones used in many related works (Bronstein et al., 2021; Chen et al., 2020), where the conditional generating distribution is assumed to be invariant under any transformation. These works assume that \(\forall g\in G\), \(\nu(\mathbf{y}|\mathbf{x})=\nu(\mathbf{y}|g(\mathbf{x}))\). The latter assumption does not hold, for example, when considering rotations of digits, as the true label of the image of a six rotated 180 degrees changes to a nine.
Consider the generalized scenario where several transformations can be applied, that is, \(g \in G\) is the composition of other transformations \(g(\mathbf{x}_0) = g_T \circ \cdots \circ g_1(\mathbf{x}_0)=\mathbf{x}\). Let \(G_1, \dots, G_T\) denote the sets of all possible individual transformations, with \(G=\cup_t G_t\). Notice that the value of \(T\) is arbitrary and can differ from one transformation \(g \in G\) to another; however, to simplify the notation, the same \(T\) will be used consistently, where one may consider that some \(g_i \in G_i\) are the identity. Then, consider the non-fully-transformed inputs \(\mathbf{x}_t = g_t\circ\cdots \circ g_1(\mathbf{x}_0)\), with \(0<t<T\). The following result shows that the mutual information between the targets and the inputs, denoted \(MI(\mathbf{y}\, ; \mathbf{x})\), decreases the further the input is transformed.
Proposition 5.32. Under Assumption 5.31, let \(\mathbf{x}_t\) denote the random variable obtained by transforming an input \(\mathbf{x}_0\) with a composition of \(t\) transformations. Then \begin{equation} MI(\mathbf{y}\, ; \mathbf{x}_{t})\leq MI(\mathbf{y}\, ; \mathbf{x}_{t-1}) \quad \forall t \in \mathbb{N}^+\,. \end{equation} As a consequence, \(MI(\mathbf{y}; \mathbf{x}) = MI(\mathbf{y}; \mathbf{x}_T) \leq MI(\mathbf{y}; \mathbf{x}_0)\).
Proof
According to Assumption 5.31, the targets \(\mathbf{y}\) are conditionally independent of the input \(\mathbf{x}_t\) given \(\mathbf{x}_{t-1}\), that is, \(\mathbf{y} \perp \mathbf{x}_t|\mathbf{x}_{t-1}\). This is a result of the fact that \(\mathbf{x}_{t-1}\) d-separates \(\mathbf{x}_t\) and \(\mathbf{y}\) in the graph representation of Assumption 5.31. The result then follows from the data processing inequality. □
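The monotone loss of information stated in Proposition 5.32 can be observed on a small discrete example. The sketch below assumes finite alphabets and models each random transformation as a noisy channel that keeps the current symbol with some probability and otherwise replaces it uniformly at random; the channel, the alphabet size and the helper names are illustrative choices only.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def noisy_channel(k, flip):
    """Row-stochastic matrix: keep the symbol w.p. 1 - flip, else resample uniformly."""
    return (1.0 - flip) * np.eye(k) + flip * np.full((k, k), 1.0 / k)

k = 4
p_x0 = np.full(k, 1.0 / k)                 # nu_0(x_0)
p_y_given_x0 = noisy_channel(k, 0.1)       # nu(y | x_0), rows indexed by x_0
joint = p_x0[:, None] * p_y_given_x0       # joint[x_0, y]

transform = noisy_channel(k, 0.2)          # P(x_{t+1} | x_t), rows indexed by x_t
mi_values = []
for t in range(4):
    mi_values.append(mutual_information(joint))
    # The label is untouched; only the input passes through the channel.
    joint = transform.T @ joint            # new table joint[x_{t+1}, y]
print(mi_values)
```

Each pass through the channel leaves the label distribution unchanged while blurring the input, so the sequence `mi_values` is non-increasing, in line with the data processing inequality.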
The rate function and PAC-Chernoff bounds can be directly applied to explain why learning on transformed data \(\mathbf{x}\) makes interpolators have worse generalization error compared to “ideally” learning on the untransformed data \(\mathbf{x}_0\), which is never observed. In the following sections, this argument is used to explain the reason why invariant architectures and data augmentation promote models with better generalization errors by exploiting the fact that the inputs are transformed.
Open Question 5.8. Why do models interpolating transformed data, as described in Assumption 5.31, have a higher generalization error?
For a fixed data-generating distribution, the law of \(\hat{L}(D,\bm{\theta})\) governs the generalization error of a model \(\bm{\theta}\in\bm{\Theta}\). Proposition 5.32 shows that learning on transformed inputs reduces the information that \(\mathbf{x}\) carries about its target \(\mathbf{y}\). Consequently, under transformed inputs, \(\hat{L}(D,\bm{\theta})\) becomes less concentrated and its expectation increases. Figure 5.16 illustrates this phenomenon for a multi-layer perceptron and an Inception network on CIFAR-10. The left panel displays the distributions of \(\hat{L}(D_0,\bm{\theta})\), \(\hat{L}(D_1,\bm{\theta})\), and \(\hat{L}(D_2,\bm{\theta})\) for datasets of size \(50\), where \(D_0\) is untransformed, \(D_1\) applies random translations to \(D_0\), and \(D_2\) further applies random rotations to \(D_1\); these correspond to draws from \(\nu_0,\nu_1,\) and \(\nu_2\), respectively. Transforming the inputs once (translations) and twice (translations + rotations) increases both the mean loss and its dispersion, visible in the widening box-plots. As shown in the center and right panels, the reduced concentration of \(\hat{L}(\cdot,\bm{\theta})\) under transformed inputs is mirrored by smaller rate functions. Note that a fixed model can exhibit different rate functions across data-generating distributions, since the rate function depends on both the model and the distribution. The figure also indicates that the adverse effect of input transformations can be partially mitigated by invariant architectures: for example, the Inception network’s convolutional layers confer translation invariance (improving behavior under \(\nu_1\)) but not rotation invariance (leaving performance under \(\nu_2\) relatively degraded). The next subsection elaborates on this point.
Figure 5.16. Panels: loss distribution (left), rate functions for the MLP (center), rate functions for Inception (right).
The changes observed in the distribution of the empirical loss \(\hat{L}(\cdot,\bm{\theta})\) under transformed inputs can be formalized within the present framework by introducing suitable assumptions. Although these assumptions are not generally verifiable in practice, the obtained results provide theoretical justification for why input transformations make the distribution of \(\hat{L}(\cdot,\bm{\theta})\) less concentrated and increase its expectation. Formally, let \(\nu_t\) denote the marginal distribution of inputs transformed \(t\) times, \(\mathbf{x}_t = g_t \circ \cdots \circ g_1(\mathbf{x}_0)\), and write \(L^{\nu_t}(\bm{\theta})\) and \(L^{\nu_{t+1}}(\bm{\theta})\) for the expected loss of model \(\bm{\theta}\) under \(\nu_t\) and \(\nu_{t+1}\), respectively. Likewise, let \({\cal I}_{\bm{\theta}}^{\nu_t}(a)\) and \({\cal I}_{\bm{\theta}}^{\nu_{t+1}}(a)\) denote the corresponding rate functions.
Proposition 5.33. Under Assumption 5.31, if \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is convex in \(\mathbf{x}\) and \(\mathbb{E}_{g_t}[g_t(\mathbf{x})]=\mathbf{x}\), then, \(\forall\bm{\theta}\in \bm{\Theta},\ L^{\nu_{t+1}}(\bm{\theta})\geq L^{\nu_t}(\bm{\theta})\).
Proof
By definition, \(L^{\nu_{t+1}}(\bm{\theta})= \mathbb{E}_{\nu_{t+1}}[\ell(\mathbf{y},\mathbf{x}_{t+1},\bm{\theta})] = \mathbb{E}_{\nu_t}\mathbb{E}_{g_{t+1}}[\ell(\mathbf{y},g_{t+1}(\mathbf{x}_t),\bm{\theta})]\). As \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is convex in \(\mathbf{x}\), Jensen's inequality allows the expectation under \(g_{t+1}\) to be moved inside \(\ell\): \begin{equation} L^{\nu_{t+1}}(\bm{\theta}) \geq \mathbb{E}_{\nu_t} [\ell(\mathbf{y},\mathbb{E}_{g_{t+1}}[g_{t+1}(\mathbf{x}_t)],\bm{\theta})]\,. \end{equation} Given that \(\mathbb{E}_{g_{t+1}}[g_{t+1}(\mathbf{x})]=\mathbf{x}\), it follows that \(L^{\nu_{t+1}}(\bm{\theta}) \geq \mathbb{E}_{\nu_t}[\ell(\mathbf{y}, \mathbf{x}_t,\bm{\theta})] = L^{\nu_{t}}(\bm{\theta})\). □
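A simple case where both hypotheses of Proposition 5.33 hold is a squared loss with additive zero-mean input noise. The sketch below assumes a scalar linear model, Gaussian inputs and Gaussian perturbations \(g(x) = x + \varepsilon\) with \(\mathbb{E}[\varepsilon]=0\); all numerical values and names are hypothetical.

```python
import numpy as np

def loss(theta, x, y):
    """Squared loss of a scalar linear model; convex in x for fixed theta and y."""
    return (theta * x - y) ** 2

rng = np.random.default_rng(0)
n, theta, noise_std = 100_000, 2.0, 0.5
x = rng.normal(size=n)                      # untransformed inputs x_0
y = 2.0 * x + 0.1 * rng.normal(size=n)      # targets drawn from x_0 only

expected_losses = []
for t in range(3):
    expected_losses.append(loss(theta, x, y).mean())   # estimate of L^{nu_t}(theta)
    x = x + noise_std * rng.normal(size=n)              # mean-preserving transformation
print(expected_losses)
```

In this setting each transformation adds \(\theta^2\,\mathbb{V}(\varepsilon)\) to the expected loss, so the Monte Carlo estimates increase with \(t\) as the proposition predicts.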
The next result explains why the empirical loss becomes less concentrated when inputs are transformed as in Assumption 5.31. For a fixed datum \(\mathbf{x}\), a model incurs loss \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\); after applying a random transformation \(g\sim h\) to the same input, the incurred loss is \(\ell(\mathbf{y},g(\mathbf{x}),\bm{\theta})\), which may be larger or smaller than the original value. Define the relative change \begin{equation} \Delta(\mathbf{y},\mathbf{x},g,\bm{\theta}) := \ell(\mathbf{y},g(\mathbf{x}),\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}). \end{equation} If \(\Delta(\mathbf{y},\mathbf{x},g,\bm{\theta})\) is statistically independent of the baseline loss \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\), then the empirical loss under transformed inputs is less concentrated. Intuitively, the magnitude of the loss change induced by the transformation does not depend on the original (untransformed) loss, thereby increasing variability in the aggregate.
Proposition 5.34. Under Assumption 5.31, if the change observed in the loss after a transformation \(\Delta(\mathbf{y},\mathbf{x},g,\bm{\theta}) = \ell(\mathbf{y},g(\mathbf{x}),\bm{\theta}) - \ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is statistically independent of the loss itself, \(\Delta(\mathbf{y},\mathbf{x},g,\bm{\theta})\perp \ell(\mathbf{y},\mathbf{x},\bm{\theta})\), then \begin{equation} {\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a)\leq {\cal I}^{\nu_{t}}_{\bm{\theta}}(a)\quad \forall a>0\,. \end{equation}
Proof
By definition, \(J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t+1}}[e^{-\lambda\ell(\mathbf{y},\mathbf{x}_{t+1},\bm{\theta})}]\). Expanding \(\nu_{t+1}\) gives \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}}\left[e^{-\lambda \ell(\mathbf{y},g_{t+1}(\mathbf{x}_{t}),\bm{\theta})}\right]\,. \end{equation} Using the decomposition \(\ell(\mathbf{y},g_{t+1}(\mathbf{x}_{t}),\bm{\theta}) = \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})+\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta})\), \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}}\left[e^{-\lambda (\ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})+\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta}))}\right]\,, \end{equation} which, because \(\Delta\) is independent of the loss, factorizes as \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\left[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}\right]+\ln \mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}}\left[e^{-\lambda\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta})}\right]. \end{equation} Applying Jensen’s inequality to the last term, \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) \geq \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\left[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}\right]-\lambda\mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}} [\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta})]\,, \end{equation} and using the definition of \(\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta})\), \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) \geq \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\left[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}\right]-\lambda\mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}} [\ell(\mathbf{y},g_{t+1}(\mathbf{x}_t),\bm{\theta})] + \lambda\mathbb{E}_{\nu_{t}}[\ell(\mathbf{y},\mathbf{x}_t,\bm{\theta})]\,. \end{equation} By definition of the expected loss, \begin{align} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) &\geq \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}]-\lambda L^{\nu_{t+1}}(\bm{\theta}) + \lambda L^{\nu_{t}}(\bm{\theta})\\ &= \ln \mathbb{E}_{\nu_{t}}[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}]+\lambda L^{\nu_{t}}(\bm{\theta}) = J^{\nu_{t}}_{\bm{\theta}}(\lambda)\,. \end{align} Hence \(J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) \geq J^{\nu_{t}}_{\bm{\theta}}(\lambda)\) for every \(\lambda>0\). Since the rate function is obtained from the cumulant function through a Legendre transform, which reverses order, this implies \({\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a)\leq {\cal I}^{\nu_{t}}_{\bm{\theta}}(a)\). □
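The inequality in Proposition 5.34 can be checked numerically on a toy example. The sketch below is only illustrative: the finite loss distribution and the independent loss change \(\Delta\) are hypothetical values, and the rate function is assumed to take the Legendre form \({\cal I}(a)=\sup_{\lambda>0}[\lambda a - J(\lambda)]\), approximated here on a grid of \(\lambda\) values.

```python
import math

def cumulant(values, probs, lam):
    """J(lam) = lam * E[loss] + ln E[exp(-lam * loss)] for a finite distribution."""
    mean = sum(p * v for v, p in zip(values, probs))
    mgf = sum(p * math.exp(-lam * v) for v, p in zip(values, probs))
    return lam * mean + math.log(mgf)

def rate(values, probs, a, lams):
    """Grid approximation of I(a) = sup_{lam > 0} [lam * a - J(lam)] (assumed form)."""
    return max(lam * a - cumulant(values, probs, lam) for lam in lams)

# Hypothetical loss distribution of a fixed model under nu_t.
loss_vals, loss_probs = [0.1, 0.5, 2.0], [0.5, 0.3, 0.2]
# Hypothetical loss change Delta, statistically independent of the loss.
delta_vals, delta_probs = [0.0, 0.4, 1.0], [0.6, 0.3, 0.1]

# Loss under nu_{t+1}: sum of two independent variables (product distribution).
new_vals = [l + d for l in loss_vals for d in delta_vals]
new_probs = [pl * pd for pl in loss_probs for pd in delta_probs]

lams = [0.05 * k for k in range(1, 201)]  # lambda grid on (0, 10]
a_grid = [0.05 * k for k in range(1, 11)]

# J under nu_{t+1} minus J under nu_t, for every lambda on the grid.
j_gap = [cumulant(new_vals, new_probs, lam) - cumulant(loss_vals, loss_probs, lam)
         for lam in lams]
# I under nu_t minus I under nu_{t+1}, for every a on the grid.
i_gap = [rate(loss_vals, loss_probs, a, lams) - rate(new_vals, new_probs, a, lams)
         for a in a_grid]

print(min(j_gap) >= 0, min(i_gap) >= 0)
```

Both gaps are non-negative on the grid, in line with the proposition: the cumulant function grows and the rate function shrinks after the transformation.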
In summary, when the input data is transformed (Assumption 5.31), it provides less information about the target variable (Proposition 5.32). As a result, the distribution of the empirical loss \(\hat{L}(\cdot, \bm{\theta})\) becomes less concentrated and the expected loss \(L(\bm{\theta})\) increases, as shown empirically in Figure 5.16 and argued theoretically in Propositions 5.33 and 5.34.
The distribution-dependent bound given in Theorem 5.17 can be restated in the notation of this section. With high probability over random draws \(D_t\sim\nu_t^n\), simultaneously for all \(\bm{\theta}\in\bm{\Theta}\): \begin{equation} \label{eq:chernoffbound:transformeddata} \text{if $\hat{L}(D_t,\bm{\theta})\leq \epsilon$ \ then \ \ } \big({\cal I}^{\nu_t}_{\bm{\theta}}\big)^{-1}(\tfrac{1}{n}\log\tfrac{k^p}{\delta})\leq L^{\nu_t}(\bm{\theta}) \leq \big({\cal I}^{\nu_t}_{\bm{\theta}}\big)^{-1}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) + \epsilon\,. \end{equation}
The bound above clarifies why transformed inputs lead interpolators to exhibit larger generalization error. Transformations reduce the concentration of \(\hat{L}(D_t,\bm{\theta})\) and thereby decrease the rate function (see Figure 5.16 and Proposition 5.34). Consequently, the inverse rate function increases, and so does \(({\cal I}^{\nu_t}_{\bm{\theta}})^{-1}\bigl(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\bigr)\). Since \(\epsilon\) is assumed to be very small, the generalization error of the interpolator increases accordingly. In summary, the PAC–Chernoff bound explains why interpolators trained on transformed inputs incur higher generalization error.
Invariant Architectures
Invariant architectures are neural networks whose outputs are unaffected by certain transformations of the input data (Bronstein et al., 2021). For images, these transformations might include scaling, rotation, translation, or other geometric changes.
Definition 5.35. Given a set of transformations \(G\), a model \(\bm{\theta} \in \bm{\Theta}\) is said to be \(G\)-invariant if \begin{equation} P(\mathbf{y}|\mathbf{x}, \bm{\theta}) = P(\mathbf{y}|g(\mathbf{x}), \bm{\theta}) \quad \forall g \in G\ \ \forall(\mathbf{x}, \mathbf{y}) \in \mathcal{X} \times \mathcal{Y}\,. \end{equation}
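As a minimal illustration of Definition 5.35, the sketch below takes \(G\) to be the cyclic shifts of a one-dimensional signal and compares two toy classifiers. All names, filters, weights, and the input are hypothetical. The first model applies circular filters followed by global max pooling, which makes its predictive distribution shift invariant; the second uses position-dependent linear scores and is not invariant.

```python
import math

def circ_corr(x, w):
    """Circular cross-correlation of signal x with filter w."""
    n = len(x)
    return [sum(w[k] * x[(i + k) % n] for k in range(len(w))) for i in range(n)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_model(x, filters):
    """P(y | x): one circular filter per class, then global max pooling."""
    return softmax([max(circ_corr(x, w)) for w in filters])

def linear_model(x, weights):
    """P(y | x): position-dependent linear scores (not shift invariant)."""
    return softmax([sum(wi * xi for wi, xi in zip(w, x)) for w in weights])

def shift(x, s):
    """Cyclic shift g_s; the shifts s = 0, ..., len(x) - 1 play the role of G."""
    return x[s:] + x[:s]

x = [3.0, 1.0, 0.0, 0.0, 2.0, 0.0]
filters = [[1.0, -1.0], [0.5, 0.5, 0.5]]
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]

def max_change(model, params):
    """Largest change in any class probability over all shifts of x."""
    base = model(x, params)
    return max(abs(p - q)
               for s in range(len(x))
               for p, q in zip(base, model(shift(x, s), params)))

pooled_gap = max_change(pooled_model, filters)
linear_gap = max_change(linear_model, weights)
print(pooled_gap, linear_gap)
```

The pooled model leaves \(P(\mathbf{y}|\mathbf{x},\bm{\theta})\) unchanged under every shift, whereas the linear model's predictions change noticeably.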
Invariant architectures—such as convolutional neural networks (CNNs) (LeCun et al., 1998)—are crucial because they recognize and process patterns independently of position, scale, or orientation. This property is especially important in applications like image and speech recognition, where the salient structure must be captured despite such variations. Empirically, invariant architectures have consistently outperformed non-invariant counterparts (Goodfellow et al., 2016).
The generalization behavior of invariant neural networks has been extensively studied. Several bounds include complexity terms that shrink when the model architecture is invariant (Behboodi et al., 2022; Elesedy, 2022; Sokolic et al., 2017). While these results offer theoretical support for improved generalization with invariance, they primarily establish reductions in complexity surrogates rather than demonstrably tighter bounds on the generalization error itself. Moreover, as argued in Section 5.3, there is accumulating empirical (Zhang et al., 2017) and theoretical evidence (Gastpar et al., 2024; Nagarajan & Kolter, 2019b; Wang et al., 2024) that many such bounds are non-tight—or even vacuous—and can violate basic desiderata (e.g., failing to decrease with larger sample sizes, as in (Behboodi et al., 2022)). Consequently, to the best of current knowledge, there is no widely accepted, specific generalization bound that comprehensively captures the effect of architectural invariance on an interpolator’s generalization error, particularly in over-parameterized regimes. This raises the following open question:
Open Question 5.9. Why do (over-parameterized) invariant interpolators generalize better?
While a model’s final generalization error is influenced by additional factors (e.g., optimization dynamics), a partial answer to the foregoing open question can be obtained by analyzing the rate function of invariant architectures. Under transformed inputs (Assumption 5.31), the key advantage of invariant models is that they are unaffected by input transformations and thus evade the issues identified in the preceding subsection. This observation can be formalized in the following result.
Proposition 5.36. Under Assumption 5.31, if a model \(\bm{\theta} \in \bm{\Theta}\) is \(G_t\)-invariant then, \begin{equation} L^{\nu_{t+1}}(\bm{\theta}) = L^{\nu_t}(\bm{\theta}) \quad \text{and} \quad {\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a) = {\cal I}^{\nu_{t}}_{\bm{\theta}}(a)\quad \forall a>0\,. \end{equation}
Proof
It follows immediately from the definition of model invariance (Definition 5.35) and the definitions of \(L^{\nu_{t+1}}(\bm{\theta})\), \(L^{\nu_t}(\bm{\theta})\), \({\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a)\) and \({\cal I}^{\nu_{t}}_{\bm{\theta}}(a)\). □
As argued above, transforming the inputs makes the empirical loss \(\hat{L}(\cdot,\bm{\theta})\) less concentrated (i.e., with a smaller rate function) and increases its expectation. The preceding result shows that invariant architectures circumvent this issue: for such models, the distribution of the empirical loss is unaffected by transformed inputs.
Figure 5.16 illustrates this effect. Recall that \(D_0 \sim \nu_0^{50}\) denotes the original input distribution, and \(D_1 \sim \nu_1^{50}\) the distribution obtained by applying random translations to \(D_0\). As expected, for a fixed MLP—whose architecture is not translation invariant—the expected loss increases markedly and the rate function decreases substantially when moving from \(D_0\) to \(D_1\). By contrast, for a fixed Inception model, the changes are minimal: the model does not suffer the pronounced degradation observed for the MLP. The small residual difference between results on \(D_0\) and \(D_1\) for Inception, despite Proposition 5.36, arises because the experimental translations were implemented via padding (e.g., zero-padding), which introduces slight output differences even for architectures with convolution and max-pooling.
| Model | Train Acc. | Test Acc. | Test NLL |
|---|---|---|---|
| Inception | \(100.0\%\) | \(74.08\%\) | \(1.00\) |
| Inception-Shuffle | \(100.0\%\) | \(42.46\%\) | \(2.45\) |
| MLP | \(99.99\%\) | \(51.69\%\) | \(3.29\) |
| MLP-Shuffle | \(99.99\%\) | \(51.12\%\) | \(3.29\) |
| Initial Inception | \(10.00\%\) | \(10.00\%\) | \(2.30\) |
| Initial MLP | \(10.00\%\) | \(9.96\%\) | \(2.30\) |
The PAC–Chernoff bound in Equation \(\eqref{eq:chernoffbound:transformeddata}\) captures this phenomenon precisely. As argued above, the presence of transformed inputs implies \begin{equation} \bigl({\cal I}^{\nu_1}_{\bm{\theta}}\bigr)^{-1}\!\left(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\right) \;\ge\; \bigl({\cal I}^{\nu_0}_{\bm{\theta}}\bigr)^{-1}\!\left(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\right). \end{equation} However, if the model \(\bm{\theta}\) is invariant to the transformations present under \(\nu_1\), then, by Proposition 5.36, the complexity term in the bound does not increase. Combined with the fact that the distribution-dependent PAC–Chernoff bound is tight for interpolators, this provides, to the best of current knowledge, the most compelling theoretical explanation for why invariant models that interpolate the training data achieve lower generalization error. The PAC-Chernoff bound explains why invariant interpolators have smaller generalization errors than non-invariant interpolators under transformed inputs.
Figure 5.17 illustrates this behavior from a complementary perspective. Although the full set of invariances of an Inception network is not precisely known, convolutional layers confer at least translation invariance. To ablate this property, the experiment applies a fixed random permutation of image pixels—interpretable as an additional input transformation \(g_{T+1}\) composed on top of the unknown sequence of transformations that generate CIFAR-10, \(\nu_T = g_T\circ\cdots\circ g_1(\nu_0)\). Under the resulting distribution \(\nu_{T+1}\), a convolutional architecture such as Inception ceases to be translation invariant. According to the analysis at the beginning of this section, the empirical loss should become less concentrated under \(\nu_{T+1}\), and the rate function should therefore decrease relative to \(\nu_T\). This effect is visible in the figure: the learned Inception models exhibit larger rate functions under \(\nu_T\) (label Inception) than under \(\nu_{T+1}\) (label Inception–Shuffle). Rate functions are computed with respect to the log-loss or negative log-likelihood (NLL).
The same procedure is applied to an MLP, which possesses no local invariances. As expected, its rate function is essentially unaffected by the pixel permutation. Although a given MLP is not invariant to shuffling, the architecture is equivariant in the sense that, for any MLP \(\bm{\theta}\) and permutation \(g_{T+1}\), there exists another MLP \(\bm{\theta}'\) (obtained by permuting the first-layer weights) that yields the same predictions on \(\nu_{T+1}\) as \(\bm{\theta}\) does on \(\nu_T\). Consequently, a learning algorithm can recover “the same” MLP under \(\nu_T\) and \(\nu_{T+1}\), leading to (nearly) identical rate functions in the experiment. The accompanying table further shows that non-invariant models incur substantially higher generalization error, in agreement with the theoretical predictions.
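The weight-permutation argument for MLPs can be made concrete with a small sketch. The network sizes, random weights, and helper names below are hypothetical. The code builds \(\bm{\theta}'\) by applying the pixel permutation to the columns of the first-layer weight matrix and checks that \(\bm{\theta}'\) on shuffled inputs reproduces the outputs of \(\bm{\theta}\) on the original inputs.

```python
import random

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer ReLU network returning output scores."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

rng = random.Random(0)
d, width, classes = 8, 5, 3  # hypothetical sizes
W1 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(width)]
b1 = [rng.gauss(0, 1) for _ in range(width)]
W2 = [[rng.gauss(0, 1) for _ in range(width)] for _ in range(classes)]
b2 = [rng.gauss(0, 1) for _ in range(classes)]

perm = list(range(d))
rng.shuffle(perm)  # fixed random pixel permutation

def permute(v):
    """Apply the fixed permutation to a vector of length d."""
    return [v[j] for j in perm]

# theta': the same permutation applied to each row of the first-layer weights.
W1_perm = [permute(row) for row in W1]

x = [rng.gauss(0, 1) for _ in range(d)]
out_original = mlp(x, W1, b1, W2, b2)                # theta on the original input
out_shuffled = mlp(permute(x), W1_perm, b1, W2, b2)  # theta' on the shuffled input

gap = max(abs(a - b) for a, b in zip(out_original, out_shuffled))
print(gap)
```

The two outputs agree up to floating-point rounding, which is why a learning algorithm can, in principle, recover an equivalent MLP under the shuffled distribution.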
Data Augmentation
Data augmentation is a widely used technique in machine learning that enhances the robustness and generalization ability of models, particularly in computer vision (Shorten & Khoshgoftaar, 2019). By applying transformations such as rotations, reflections, scaling, and translations to existing data samples, data augmentation artificially expands the dataset. This process helps the learned models become invariant, or at least less sensitive, to these transformations, leading to improved performance on unseen data.
Recent research has examined the theoretical foundations of data augmentation (DA), with significant contributions from Chen et al. (2020) and Lyle et al. (2020). These works assume that the set of transformations \(G\) forms a group and that the transformations do not alter the data-generating distribution (an equality-in-distribution assumption). In this setting, they show that DA can decrease variance and improve bounds on the generalization error. Lyle et al. (2020) further present a PAC-Bayes bound based on these premises, showing that data augmentation can lower the bound’s complexity measure. However, these studies face a critical difficulty: the equality-in-distribution assumption does not hold in practice, since it implies that any model \(\bm{\theta}\) would attain identical expected losses on transformed and untransformed input data. This is contradicted by the evidence in Figure 5.16, where two models exhibit deteriorating performance as the data is increasingly transformed. Additionally, the PAC-Bayes bound introduced by Lyle et al. (2020) is not proven to be tight for interpolators. Thus, as highlighted in the introduction, merely demonstrating that data augmentation reduces the complexity term of a bound, which might be vacuous, does not guarantee a lower generalization error for an interpolator.
In this section, a more nuanced account is provided within the proposed framework. First, it is shown that data augmentation induces a more concentrated empirical loss, aligning with prior observations that augmentation reduces the variance of empirical risk (Chen et al., 2020; Lyle et al., 2020). Second, it is established that the complexity term in the PAC–Chernoff bound of Theorem 5.17 decreases under augmentation. Because this bound is perfectly tight for interpolators trained on augmented data, it follows that data augmentation reduces the generalization error of interpolators—by contrast with earlier analyses based on non-tight bounds (Lyle et al., 2020).
As argued, for example, by Chen et al. (2020), data augmentation is equivalent to optimizing, using Monte Carlo estimates, the so-called data-augmented loss, denoted \(\ell_{G}\), which averages the standard loss over all transformations of a given input \(\mathbf{x}\), \begin{equation} \label{eq:DA:loss} \ell_{G}(\mathbf{y}, \mathbf{x}, \bm{\theta}) = \mathbb{E}_{g\sim h}\left[ \ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta})\right]\,. \end{equation} A natural question is whether the augmented loss \(\ell_{G}\) yields a more concentrated empirical loss (i.e., a larger rate function), since greater concentration implies reduced generalization error for interpolators. The answer is nuanced. To streamline the discussion, the principal points are summarized below:
Each transformed dataset \(D_T\) corresponds to an underlying, unaltered dataset \(D_0\), from which \(D_T\) is obtained via random transformations \(g_t\sim h_t\) for \(t=1,\ldots,T\). Two empirical losses are considered:
The conventional empirical loss on transformed inputs, denoted \(\hat{L}^{\ell}(D_T,\bm{\theta})\) to emphasize that \(D_T\sim \nu_T^n\) and the standard loss \(\ell\) is used.
The augmented empirical loss on the original, untransformed inputs, denoted \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\), where \(D_0\sim \nu_0^n\) and \begin{equation} \hat{L}^{\ell_G}(D_0,\bm{\theta}) \;=\; \frac{1}{n}\sum_{(\mathbf{x}_i,\mathbf{y}_i)\in D_0} \ell_G(\mathbf{y}_i,\mathbf{x}_i,\bm{\theta}), \end{equation} with \(\ell_G\) defined in Equation \(\eqref{eq:DA:loss}\).
Note that \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\) is unobserved because \(D_0\) is not available, yet it plays a central role in explaining the mechanics of data augmentation.
Theorem 5.37 shows—via the rate function—that the augmented empirical loss on untransformed inputs, \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\), is more concentrated than the conventional empirical loss on transformed inputs, \(\hat{L}^{\ell}(D_T,\bm{\theta})\). Since their expectations (population risks) coincide, minimizing \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\) constitutes a better optimization objective: by Chernoff’s bound (Theorem 5.14), lower generalization errors are more likely.
In practice, \(D_0\) is unavailable, and data augmentation optimizes \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) rather than \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\). Therefore, it is necessary to relate \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) to \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\).
Proposition 5.39 establishes that \(\hat{L}^{\ell_G}(D_T,\bm{\theta})=\hat{L}^{\ell_G}(D_0,\bm{\theta})\) when \(G\) is a group and \(h\) is uniform. Consequently, any interpolator of \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) also interpolates \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\), which is more concentrated than \(\hat{L}^{\ell}(D_T,\bm{\theta})\). Many practical augmentation schemes, however, fall outside this group setting (e.g., rotations within a finite range, random cropping).
Proposition 5.40 shows that \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) and \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\) share the same optimal interpolators (i.e., minimizers with zero empirical error). Thus, if \(\bm{\theta}\) perfectly interpolates \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\), then it also perfectly interpolates \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\). Such an interpolator is likely to achieve a lower generalization error than one minimizing \(\hat{L}^{\ell}(D_T,\bm{\theta})\), since \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\) is more concentrated than \(\hat{L}^{\ell}(D_T,\bm{\theta})\). By Chernoff’s bound, an interpolator minimizing \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) is therefore more likely to generalize better.
In what follows, \(L^{\nu_0,\ell_G}(\bm{\theta})\) and \(\mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}(a)\) denote, respectively, the expected loss and the rate function associated with the augmented loss \(\ell_G\) under the untransformed distribution \(\nu_0\). Under this notation, \begin{equation} L(\bm{\theta}) \;=\; L^{\nu_T,\ell}(\bm{\theta}) \qquad\text{and}\qquad \mathcal{I}_{\bm{\theta}}(a) \;=\; \mathcal{I}^{\nu_T,\ell}_{\bm{\theta}}(a), \end{equation} since these quantities are implicitly defined with respect to the standard loss \(\ell\) and the observed (transformed) distribution \(\nu_T\) from which the training sets are drawn.
Theorem 5.37. Under Assumption 5.31, it holds that \begin{equation} \forall \bm{\theta}\in\bm{\Theta},\quad L^{\nu_T,\ell}(\bm{\theta}) =L^{\nu_0,\ell_G}(\bm{\theta})\quad \text{and} \quad \mathcal{I}^{\nu_T,\ell}_{\bm{\theta}}(a)\leq \mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}(a) \,\,\, \forall a>0\,. \end{equation}
Proof
Under Assumption 5.31, \begin{equation} L^{\nu_0, \ell_G}(\bm{\theta}) = \mathbb{E}_{\nu_0}[\ell_G(\mathbf{y}, \mathbf{x}, \bm{\theta})] = \mathbb{E}_{\nu_0}\mathbb{E}_{g\sim h}[\ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta})] = \mathbb{E}_{\nu_T}[\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})] = L^{\nu_T, \ell}(\bm{\theta})\,. \end{equation}
On the other hand, by definition of the cumulant function, \(J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu_T}[e^{-\lambda\ell(\mathbf{y},\mathbf{x}, \bm{\theta})}] + \lambda\mathbb{E}_{\nu_T}[\ell (\mathbf{y},\mathbf{x}, \bm{\theta})]\). Expanding \(\nu_T\) under Assumption 5.31, \begin{equation} J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu_0} \mathbb{E}_{g\sim h}[e^{-\lambda\ell(\mathbf{y},g(\mathbf{x}), \bm{\theta})}] + \lambda\mathbb{E}_{\nu_0} \mathbb{E}_{g\sim h}[\ell (\mathbf{y},g(\mathbf{x}), \bm{\theta})]\,, \end{equation} where the expectations can be rearranged as \begin{equation} J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\mathbb{E}_{g}[e^{-\lambda\ell(\mathbf{y},g(\mathbf{x}), \bm{\theta})}] + \lambda\mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\mathbb{E}_{g}[\ell (\mathbf{y},g(\mathbf{x}), \bm{\theta})]\,. \end{equation} Applying Jensen’s inequality to the exponential, \begin{equation} J_{\bm{\theta}}(\lambda) \geq \ln \mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\left[e^{-\lambda\mathbb{E}_{g}\left[\ell(\mathbf{y},g(\mathbf{x}), \bm{\theta})\right]}\right] + \lambda\mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\mathbb{E}_{g}[\ell (\mathbf{y},g(\mathbf{x}), \bm{\theta})]\,. \end{equation} By definition of \(\ell_G\), \begin{equation} J_{\bm{\theta}}(\lambda) \geq \ln \mathbb{E}_{\nu_0}[e^{-\lambda\ell_{G}(\mathbf{y},\mathbf{x}, \bm{\theta})}] + \lambda\mathbb{E}_{\nu_0}[\ell_{G} (\mathbf{y},\mathbf{x}, \bm{\theta})] = J^{\nu_0, \ell_G}_{\bm{\theta}}(\lambda)\,. \end{equation} Since the rate function is obtained from the cumulant function through a Legendre transform, which reverses order, this inequality yields \(\mathcal{I}^{\nu_T,\ell}_{\bm{\theta}}(a)\leq \mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}(a)\) for all \(a>0\). □
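The two claims of Theorem 5.37 can be verified exactly in a small finite setting. The sketch below is illustrative only: the three untransformed points, the three transformations, their probabilities, and the loss table of a fixed model are all hypothetical. It compares the standard loss under \(\nu_T\) with the augmented loss \(\ell_G\) under \(\nu_0\).

```python
import math

def cumulant(values, probs, lam):
    """J(lam) = lam * E[loss] + ln E[exp(-lam * loss)] for a finite distribution."""
    mean = sum(p * v for v, p in zip(values, probs))
    mgf = sum(p * math.exp(-lam * v) for v, p in zip(values, probs))
    return lam * mean + math.log(mgf)

# Hypothetical finite setting: 3 untransformed points and 3 transformations.
p_x = [0.5, 0.3, 0.2]      # nu_0 over (x, y) pairs
h = [0.5, 0.25, 0.25]      # sampling distribution over transformations
loss = [[0.10, 0.3, 1.2],  # loss[i][g]: loss of a fixed model on (y_i, g(x_i))
        [0.20, 0.2, 0.9],
        [0.05, 1.5, 2.0]]

# Standard loss under nu_T: one atom per (point, transformation) pair.
std_vals = [loss[i][g] for i in range(3) for g in range(3)]
std_probs = [p_x[i] * h[g] for i in range(3) for g in range(3)]

# Augmented loss under nu_0: average over transformations for each point.
aug_vals = [sum(h[g] * loss[i][g] for g in range(3)) for i in range(3)]
aug_probs = p_x

mean_std = sum(p * v for v, p in zip(std_vals, std_probs))
mean_aug = sum(p * v for v, p in zip(aug_vals, aug_probs))

lams = [0.05 * k for k in range(1, 201)]  # lambda grid on (0, 10]
# Cumulant of the standard loss minus cumulant of the augmented loss.
j_gap = [cumulant(std_vals, std_probs, lam) - cumulant(aug_vals, aug_probs, lam)
         for lam in lams]

print(abs(mean_std - mean_aug) < 1e-12, min(j_gap) >= 0)
```

The expected losses coincide, while the cumulant function of the standard loss dominates that of the augmented loss at every \(\lambda\) on the grid. Under the Legendre form of the rate function, this corresponds to a larger rate function for the augmented loss.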
Figure 5.18 illustrates these findings for a fixed MLP and a fixed Inception model on CIFAR-10. The left panel shows that, for both architectures, the distribution of the empirical loss on transformed data, \(\hat{L}^{\ell}(D_1,\bm{\theta})\) with \(D_1\sim \nu^{50}_1\), is more dispersed than the distribution on untransformed data, \(\hat{L}^{\ell}(D_0,\bm{\theta})\) with \(D_0\sim \nu^{50}_0\). It also shows that the augmented empirical loss on untransformed data, \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\), is more concentrated while sharing the same expectation as \(\hat{L}^{\ell}(D_1,\bm{\theta})\). The center panel further demonstrates that the rate functions of the Inception models across the three setups vary in the anticipated manner. Note that the models are fixed throughout; changes in the empirical-loss distributions arise solely from differences in the data-generating distribution and/or the loss function.
[Figure 5.18. Panels, left to right: Loss distribution; Inception; Inception DA.]
The following result is an adaptation of the PAC-Chernoff bound given in Theorem 5.17 for this setup, which describes the effect of using the data-augmented loss on the generalization error of interpolators.
Corollary 5.38. With probability at least \(1 - \delta\) over \(D_0 \sim \nu^n_0\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} \text{if } \hat{L}^{\ell_G}(D_0, \bm{\theta}) \leq \epsilon \quad\text{then}\quad \big(\mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}\big)^{-1}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq L^{\nu_T,\ell}(\bm{\theta}) \leq \big(\mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}\big)^{-1}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}) + \epsilon\,. \end{equation}
Proof
Under Assumption 5.31, \begin{equation} L^{\nu_0, \ell_G}(\bm{\theta}) = \mathbb{E}_{\nu_0}[\ell_G(\mathbf{y}, \mathbf{x}, \bm{\theta})] = \mathbb{E}_{\nu_0}\mathbb{E}_{g\sim h}[\ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta})] = \mathbb{E}_{\nu_T}[\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})] = L^{\nu_T, \ell}(\bm{\theta})\,, \end{equation} where \(L(\bm{\theta}) := L^{\nu_T, \ell}(\bm{\theta})\). The result then follows by applying Theorem 5.17 to the augmented loss \(\ell_G\) under \(\nu_0\). □
Reformulating the bound for interpolators under the standard loss yields \begin{equation} \text{if }\ \hat{L}^{\ell}(D_T,\bm{\theta}) \le \epsilon,\ \text{ then }\ \bigl(\mathcal{I}^{\nu_T,\ell}_{\bm{\theta}}\bigr)^{-1}\!\left(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\right) \le L^{\nu_T,\ell}(\bm{\theta}) \le \bigl(\mathcal{I}^{\nu_T,\ell}_{\bm{\theta}}\bigr)^{-1}\left(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\right) + \epsilon. \end{equation} Together with Theorem 5.37, which asserts that \begin{equation} \bigl(\mathcal{I}^{\nu_T,\ell}_{\bm{\theta}}\bigr)^{-1}(s)\ \ge\ \bigl(\mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}\bigr)^{-1}(s)\quad \forall\, s>0, \end{equation} it follows that an interpolator trained with the augmented loss attains a smaller generalization error, and hence better generalization. In particular, minimizing \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\) is preferable to minimizing \(\hat{L}^{\ell}(D_T,\bm{\theta})\), since (by Theorem 5.37) the augmented loss induces a more concentrated empirical loss and a smaller inverse rate function; consequently, by Corollary 5.38, the interpolator’s generalization error decreases.
In practice, data augmentation does not minimize \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\) because the unaltered dataset \(D_0\) is unavailable; instead, one minimizes \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\), the augmented loss evaluated on the observed, transformed dataset. If the transformation set \(G\) forms a group and its sampling distribution \(h\) is uniform—assumptions common in data augmentation—then the next result shows that minimizing \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) is effectively equivalent to minimizing \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\).
Proposition 5.39. Under Assumption 5.31, if the transformations \(G\) form a group and their sampling distribution \(h\) is uniform, then \begin{equation} \hat{L}^{\ell_{G}}(D_T, \bm{\theta})=\hat{L}^{\ell_{G}}(D_0, \bm{\theta})\,, \end{equation} where \(D_0\) is any untransformed dataset and \(D_T\) is the corresponding transformed dataset obtained by applying a transformation \(g\sim h\) to each of the samples in \(D_0\).
Proof
By definition of \(\ell_G\), and writing each transformed input as \(\mathbf{x}_i = g_i(\mathbf{x}_{0,i})\), \begin{align} \hat{L}^{\ell_G}(D_T, \bm{\theta}) &= \frac{1}{n}\sum_i \mathbb{E}_{g' \sim h}[\ell(\mathbf{y}_i, g'(\mathbf{x}_i),\bm{\theta})] = \frac{1}{n}\sum_i \int h(g') \ell(\mathbf{y}_i, g'(g_i(\mathbf{x}_{0,i})),\bm{\theta})\ dg'\,. \end{align} Since the set of transformations forms a group, each transformation \(g_i\) that generated an input has an inverse \(g_i^{-1}\in G\). Furthermore, \(l := g' \circ g_i \in G\), so that \(g' = l\circ g_i^{-1}\). Thus, \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta}) = \frac{1}{n}\sum_i \int_{g' = l \circ g_i^{-1}} h(l \circ g_i^{-1}) \ell(\mathbf{y}_i,l \circ g_i^{-1}\circ g_i(\mathbf{x}_{0,i}),\bm{\theta})\ dg'\,. \end{equation} Then, using a change of variables, \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta}) = \frac{1}{n}\sum_i \int_{l} h(l \circ g_i^{-1}) \ell(\mathbf{y}_i,l(\mathbf{x}_{0,i}),\bm{\theta})\ dl\,. \end{equation} Since \(h\) is uniform, \(h(l \circ g_i^{-1}) = h(l)\), and therefore \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta}) = \frac{1}{n}\sum_i \int_{l} h(l)\ell(\mathbf{y}_i,l(\mathbf{x}_{0,i}),\bm{\theta})\ dl =\hat{L}^{\ell_G}(D_0, \bm{\theta})\,. \end{equation} □
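Proposition 5.39 can be illustrated with a finite group. In the sketch below, which uses hypothetical data and a hypothetical model that reads only the first coordinate, \(G\) is the group of cyclic shifts of length-4 vectors with uniform \(h\). For contrast, the same computation is repeated with the subset \(\{g_0, g_1\}\), which is not closed under composition and hence not a group.

```python
def shift(x, s):
    """Cyclic shift g_s of a vector; shifts compose by adding offsets modulo len(x)."""
    return x[s:] + x[:s]

def loss(y, x, theta):
    """Squared error of a linear model (hypothetical choice of loss and model)."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    return (y - pred) ** 2

def augmented_empirical_loss(data, theta, shifts):
    """Dataset average of l_G, with g drawn uniformly from the given shifts."""
    per_sample = [sum(loss(y, shift(x, s), theta) for s in shifts) / len(shifts)
                  for x, y in data]
    return sum(per_sample) / len(data)

def transform(data, applied):
    """Build D_T by applying one shift to each sample of D_0."""
    return [(shift(x, s), y) for (x, y), s in zip(data, applied)]

theta = [1.0, 0.0, 0.0, 0.0]  # reads the first coordinate only
D0 = [([1.0, 2.0, 3.0, 4.0], 1.0),
      ([0.0, 1.0, 0.0, 2.0], 0.0),
      ([2.0, 2.0, 5.0, 1.0], 2.0)]

# Full cyclic group with uniform h: the two augmented losses coincide.
group = [0, 1, 2, 3]
DT_group = transform(D0, [1, 3, 2])
gap_group = abs(augmented_empirical_loss(DT_group, theta, group)
                - augmented_empirical_loss(D0, theta, group))

# Restricted set of shifts (not a group): the equality can fail.
subset = [0, 1]
DT_subset = transform(D0, [1, 0, 1])
gap_subset = abs(augmented_empirical_loss(DT_subset, theta, subset)
                 - augmented_empirical_loss(D0, theta, subset))

print(gap_group, gap_subset)
```

With the full group, the augmented empirical losses on \(D_T\) and \(D_0\) are identical because averaging over all shifts of a shifted input visits the same orbit. With the restricted set, the two losses differ, which is the situation faced by augmentation schemes that fall outside the group setting.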
When multiple transformations \(g_1,\ldots,g_T\) are applied, the analysis extends directly as a sequence of equalities provided that each transformation set \(G_1,\ldots,G_T\) forms a group and is sampled from a uniform distribution \(h_1,\ldots,h_T\). From this perspective, the approach can be combined with the insights on invariant architectures from the previous section: the two techniques can address different subsets of a chain of transformations. For example, convolutional neural networks naturally handle translations, while data augmentation can target other types of transformations.
Theorem 5.37 establishes that, for a dataset \(D_T\sim\nu_T^n\), the augmented empirical loss \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) is more concentrated than the standard empirical loss \(\hat{L}^{\ell}(D_T,\bm{\theta})\). Furthermore, by the Chernoff bound in Corollary 5.38, interpolators trained with the augmented loss exhibit reduced generalization error. To the best of current knowledge, this provides the strongest explanation of why data augmentation improves the generalization of interpolators, achieved via distribution-dependent bounds. In brief: Theorem 5.37 shows that the augmented loss increases concentration of the empirical loss, and Corollary 5.38 then implies a smaller generalization error for interpolators under the augmented loss.
Not all data-augmentation transformations form a group. For instance, rotations restricted to \([-20^\circ,20^\circ]\) do not constitute a group, since composing \(20^\circ\) with \(10^\circ\) yields \(30^\circ\), which lies outside the range. A rotation group would require the full range (e.g., \([-360^\circ,360^\circ]\)), which is rarely used in practice. Similarly, random cropping does not yield a group. In such cases, the analysis can proceed under the approximation \begin{equation} \hat{L}^{\ell_G}(D_T,\bm{\theta}) \approx \hat{L}^{\ell_G}(D_0,\bm{\theta})\,. \end{equation} Consequently, minimizing \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\) effectively also minimizes \(\hat{L}^{\ell_G}(D_0,\bm{\theta})\), which is more concentrated and therefore admits interpolators with smaller generalization error.
The next result provides an alternative treatment for transformations that do not form a group. It requires the existence of inverses within \(G\): for every \(g\in G\) there is \(g^{-1}\in G\). Rotations in \([-20^\circ,20^\circ]\) satisfy this condition because the inverse of any angle \(a\in[-20,20]\) is \(-a\in[-20,20]\). Under this assumption, if \(\bm{\theta}\) is a perfect interpolator for \(\hat{L}^{\ell_G}(D_T,\bm{\theta})\)—the empirical objective optimized in practice—then \(\bm{\theta}\) is also a perfect interpolator for \(\hat{L}^{\ell}(D_0,\bm{\theta})\). Here, “perfect interpolator” denotes a model that attains the minimal achievable empirical loss.
Proposition 5.40. Given an untransformed dataset \(D_0\), for any transformed dataset \(D_T\) derived from \(D_0\) under Assumption 5.31, if \begin{equation} m_{\bm{\theta}}:=\operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, g(\mathbf{x}),\bm{\theta}) \quad \forall g \in G\,, \end{equation} and every \(g \in G\) has an inverse \(g^{-1} \in G\), then \begin{equation} \forall \bm{\theta}\in\bm{\Theta} \quad \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \hat{L}^\ell(D_0, \bm{\theta})= m_{\bm{\theta}}\,. \end{equation}
Proof
First, by definition, \(m_{\bm{\theta}} = \operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta})\). Then, \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell_G(\mathbf{y}, \mathbf{x}, \bm{\theta}) = \operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall (\mathbf{x}, \mathbf{y}) \in D_T\,. \end{equation} Thus, by definition of \(\ell_G\), \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \mathbb{E}_{g}[\ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta})] = \operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall (\mathbf{x}, \mathbf{y}) \in D_T\,. \end{equation} Now, since \(\operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, g(\mathbf{x}),\bm{\theta}) = m_{\bm{\theta}}\) for all \(g \in G\), attaining the essential infimum in expectation means that every loss inside the expectation attains it: \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta}) = \operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall g\in G, \forall (\mathbf{x}, \mathbf{y}) \in D_T\,. \end{equation} For every \((\mathbf{x}_0, \mathbf{y}) \in D_0\) there exists \(g' \in G\) associated with that input such that \(\mathbf{x} = g'(\mathbf{x}_0)\), so that \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell(\mathbf{y}, g\circ g'(\mathbf{x}_0), \bm{\theta}) = \operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall g\in G, \forall (\mathbf{x}_0, \mathbf{y}) \in D_0\,. \end{equation} Then, since \(g'^{-1} \in G\), choosing \(g = g'^{-1}\) gives \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell(\mathbf{y}, \mathbf{x}_0, \bm{\theta}) = \operatorname*{ess\,inf}_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall (\mathbf{x}_0, \mathbf{y}) \in D_0\,. \end{equation} As a result, the empirical loss \(\hat{L}^{\ell}(D_0, \bm{\theta})\) is also equal to \(m_{\bm{\theta}}\). □
This result implies that if a model achieves perfect interpolation on the transformed dataset \(D_T\) using the augmented loss, it concurrently achieves perfect interpolation on the original, untransformed dataset \(D_0\) under the standard loss. This means that when the learning algorithm identifies an interpolator utilizing the augmented loss (as typically occurs in practice), it is essentially also identifying an interpolator for the untransformed dataset \(D_0\). This scenario parallels the one encountered with invariant architectures, where the algorithm retrieves interpolators for the untransformed dataset. As highlighted at the outset of Section 5.3.7, interpolators for the untransformed dataset \(D_0\) are associated with a lower generalization error compared to those for the transformed dataset \(D_T\).
Figure 5.18 (right) shows the rate functions of an Inception model trained with data augmentation that achieves almost perfect interpolation, with \(\hat{L}^{\ell_G}(D_T, \bm{\theta})\approx m_{\bm{\theta}}\). Notably, the model’s rate functions are nearly indistinguishable, indicating that the model has become nearly invariant to rotations as well. According to Proposition 5.36, if a model is invariant to rotations, then \({\cal I}^{\nu_0,\ell}_{\bm{\theta}}(a) = {\cal I}^{\nu_1,\ell}_{\bm{\theta}}(a)\), which is consistent with the outcome of this experiment.
Within this framework—unlike prior methodologies—it is possible to unify invariant architectures, data augmentation, and the explicit regularization strategies of Section 5.3.6. In particular, this perspective explains why explicit regularization, when combined with invariance or augmentation, leads to interpolators with smaller generalization error, a relationship already observed empirically in Figure 5.9. The key step is to extend the analysis of Section 5.3.6 to the inverse rate functions \(\bigl({\cal I}^{\nu_0}_{\bm{\theta}}\bigr)^{-1}\!\bigl(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\bigr)\) and \(\bigl(\mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}\bigr)^{-1}\!\bigl(\tfrac{1}{n}\log\tfrac{k^p}{\delta}\bigr)\), associated with invariant architectures and data augmentation, respectively. For example, applying Proposition 5.27 to these inverse rate functions clarifies the benefit of \(\ell_2\)-norm or distance-from-initialization regularization in the presence of invariance and/or augmentation; likewise, Proposition 5.30 elucidates the advantage of input-gradient normalization in these settings.
Over-Parameterization and Smooth Interpolation
Many theoretical works have established a direct link between over-parameterization and the strong performance of interpolators (Bubeck et al., 2021; Neyshabur et al., 2019). In particular, the study of Bubeck & Sellke (2023), under an isoperimetry assumption, shows that any model capable of interpolating the training data below the noise threshold must possess a (Euclidean) Lipschitz constant on the order of at least \(\sqrt{nd/p}\), where \(d\) is the ambient data dimension. Using the present notation, say that a model \(\bm{\theta}\) interpolates below the noise threshold if \(\hat{L}(D,\bm{\theta}) \le L^\star := \min_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta})\), and let \(Lip(\bm{\theta})\) denote the Lipschitz constant as in Equation \(\eqref{eq:Lip}\). Then Theorem 1 of Bubeck & Sellke (2023) can be informally stated as follows: under an isoperimetry assumption, for any \(\epsilon\in(0,L^\star)\) and \(\delta\in(0,1)\), with probability at least \(1-\delta\) over \(D\sim \nu^n\), for all \(\bm{\theta}\in\bm{\Theta}\), \begin{equation} \text{if}\quad \hat{L}(D,\bm{\theta})\le \epsilon \quad \text{then}\quad Lip(\bm{\theta}) \;\ge\; \Omega\Big((L^\star-\epsilon)\sqrt{\tfrac{nd}{p}}\Big). \end{equation} This implies that, to maintain \(O(1)\) Lipschitz constants for models that continue to interpolate as \(n\) grows, the number of parameters \(p\) must also increase. In Bubeck & Sellke (2023), the Lipschitz constant serves as a proxy for smoothness; their findings therefore identify over-parameterization as a key prerequisite for interpolators to achieve small Lipschitz constants and, consequently, improved generalization.
The PAC–Chernoff perspective affords a more refined understanding of the role of over-parameterization in the generalization of interpolators. In fact, the following result shows that over-parameterization emerges naturally from the PAC–Chernoff bound of Theorem 5.17. Specifically, a lower bound on the number of parameters of a model class is derived in terms of the rate function of a model that interpolates below the noise threshold.
Theorem 5.41. For any \(\epsilon\in(0, L^\star)\) and any \(\delta \in (0,1)\), with high probability \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \text{if} \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad \text{then}\quad p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star-\epsilon) + \log\delta}{\log k}\,. \end{equation}
Proof
Since \(\hat{L}(D,\bm{\theta})\leq \epsilon\), Theorem 5.17 gives that, with high probability \(1-\delta\) over \(D\sim\nu^n(y,x)\), \begin{equation} L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} Rearranging terms yields \(L(\bm{\theta})-\epsilon \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Applying the rate function to both sides, \begin{equation} \mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) \leq \tfrac{1}{n}\ln\tfrac{k^p}{\delta} = \frac{1}{n}\left(\ln k^p - \ln \delta \right)\,. \end{equation} Hence \(\ln k^p \geq n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta\) and \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\,. \end{equation} As \(L^\star\leq L(\bm{\theta})\) and the rate function is monotonically increasing, \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star - \epsilon) + \ln\delta}{\ln k}\,. \end{equation} □
The above bound connects the smoothness of an interpolator, measured by the rate function as discussed in Section 5.3.4, and the minimum number of parameters in the model class. As the size of the training dataset increases, the number of parameters must also increase linearly to maintain the same degree of smoothness in the model. This result generalizes the results of Bubeck & Sellke (2023) and does not require an isoperimetry assumption.
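The linear dependence on \(n\) can be made concrete by evaluating the right-hand side of Theorem 5.41 numerically. The following Python sketch is illustrative only: the function name `min_parameters` and all numerical values are hypothetical, the value of the rate function at \(L^\star-\epsilon\) is assumed to be given, and \(k\) is read as the number of values each parameter can take (so that the model class has \(k^p\) elements), an interpretation assumed from the \(k^p\) term in the bound.

```python
import math


def min_parameters(n, rate_at_gap, delta, k):
    """Right-hand side of Theorem 5.41: (n * I(L* - eps) + ln(delta)) / ln(k).

    n           -- number of training samples
    rate_at_gap -- value of the rate function at L* - eps (assumed given)
    delta       -- confidence parameter in (0, 1)
    k           -- assumed number of values per parameter (|Theta| = k**p)
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (n * rate_at_gap + math.log(delta)) / math.log(k)


# Hypothetical values, chosen only to show how the bound scales with n.
for n in (10_000, 20_000, 40_000):
    print(n, min_parameters(n, rate_at_gap=0.05, delta=0.05, k=2**32))
```

For a fixed rate-function value, each additional sample raises the bound by \(\mathcal{I}_{\bm{\theta}}(L^\star-\epsilon)/\ln k\), which is the linear growth discussed above.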
The principal advantage of these findings is that they forge a connection between over-parameterization, interpolation, and the preceding analyses of model smoothness and rate functions. This linkage shows that the smooth interpolation perspective of Bubeck & Sellke (2023)—framed via isoperimetry and Lipschitz continuity—constitutes one among several approximations to the characterization in Theorem 5.41. As discussed in Section 5.3.6, parameter norms, distance from initialization, input-gradient norms, and the Lipschitz constant serve as proxies for the (inverse) rate function and, by extension, for smoothness. In particular, Bubeck & Sellke (2023)’s characterization of smooth interpolation can be recovered directly from Theorem 5.41 under the same isoperimetry assumption; analogous derivations follow under log-concavity by the techniques used in Proposition 5.30 together with Equation \(\eqref{eq:Lip}\).
Corollary 5.42. If \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is Lipschitz w.r.t. \(\mathbf{x}\) with constant \(Lip(\bm{\theta})\) and satisfies a \(c\)-isoperimetry assumption, then for any \(\epsilon\in(0, L^\star)\) and any \(\delta \in (0,1)\), with high probability \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \text{if} \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad \text{then}\quad Lip(\bm{\theta})\geq \sqrt{\tfrac{nd} {2c(p\log k - \log \delta)}}(L^\star - \epsilon) \,. \end{equation}
Proof
Under the isoperimetry assumption, the rate function is lower bounded as \(\mathcal{I}_{\bm{\theta}}(a) \geq \frac{d a^2}{2 c Lip(\bm{\theta})^2 }\). Since \(\hat{L}(D,\bm{\theta})\leq \epsilon\), Theorem 5.17 gives that, with high probability \(1-\delta\) over \(D\sim\nu^n(y,x)\), \begin{equation} L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} Rearranging terms yields \(L(\bm{\theta})-\epsilon \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Applying the rate function to both sides, \begin{equation} \mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) \leq \tfrac{1}{n}\ln\tfrac{k^p}{\delta} = \frac{1}{n}\left(\ln k^p - \ln \delta \right)\,. \end{equation} Hence \(\ln k^p \geq n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta\) and \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\,. \end{equation} As \(L^\star\leq L(\bm{\theta})\) and the rate function is monotonically increasing, \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star - \epsilon) + \ln\delta}{\ln k}\,. \end{equation} Using the lower bound on the rate function stated at the beginning of the proof, we obtain \begin{equation} p \geq \frac{n\frac{d(L^\star - \epsilon)^2}{2cLip(\bm{\theta})^2} + \ln\delta}{\ln k}\,. \end{equation} Rearranging terms, \begin{equation} \sqrt{\frac{nd(L^\star - \epsilon)^2} {2c(p\ln k - \ln \delta)}} \leq Lip(\bm{\theta})\,. \end{equation} □
Similarly, a bound in terms of the parameter norm or the distance from initialization can be derived.
Corollary 5.43. If the loss function \(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\) is Lipschitz w.r.t. \(\bm{\theta}\) with constant \(M > 0\), then, for any \(\bm{\theta}_0 \in \bm{\Theta}_0 = \{\bm{\theta} \in \bm{\Theta} \ | \ \mathbb{V}_{\nu}(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) = 0\} \subset \bm{\Theta}\), any \(\epsilon\in(0, L^\star)\) and any \(\delta \in (0,1)\), with high probability \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \text{if} \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad \text{then}\quad \|\bm{\theta}-\bm{\theta}_0\|_2\geq \sqrt{\tfrac{n}{8M(p\log k - \log \delta)}} (L^\star - \epsilon)\,. \end{equation}
Proof
If the loss is Lipschitz continuous with constant \(M\), \(\forall y,x,\bm{\theta}\quad \|\nabla_{\bm{\theta}} \ell(\mathbf{y},\mathbf{x},\bm{\theta})\|^2_2\leq M\). Then, \(J_{\bm{\theta}}(\lambda)\) satisfies \begin{equation} \|\nabla_{\bm{\theta}}J_{\bm{\theta}}(\lambda)\|^2_2=||-\lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] + \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}) \right]||_2^2 \leq 2M \lambda^2\,, \end{equation} where \(\mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] = \frac{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})]}{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda]}\). With this, and since \(J_{\bm{\theta}_0}(\lambda) = 0\) for any \(\bm{\theta}_0 \in \bm{\Theta}_0\) (the loss of such a model has zero variance), we have \begin{equation} |J_{\bm{\theta}}(\lambda) - J_{\bm{\theta}_0}(\lambda)|\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2 \implies J_{\bm{\theta}}(\lambda)\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2\,. \end{equation} Then, for any \(a \geq 0\), we have by definition of the rate function that \begin{equation} \mathcal{I}_{\bm{\theta}}(a) \geq a\lambda - J_{\bm{\theta}}(\lambda) \geq a\lambda-2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2\,. \end{equation} As the inequality holds for any \(\lambda > 0\), maximizing the right-hand side yields \(\lambda^\star = \frac{a}{4M\|\bm{\theta}-\bm{\theta}_0\|^2_2}\). Thus, \begin{equation} \mathcal{I}_{\bm{\theta}}(a) \geq a\lambda^\star - J_{\bm{\theta}}(\lambda^\star) \geq a\frac{a}{4M\|\bm{\theta}-\bm{\theta}_0\|^2_2}-2M \frac{a^2}{(4M\|\bm{\theta}-\bm{\theta}_0\|^2_2)^2}\|\bm{\theta}-\bm{\theta}_0\|^2_2 = \frac{a^2}{8M\|\bm{\theta}-\bm{\theta}_0\|^2_2}\,. \end{equation} Since \(\hat{L}(D,\bm{\theta})\leq \epsilon\), Theorem 5.17 gives that, with high probability \(1-\delta\) over \(D\sim\nu^n(\mathbf{x}, \mathbf{y})\), \(L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Rearranging terms yields \(L(\bm{\theta})-\epsilon \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Applying the rate function to both sides, \begin{equation} \mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) \leq \tfrac{1}{n}\ln\tfrac{k^p}{\delta} = \frac{1}{n}\left(\ln k^p - \ln \delta \right)\,. \end{equation} Hence \(\ln k^p \geq n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta\) and \(p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\). As \(L^\star\leq L(\bm{\theta})\) and the rate function is monotonically increasing, \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star - \epsilon) + \ln\delta}{\ln k}\,. \end{equation} Using the lower bound on the rate function obtained above, we get \begin{equation} p \geq \frac{n\frac{(L^\star - \epsilon)^2}{8M\|\bm{\theta}-\bm{\theta}_0\|^2_2} + \ln\delta}{\ln k} \implies (L^\star - \epsilon)\sqrt{\frac{n}{8M(p\ln k - \ln \delta)}} \leq \|\bm{\theta}-\bm{\theta}_0\|_2\,. \end{equation} □
Corollary 5.42 and Corollary 5.43 imply that, as the number of samples \(n\) increases, to have smooth interpolators with a small Lipschitz constant, or a small parameter norm, or a small distance from initialization, the number of parameters of the model class must be increased too. In any case, all these results are approximations of the general result given in Theorem 5.41. Note that a similar result could be derived in the context of input-gradient norms using Proposition 5.30. Interpolating with a small parameter norm, or a small distance from initialization, or a small input-gradient norm requires over-parameterization.
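The trade-off between \(n\) and \(p\) in Corollaries 5.42 and 5.43 can be checked with a few lines of Python. This is a minimal sketch that only evaluates the two right-hand sides; the function names and all numerical values are hypothetical, and `gap` stands for \(L^\star-\epsilon\).

```python
import math


def lipschitz_lower_bound(n, d, p, k, delta, c, gap):
    """Corollary 5.42: sqrt(n*d / (2*c*(p*ln k - ln delta))) * (L* - eps)."""
    return math.sqrt(n * d / (2.0 * c * (p * math.log(k) - math.log(delta)))) * gap


def distance_lower_bound(n, p, k, delta, M, gap):
    """Corollary 5.43: sqrt(n / (8*M*(p*ln k - ln delta))) * (L* - eps)."""
    return math.sqrt(n / (8.0 * M * (p * math.log(k) - math.log(delta)))) * gap


# Hypothetical setting: both lower bounds shrink as the parameter count grows.
for p in (10_000, 100_000, 1_000_000):
    lip = lipschitz_lower_bound(n=50_000, d=3072, p=p, k=2**32,
                                delta=0.05, c=1.0, gap=0.1)
    dist = distance_lower_bound(n=50_000, p=p, k=2**32,
                                delta=0.05, M=1.0, gap=0.1)
    print(p, lip, dist)
```

Both bounds scale as \(\sqrt{n/p}\) up to logarithmic terms, so keeping them bounded as \(n\) grows requires \(p\) to grow at least proportionally.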
Remarkably, Theorem 5.41 can also be connected to invariant architectures and data augmentation (Section 5.3.7). In particular, combining Theorem 5.37—which asserts that the data-augmented loss induces a higher rate function—with Theorem 5.41 implies that interpolation under data augmentation will, beyond a certain point, necessitate a larger number of parameters. An analogous conclusion holds for more invariant architectures, which likewise yield models with higher rate functions. Consequently, interpolating with data augmentation and/or an invariant architecture requires over-parameterization.
To the best of current knowledge, no prior results establish this type of relationship between over-parameterization and smooth interpolators.
Experimental Settings
This appendix elaborates on the experimental setup, detailing the model architectures, hyper-parameters, and training procedures used for each figure.
Learning Settings for Figures
The code accompanying these experiments is available at https://github.com/Ludvins/2024_PAC-Chernoff-Bound. Unless otherwise noted, experiments employ a compact InceptionV3 architecture (Szegedy et al., 2016) following Zhang et al. (2017), trained on the CIFAR-10 dataset (Krizhevsky et al., 2009). Before presenting the exact specification of this “small Inception” model (adapted from Zhang et al., 2017), its constituent modules are described below.
Convolutional module: Convolutional layer, batch-normalization and ReLU activation.
Inception module with output channels \(o_{1\times 1}\) and \(o_{3\times 3}\): Consists of two convolutional layers, one with kernel \(1 \times 1\) and \(o_{1\times 1}\) output channels and another with kernel \(3 \times 3\) and \(o_{3\times 3}\) output channels. The outputs of these layers are then concatenated, so the total number of output channels is \(o_{1\times 1} + o_{3\times 3}\).
Downsample module: Convolutional module with kernel size \(3\), stride \(2\) and padding \(0\), and MaxPooling with kernel size \(3\) and stride \(2\). The outputs of these two layers are concatenated.
With these elements, the architecture of the small InceptionV3 network is:
Convolutional module with \(96\) output channels, kernel size \(3\), stride \(1\) and padding \(0\).
Inception Module with \(o_{1 \times 1} = 32\) and \(o_{3 \times 3} = 32\).
Inception Module with \(o_{1 \times 1} = 32\) and \(o_{3 \times 3} = 48\).
DownSample Module with \(o_{3 \times 3} = 80\).
Inception Module with \(o_{1 \times 1} = 112\) and \(o_{3 \times 3} = 48\).
Inception Module with \(o_{1 \times 1} = 96\) and \(o_{3 \times 3} = 64\).
Inception Module with \(o_{1 \times 1} = 80\) and \(o_{3 \times 3} = 80\).
Inception Module with \(o_{1 \times 1} = 48\) and \(o_{3 \times 3} = 96\).
DownSample Module with \(o_{3 \times 3} = 96\).
Inception Module with \(o_{1 \times 1} = 176\) and \(o_{3 \times 3} = 160\).
Inception Module with \(o_{1 \times 1} = 176\) and \(o_{3 \times 3} = 160\).
Adaptive Average Pooling layer with output size \(7 \times 7\).
Fully connected layer from \(16464\) to the number of classes (i.e., \(10\)).
The total number of parameters of this model is \(1{,}814{,}106\).
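The layer list above can be cross-checked by simple bookkeeping of channels and parameters. The Python sketch below is not the code used in the experiments; it assumes that every convolution has a bias term, that every batch-normalization layer has a learnable scale and shift, that the \(3 \times 3\) convolutions inside Inception modules preserve the spatial size, and that the pooled feature map is \(7 \times 7\). The helper names are hypothetical.

```python
def conv_module_params(c_in, c_out, kernel):
    """Parameters of conv (weights + bias) followed by batch norm (scale + shift)."""
    return c_in * kernel * kernel * c_out + c_out + 2 * c_out


def count_small_inception(num_classes=10, pooled=7):
    """Return (flattened features, total parameters) for the listed architecture."""
    layers = [
        ("conv", 96),
        ("inception", 32, 32),
        ("inception", 32, 48),
        ("down", 80),
        ("inception", 112, 48),
        ("inception", 96, 64),
        ("inception", 80, 80),
        ("inception", 48, 96),
        ("down", 96),
        ("inception", 176, 160),
        ("inception", 176, 160),
    ]
    channels, total = 3, 0  # RGB input
    for layer in layers:
        if layer[0] == "conv":
            total += conv_module_params(channels, layer[1], 3)
            channels = layer[1]
        elif layer[0] == "inception":
            o1, o3 = layer[1], layer[2]
            total += conv_module_params(channels, o1, 1)
            total += conv_module_params(channels, o3, 3)
            channels = o1 + o3  # the two branches are concatenated
        else:  # downsample: conv branch concatenated with parameter-free max pooling
            total += conv_module_params(channels, layer[1], 3)
            channels = layer[1] + channels
    features = channels * pooled * pooled
    total += features * num_classes + num_classes  # fully connected layer
    return features, total


features, total = count_small_inception()
print(features, total)
```

Under these assumptions the sketch yields \(16464\) input features for the fully connected layer and \(1{,}814{,}106\) parameters in total, matching the figures reported above.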
Figure 5.9.
For this experiment, all Inception models were trained using SGD with momentum \(0.9\) and learning rate \(0.01\) with exponential decay of \(0.95\). All models are trained for \(30{,}000\) iterations with batches of size \(200\) or until the train loss is under \(0.005\). These settings are selected to ensure that the random label model converges to an interpolator. Random cropping is employed using the RandomResizedCrop function of torchvision with scale \((0.8, 1.0)\) and ratio \((0.9, 1.1)\). For \(\ell_2\) regularization, the multiplicative factor is \(0.01\).
Figure 5.10.
For this figure, the Standard, L2-Crop, and Initial models from Figure 5.9 are used. Subsets of size \(n=50\) of the CIFAR-10 test split are used to approximate samples of the data-generating distribution and to build the histograms.
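The histogram construction can be sketched as follows. This is an illustrative Python snippet, not the experimental code: it assumes that per-example test losses are already available as a list, that each subset is drawn uniformly at random without replacement, and that `1000` subsets are used; the function name, the number of subsets, and the toy losses are all hypothetical.

```python
import random


def subset_empirical_losses(per_example_losses, n=50, num_subsets=1000, seed=0):
    """Empirical loss on `num_subsets` random subsets of size `n`."""
    rng = random.Random(seed)
    return [
        sum(rng.sample(per_example_losses, n)) / n
        for _ in range(num_subsets)
    ]


# Hypothetical per-example test losses standing in for a trained model.
toy_rng = random.Random(1)
toy_losses = [toy_rng.expovariate(1.0) for _ in range(10_000)]

empirical = subset_empirical_losses(toy_losses)
print(min(empirical), sum(empirical) / len(empirical), max(empirical))
```

A histogram of `empirical` then approximates the distribution of \(\hat{L}(D,\bm{\theta})\) for datasets of size \(n=50\).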
Figure 5.13.
For this figure, a generalization of LeNet5 is used (three convolutional layers and two fully connected layers with ReLU activation and average pooling), where the number of channels of the convolutional layers is parameterized by \(k\). Precisely, the first layer has \(3\) input and \(\lfloor 6k \rceil\) output channels; the second layer \(\lfloor 6k \rceil\) input and \(\lfloor 16k \rceil\) output channels; and the last layer \(\lfloor 16k \rceil\) input and \(\lfloor 120k \rceil\) output channels. The set of models is created by varying \(k\) from \(0.2\) to \(4.9\) in steps of \(0.1\), yielding models that range from \(7k\) to \(1.2M\) parameters. Each of these models is then trained until the train loss is lower than \(0.01\) or until the train loss has not decreased in two epochs (this only happens in the smallest models).
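The family of channel widths can be enumerated with a short Python sketch. The helper names are hypothetical, and \(\lfloor\cdot\rceil\) is interpreted as rounding to the nearest integer (implemented here as round-half-up, which is an assumption; no ties occur for the values of \(k\) considered).

```python
import math


def round_nearest(x):
    """Round to the nearest integer (half-up), used for the widths below."""
    return math.floor(x + 0.5)


def lenet_widths(k):
    """Output channels of the three convolutional layers for width factor k."""
    return round_nearest(6 * k), round_nearest(16 * k), round_nearest(120 * k)


# k = 0.2, 0.3, ..., 4.9, built from integers to avoid accumulating float error.
width_factors = [m / 10 for m in range(2, 50)]
configs = {k: lenet_widths(k) for k in width_factors}

print(len(configs), configs[0.2], configs[1.0], configs[4.9])
```

With \(k=1\) the sketch recovers the standard LeNet5 widths \((6, 16, 120)\).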
Figure 5.14.
The rate function of a subset of all the models in Figure 5.13 is computed here.
Figure 5.16.
The batch size is fixed to \(250\) and images are standardized (this was necessary to improve learning in the MLP model). The MLP has 3 hidden layers with \(512\) units each, for a total of \(1{,}735{,}178\) parameters. All models are trained until the interpolation regime, that is, until the train loss is under \(0.015\), which, in the worst case, required \(20{,}000\) iterations for the MLP. Inception models are trained using a learning rate of \(0.001\) whereas MLP models use \(0.1\), both with \(0.9\) momentum and \(0.95\) exponential decay. Regarding the data, \(D_0\) is the CIFAR-10 test set, \(D_1\) is the result of performing random translations of \(5\%\), and \(D_2\) considers random translations of \(5\%\) and rotations of up to \(20\%\). Both transformations are computed using the RandomAffine function of torchvision.
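The reported MLP size can be reproduced by counting weights and biases of the fully connected layers. The sketch below is a sanity check rather than the original code; it assumes flattened \(28 \times 28 \times 3\) inputs (the input resolution is not stated in this appendix, and this is the value that matches the reported total) and a 10-way output layer. The function name is hypothetical.

```python
def mlp_parameter_count(input_dim, hidden_units, num_hidden, num_classes):
    """Weights plus biases of a fully connected network."""
    sizes = [input_dim] + [hidden_units] * num_hidden + [num_classes]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))


# Assumed input: flattened 28 x 28 RGB crops.
print(mlp_parameter_count(28 * 28 * 3, 512, 3, 10))
# For comparison, full-resolution 32 x 32 RGB inputs.
print(mlp_parameter_count(32 * 32 * 3, 512, 3, 10))
```

The first call gives \(1{,}735{,}178\), in agreement with the figure above, whereas \(32 \times 32 \times 3\) inputs would give \(2{,}103{,}818\).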
Figure 5.17.
The model specification and training setup are the same as in Figure 5.16. Regarding the data, the pixels were shuffled using a random permutation generated with NumPy; the dataset was fully permuted and stored as a new dataset.
Figure 5.18.
The model specification and training setup are the same as in Figure 5.16. \(D_1\) considers random translations of \(5\%\) and rotations of up to \(20\%\) (the same as \(D_2\) in Figure 5.16). Data augmentation was produced using the same transformations as those that define \(D_1\).
Estimating the Cumulant and Rate Function
From the definition of the cumulant function \(J_{\bm{\theta}}(\lambda)\), \begin{equation} J_{\bm{\theta}}(\lambda) = \log \mathbb{E}_{\nu}\left[e^{\lambda (L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))}\right] = \log \mathbb{E}_{\nu}\left[ p(\mathbf{y}|\mathbf{x},\bm{\theta})^\lambda\right] - \mathbb{E}_{\nu}[\log p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda]\,, \end{equation} it is clear that computing its true value requires access to the true data-generating distribution \(\nu\). However, in real-world problems, this distribution is unknown and inaccessible.
The machine learning community commonly approximates quantities of this kind (such as the expected loss \(L(\bm{\theta})\)) using a separate validation dataset \(D^{val}\). Given the large amount of data available in current problems, this approach is feasible and leads to \begin{equation} \label{eq:cummulant_estimation} J_{\bm{\theta}}(\lambda) \approx \log\left(\frac{1}{M}\sum_{(\mathbf{x}, \mathbf{y}) \in D^{val}} p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \right) - \frac{1}{M}\sum_{(\mathbf{x}, \mathbf{y}) \in D^{val}}\log p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda\,, \end{equation} where \(M = |D^{val}|\) is the size of the validation set. It is important to notice that this estimator is biased because of the first term and Jensen’s inequality. In fact, \begin{equation} \begin{aligned} \mathbb{E}_{D^{val}}\left[\log\left(\frac{1}{M}\sum_{(\mathbf{x}, \mathbf{y}) \in D^{val}} p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \right)\right] &\leq \log\left( \mathbb{E}_{D^{val}} \left[\frac{1}{M}\sum_{(\mathbf{x}, \mathbf{y}) \in D^{val}} p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \right]\right)\\ &= \log\mathbb{E}_{\nu}\left[ p(\mathbf{y}|\mathbf{x},\bm{\theta})^\lambda\right]\,. \end{aligned} \end{equation} As a result, if the size of \(D^{val}\) is not large enough, the cumulant function might be underestimated.
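The downward bias of the first term can be observed exactly on a toy example. The Python sketch below is purely illustrative: it replaces \(\nu\) by a small, hypothetical finite population of values of \(p(\mathbf{y}|\mathbf{x},\bm{\theta})^\lambda\) and models a validation set of size \(M\) as a uniformly drawn subset, so that the expectation over validation sets can be computed by enumerating all subsets.

```python
import itertools
import math


def expected_log_mean(values, subset_size):
    """Average of log(subset mean) over all subsets of the given size."""
    subsets = list(itertools.combinations(values, subset_size))
    return sum(math.log(sum(s) / subset_size) for s in subsets) / len(subsets)


# Hypothetical values of p(y|x, theta)**lam on a tiny "population".
population = [0.9, 0.8, 0.5, 0.2, 0.05, 0.01]
true_value = math.log(sum(population) / len(population))

for m in (1, 2, 4, 6):
    print(m, expected_log_mean(population, m), true_value)
```

For every subset size the averaged estimate stays below the population value, and the two coincide only when the whole population is used, which is the behaviour predicted by Jensen's inequality.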
Regarding numerical stability, evaluating the estimator in Equation \(\eqref{eq:cummulant_estimation}\) directly can be unstable because it operates on raw probabilities. Using log-probabilities and log-sum-exp operations is preferable: \begin{equation} J_{\bm{\theta}}(\lambda) \approx \log\left(\sum_{(\mathbf{x}, \mathbf{y}) \in D^{val}} \exp (\lambda \log p(\mathbf{y}|\mathbf{x}, \bm{\theta})) \right) - \log M - \frac{1}{M}\sum_{(\mathbf{x}, \mathbf{y}) \in D^{val}}\lambda \log p(\mathbf{y}|\mathbf{x}, \bm{\theta})\,. \end{equation} In practice, it suffices to compute the log-probabilities of the model (for example, by applying a log-softmax to the logits of a NN instead of a softmax), multiply them by \(\lambda\), and compute the mean and the log-sum-exp of these quantities.
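The log-sum-exp form of the estimator translates directly into code. The following is a minimal pure-Python sketch, assuming that the log-probabilities \(\log p(\mathbf{y}|\mathbf{x},\bm{\theta})\) of the validation examples have already been collected in a list; the function names and the toy inputs are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) obtained by shifting by the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cumulant_estimate(log_probs, lam):
    """Validation-set estimate of J(lam) from per-example log-probabilities."""
    size = len(log_probs)
    scaled = [lam * lp for lp in log_probs]
    return log_sum_exp(scaled) - math.log(size) - sum(scaled) / size


# Hypothetical log-probabilities; the second call would underflow if raw
# probabilities were exponentiated directly.
print(cumulant_estimate([-0.1, -2.0, -5.0], lam=1.0))
print(cumulant_estimate([-1000.0, -1001.0], lam=1.0))
```

The estimate is zero when all log-probabilities coincide and non-negative otherwise, as expected from Jensen's inequality applied to the empirical distribution.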
Once the cumulant function has been approximated, computing the rate function relies on finding the optimal value of \(\lambda\), \begin{equation} \mathcal{I}_{\bm{\theta}}(a) = \sup_{\lambda > 0} \lambda a - J_{\bm{\theta}}(\lambda)\,. \end{equation} In the conducted experiments, optimizing \(\lambda\) with automatic optimization proved very unstable. Thus, the recommended (and used) method is a binary search. Given a fixed range \([\lambda_{min}, \lambda_{max}]\) in which to optimize \(\lambda\), a binary search has complexity \(\mathcal{O}(\log_{2}(\lambda_{max} - \lambda_{min}))\) for a fixed tolerance. If (due to the nature of the problem) the required value of \(\lambda_{max}\) is too large, one may perform the binary search in \([\log(\lambda_{min}), \log(\lambda_{max})]\), which has the same complexity but makes it easier to consider larger values of \(\lambda\).
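One possible realization of such a search is sketched below in pure Python. The search criterion is an assumption, since it is not specified above: because \(J_{\bm{\theta}}(\lambda)\) is convex, the objective \(\lambda a - J_{\bm{\theta}}(\lambda)\) is concave, so the sketch bisects on the sign of its derivative \(a - J'_{\bm{\theta}}(\lambda)\), computed from the empirical estimate. The function names, the default range, and the toy log-probabilities are hypothetical.

```python
import math


def cumulant_and_slope(log_probs, lam):
    """Empirical cumulant J(lam) and its derivative dJ/dlam."""
    scaled = [lam * lp for lp in log_probs]
    shift = max(scaled)
    weights = [math.exp(s - shift) for s in scaled]  # proportional to p**lam
    norm = sum(weights)
    mean_lp = sum(log_probs) / len(log_probs)
    cumulant = shift + math.log(norm) - math.log(len(log_probs)) - lam * mean_lp
    tilted_mean = sum(w * lp for w, lp in zip(weights, log_probs)) / norm
    return cumulant, tilted_mean - mean_lp


def rate_estimate(log_probs, a, lam_min=1e-6, lam_max=1e3, iters=60):
    """Approximate sup over [lam_min, lam_max] of lam * a - J(lam) by bisection."""
    lo, hi = lam_min, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, slope = cumulant_and_slope(log_probs, mid)
        if slope < a:  # objective still increasing: move right
            lo = mid
        else:          # objective decreasing: move left
            hi = mid
    lam = 0.5 * (lo + hi)
    cumulant, _ = cumulant_and_slope(log_probs, lam)
    return lam * a - cumulant, lam


# Hypothetical validation log-probabilities.
toy_log_probs = [-0.1, -0.5, -2.0, -3.0]
for a in (0.5, 1.0):
    print(a, rate_estimate(toy_log_probs, a))
```

If the derivative never reaches \(a\) inside the range, the search converges to \(\lambda_{max}\) and the returned value is only a lower bound on the supremum; searching over \(\log \lambda\) instead is a straightforward variation.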
It is clear that computing the rate function is more complex than computing only the cumulant function (as the former requires the latter). In fact, the next result shows that it might not be necessary to compute the rate function, as the cumulant might be enough to characterize smoothness.
Proposition 5.44. If \(J_{\bm{\theta}}(\lambda)\leq J_{\bm{\theta}'}(\lambda)\) for all \(\lambda \geq 0\), then \(\mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a)\) for all \(a\geq 0\).
Proof. Direct consequence of \(\mathcal{I}_{\bm{\theta}}(a)\) being the Legendre transform of \(J_{\bm{\theta}}(\lambda)\): for every \(\lambda\), \(\lambda a - J_{\bm{\theta}}(\lambda) \geq \lambda a - J_{\bm{\theta}'}(\lambda)\), and taking the supremum over \(\lambda\) on both sides preserves the inequality. □
From this result, if a model \(\bm{\theta}\) has a lower cumulant function than another model \(\bm{\theta}'\), then \(\bm{\theta}\) is smoother than \(\bm{\theta}'\) and the preceding smoothness results apply. Figure 5.19 illustrates this case: plotting the cumulants is enough to determine which models are smoother.
Discussion and Limitations
This thesis has examined a growing body of evidence (Gastpar et al., 2024; Nagarajan & Kolter, 2019b; Wang et al., 2024; Zhang et al., 2017) indicating that bounds depending solely on the training sample are provably vacuous for over-parameterized model classes, motivating alternative approaches to understanding generalization in deep learning. In response, distribution-dependent bounds—those that depend explicitly on the data-generating distribution—were advocated. A distribution-dependent PAC–Chernoff bound (Theorem 5.17) was introduced and shown to be perfectly tight for any interpolator. Building on its complexity measure, a new notion of smoothness was proposed (Definition 5.21), providing a principled answer to the previously unresolved question of which interpolators generalize more effectively.
Section 5.3.5 demonstrated that PAC–Chernoff bounds capture the double-descent phenomenon, explaining how interpolators can achieve smaller generalization error even as the number of parameters increases.
Section 5.3.6 established that the complexity term of the PAC–Chernoff bound induces a regularizer that yields near-optimal performance for over-parameterized models interpolating the training data (Theorem 5.26). The remainder of the section showed that a wide array of existing regularizers act as proxies for this optimal construction; in some instances, refined variants were identified as effective surrogates for an interpolator’s generalization error. This analysis clarifies both the scope and the limitations of standard techniques such as parameter norms, distance from initialization, and input-gradient penalties.
Section 5.3.7 analyzed the impact of transformed inputs (Assumption 5.31), showing via the PAC–Chernoff bound in Equation \(\eqref{eq:chernoffbound:transformeddata}\) that interpolation becomes harder due to increased generalization error. The same framework, combined with additional results, explains why invariant architectures and data augmentation yield interpolators with smaller generalization error. The section concluded by unifying invariant architectures, data augmentation, and the explicit regularizers from Section 5.3.6.
Finally, Section 5.3.8 established over-parameterization as a necessary condition for achieving smooth interpolators, where smoothness can be characterized through parameter norms, Lipschitz constants, or the use of invariant architectures and data augmentation. This perspective reveals connections among approaches previously regarded as unrelated, and it clarifies how larger model classes can accommodate smooth interpolators with superior generalization.
In summary, distribution-dependent PAC–Chernoff bounds constitute a powerful framework for understanding—and improving—the generalization behavior of (over-parameterized) interpolators.
Discussion of Limitations
A principal limitation of this study is the assumption of a finite model class. It is anticipated that this restriction can be relaxed by leveraging recently proposed PAC–Bayes–Chernoff bounds (Casado et al., 2024), which are likewise distribution-dependent and rely on an analogous rate function. This suggests that the results may extend to infinite model classes, although a full development lies beyond the present scope. Additionally, the analysis does not address the role of stochastic gradient descent (SGD) in identifying interpolators with minimal generalization error.
A second limitation concerns the absence of explicit links between the smoothness notion introduced here and commonly used notions in the literature. Examples include characterizations via the largest Hessian eigenvalue (Nesterov, 2003), loss variation within a perturbation radius (Keskar et al., 2017), and Lipschitz continuity of the loss (Shalev-Shwartz & Ben-David, 2014). While the concept of smoothness adopted in this work is grounded in the PAC–Chernoff framework through its distribution-dependent complexity term, future research could investigate alignments and divergences with these established definitions. Such connections could clarify how the present framework complements or generalizes existing theory. The focus here has been on the implications of the proposed definition for generalization, leaving broader comparisons to subsequent work; this underscores an opportunity for follow-up studies to bridge the present notion of smoothness with the wider landscape of smoothness concepts in machine learning theory.
The Implicit Bias of Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) has become the standard method for training modern neural networks, enabling large-scale models that underpin many of today’s applications (Bottou, 2010). Beyond its efficiency as an optimization algorithm, SGD plays a central role in determining how well models generalize, particularly in highly overparameterized regimes where numerous parameter settings can achieve zero training error (Zhang et al., 2017). A long-standing observation is that SGD is not neutral among these solutions: it often converges to models with stronger generalization performance, an implicit bias that remains only partly understood.
One explanation attributes this bias to the randomness introduced by mini-batch sampling. The stochasticity of SGD has been linked to a tendency toward simpler or lower-complexity models, effectively acting as a form of implicit regularization during learning (Neyshabur et al., 2015b; Zou et al., 2021). Empirical studies even suggest that this implicit effect can outperform explicit regularizers in certain cases, such as linear regression (Zou et al., 2021). While such results highlight the importance of algorithmic regularization, existing theoretical approaches—often based on concentration inequalities or uniform convergence—tend to produce loose bounds or fail to capture the distributional nuances of deep models (Nagarajan & Kolter, 2019b). This motivates the exploration of alternative frameworks capable of describing the mechanisms behind SGD’s bias in greater detail.
In this thesis, the use of Large Deviation Theory (LDT) (Ellis, 2012; Touchette, 2009) is investigated as such a framework. A decomposition of the generalization error is derived, which consists of three parts: the expected loss, a concentration term that quantifies how tightly the empirical loss is distributed around its mean, and an abnormality term that reflects the rarity of observed deviations. This perspective reveals a sharp contrast between optimization methods: full-batch Gradient Descent (GD) is prone to models with poor concentration and abnormal deviations—explaining its overfitting behavior—while mini-batch SGD reduces both effects, favoring models that achieve lower generalization error. The main contributions of this thesis in this field of study can be summarized as follows:
The introduction of an LDT-based decomposition of the generalization error that distinguishes the roles of concentration and abnormal deviations.
An analysis showing that GD and SGD exhibit fundamentally different biases: GD tends to exploit poorly concentrated and abnormal empirical losses, while SGD avoids them.
The theory is validated using deep convolutional networks on CIFAR-10, showing that batch size and \(\ell_2\) regularization systematically influence concentration and abnormality in line with our predictions.
This approach represents an initial step in applying LDT to the study of SGD’s implicit bias, a central open problem in machine learning. LDT has long been used in physics, finance, and telecommunications to characterize fluctuations and rare events by quantifying deviations from typical behavior (Ellis, 2012; Touchette, 2009). Rather than offering a complete explanation of SGD’s implicit bias, this thesis shows that LDT provides a complementary perspective that clarifies why SGD tends to favor models with smaller generalization error.
Understanding the implicit regularization of SGD is of clear practical significance, as modern deep learning relies almost exclusively on SGD and its variants to train highly overparameterized models. Despite their ability to interpolate training data, these models often generalize well in practice, largely due to the biases induced by SGD. The proposed LDT-based perspective provides a principled explanation of this phenomenon that has the potential to guide the design of novel optimization strategies that could enhance reliability and robustness in real-world applications.
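Figure 5.20. Panels: (left) histograms of the empirical loss \(\hat{L}(D, \bm{\theta})\); (center) rate functions \(\mathcal{I}_{\bm{\theta}}(a)\); (right) distribution of \(\alpha(D, \bm{\theta})\).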
Preliminaries
As in the previous sections of this chapter, let \(D\) be an i.i.d. sample of size \(n\) from an unknown distribution \(\nu(\mathbf{y},\mathbf{x})\). For each \(\bm{\theta} \in \bm{\Theta}\), the loss function \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is positive, yielding an expected loss \(L(\bm{\theta}) = \mathbb{E}_\nu[\ell(\mathbf{y},\mathbf{x},\bm{\theta})]\) and an empirical loss \(\hat{L}(D,\bm{\theta}) = \tfrac{1}{n} \sum_{i=1}^n \ell(\mathbf{y}_i,\mathbf{x}_i,\bm{\theta})\). Since the dataset \(D\) is random, \(\hat{L}(D,\bm{\theta})\) fluctuates around \(L(\bm{\theta})\), and the tightness of this concentration depends on the model. Figure 5.20 (left) illustrates this behavior with histograms of empirical losses for three InceptionV3 models (Szegedy et al., 2016). The Initial model (with Kaiming initialization) shows empirical losses tightly concentrated around its mean \(\ln 10\), the \(\ell_2\)-regularized model exhibits similarly strong concentration, while the Standard model displays a wider spread.
Empirical risk minimization (ERM) seeks \(\min_{\bm{\theta}} \hat{L}(D,\bm{\theta})\). The main challenge is ensuring that \(\hat{L}(D,\bm{\theta})\) remains close to \(L(\bm{\theta})\) (i.e., small generalization error). Two factors shape the difference between \(L(\bm{\theta})\) and a realization of \(\hat{L}(D,\bm{\theta})\): (i) the level of concentration of the distribution of \(\hat{L}(D,\bm{\theta})\) around \(L(\bm{\theta})\) (a low empirical loss can arise from a high-mean model if the distribution of the empirical loss \(\hat{L}(D, \bm{\theta})\) is wide), and (ii) the level of abnormality, i.e., whether a small observed empirical loss stems from a rare left-tail outcome from the distribution of \(\hat{L}(D, \bm{\theta})\). Models whose empirical loss is both well-concentrated around its mean \(L(\bm{\theta})\) and is not the result of a rare left-tail realization will have smaller generalization error.
To formalize these two factors mathematically, we use the rate function, the central object in LDT, denoted by \(\mathcal{I}_{\bm{\theta}}(a):\mathbb{R} \to \mathbb{R}\) and defined as the Legendre transform of the cumulant generating function \(J_{\bm{\theta}}(\lambda):\mathbb{R} \to \mathbb{R}^+\). In this section, we introduce a signed version of the rate function and consider the cumulant generating function of the model’s centered loss. These two functions are defined as \begin{equation} J_{\bm{\theta}}(\lambda) :=\ln \mathbb{E}_{\nu}\left[e^{\lambda (L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))}\right]\,, \end{equation} and \begin{equation} \mathcal{I}_{\bm{\theta}}(a):=\text{sign}(a) \cdot \sup_{\lambda \in \mathbb{R}}\ \bigl(\lambda a - J_{\bm{\theta}}(\lambda)\bigr)\,, \end{equation} where the sign factor makes \(\mathcal{I}_{\bm{\theta}}(a)\) invertible in \(\mathbb{R}\). The cumulant \(J_{\bm{\theta}}(\lambda)\) is well-defined and non-negative, the signed rate \(\mathcal{I}_{\bm{\theta}}(a)\) is a strictly monotonic real-valued function, and they satisfy \(\mathcal{I}_{\bm{\theta}}(0)=0\) and \(J_{\bm{\theta}}(0)=0\) (Rockafellar, 1970).
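As a rough numerical illustration of these two definitions, the following Python sketch estimates \(J_{\bm{\theta}}(\lambda)\) and the signed rate function from a sample of per-example losses. The function names, the finite grid over \(\lambda\), and the synthetic Gaussian losses are assumptions made for this example only. Gaussian losses are used because their rate function is known in closed form, \(\text{sign}(a)\,a^2/(2\sigma^2)\), which offers a sanity check; they are not the positive losses assumed in the text.

```python
import numpy as np

def cumulant(losses, lambdas):
    """Monte Carlo estimate of J(lambda) = ln E[exp(lambda * (L - loss))]."""
    centered = losses.mean() - losses            # sample version of L(theta) - loss
    z = np.outer(lambdas, centered)              # shape (num_lambdas, num_samples)
    zmax = z.max(axis=1, keepdims=True)          # stabilise the log-mean-exp
    return (zmax + np.log(np.exp(z - zmax).mean(axis=1, keepdims=True))).ravel()

def signed_rate(a, losses, lambdas):
    """Grid approximation of sign(a) * sup_lambda (lambda * a - J(lambda))."""
    J = cumulant(losses, lambdas)
    return float(np.sign(a) * np.max(lambdas * a - J))

rng = np.random.default_rng(0)
losses = rng.normal(loc=2.0, scale=1.0, size=20_000)   # hypothetical per-example losses
lambdas = np.linspace(-3.0, 3.0, 201)                   # hypothetical search grid

rate_pos = signed_rate(0.5, losses, lambdas)    # closed form for this toy case: 0.125
rate_neg = signed_rate(-0.5, losses, lambdas)   # closed form for this toy case: -0.125
```

The grid search is the simplest possible stand-in for the supremum; a one-dimensional convex optimizer would be a natural replacement when higher accuracy is needed.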
Figure 5.20 (center) presents the rate functions for the three previously discussed InceptionV3 (Szegedy et al., 2016) neural networks. The rate functions clearly reflect the varying levels of concentration in the empirical losses, as depicted by the histograms in Figure 5.20 (left). The Initial model exhibits a prominent rate function, while the Standard model has a smaller rate function compared to the \(\ell_2\)-regularized model.
Gradient Descent
In this section, a novel decomposition of a model’s generalization error is introduced, formalizing the concept of abnormality in the generalization error, and demonstrating how full-batch GD is biased toward finding models with poorly concentrated empirical losses and whose realized empirical loss deviates abnormally from the expected loss.
Decomposing the empirical loss
The following result provides a decomposition of the empirical loss in terms of the expected loss \(L(\bm{\theta})\), the inverse of the signed rate function \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\), and a novel function \(\alpha: {\cal D} \times \bm{\Theta} \to \mathbb{R}\). As discussed in the next section, \(\alpha(D, \bm{\theta})\) quantifies the degree of abnormality of the observed generalization error \(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})\) for the model \(\bm{\theta}\).
Proposition 5.45. For any \(D\sim\nu^n\) and any \(\bm{\theta} \in \bm{\Theta}\), it holds that \begin{equation} \hat{L}(D, \bm{\theta}) = L(\bm{\theta}) - \mathcal{I}^{-1}_{\bm{\theta}}(\alpha(D, \bm{\theta}))\,, \end{equation} where \(\alpha:{\cal D}\times\bm{\Theta} \to \mathbb{R}\) is defined as \(\alpha(D, \bm{\theta}) := \mathcal{I}_{\bm{\theta}}(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}))\).
Proof
Direct consequence of the signed rate function being invertible in \(\mathbb{R}\). □
Although the decomposition above is technically straightforward, it separates the empirical loss into two distinct components with clear interpretations. The first component is the expected loss \(L(\bm{\theta})\). The second component, which captures the deviation of the observed empirical loss from its expectation, corresponds to the generalization error. Within this term, the function \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) characterizes the level of concentration of the distribution of \(\hat{L}(D, \bm{\theta})\) around its mean \(L(\bm{\theta})\). As shown in Section 5.3.2, models with a larger rate function \(\mathcal{I}_{\bm{\theta}}(\cdot)\) exhibit stronger concentration, and thus their inverse rate function \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) is smaller. In fact, a second-order Taylor expansion of the cumulant \(J_{\bm{\theta}}(\lambda)\) around \(\lambda = 0\) gives an approximation of \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) near \(s = 0\) that reveals a close connection with the standard deviation of the model loss, denoted by \(\sigma(\ell_{\bm{\theta}})\): \begin{equation} \label{eq:inverse_rate_variance} \mathcal{I}^{-1}_{\bm{\theta}}(s) \approx \text{sign}(s)\sqrt{2|s|}\,\sigma(\ell_{\bm{\theta}}) \quad \text{ where } \quad \sigma(\ell_{\bm{\theta}}) := \sqrt{\mathbb{E}_\nu\left[(\ell(\mathbf{y},\mathbf{x},\bm{\theta}) - L(\bm{\theta}))^2\right]}. \end{equation} Finally, note that both \(L(\bm{\theta})\) and \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) are deterministic; according to Proposition 5.45, all the randomness in \(\hat{L}(D, \bm{\theta})\) arises from the abnormality value \(\alpha(D, \bm{\theta})\). In the following section, the value of \(\alpha(D, \bm{\theta})\) is shown to represent the degree of abnormality in the magnitude of the generalization error: \(|\alpha(D, \bm{\theta})|\) is large when the observed \(\hat{L}(D, \bm{\theta})\) comes from the tails of the distribution of the empirical loss and small otherwise.
Since \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) increases monotonically with \(s\) (Rockafellar, 1970), a larger \(\alpha(D, \bm{\theta})\) value leads to a greater difference between \(L(\bm{\theta})\) and \(\hat{L}(D, \bm{\theta})\).
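The approximation in Equation \(\eqref{eq:inverse_rate_variance}\) can be recovered with a short calculation. The sketch below assumes that \(J_{\bm{\theta}}\) is twice differentiable at \(0\) and that terms beyond second order are negligible; the symbol \(\lambda^\star\) for the maximizer is introduced here only for illustration.

```latex
\begin{align}
J_{\bm{\theta}}(\lambda) &\approx \tfrac{1}{2}\,\sigma(\ell_{\bm{\theta}})^2\,\lambda^2
  && \text{(since } J_{\bm{\theta}}(0)=0 \text{ and } J'_{\bm{\theta}}(0)=0\text{)},\\
\sup_{\lambda}\bigl(\lambda a - \tfrac{1}{2}\sigma(\ell_{\bm{\theta}})^2\lambda^2\bigr)
  &= \frac{a^2}{2\,\sigma(\ell_{\bm{\theta}})^2}
  && \text{(attained at } \lambda^\star = a/\sigma(\ell_{\bm{\theta}})^2\text{)},\\
\mathcal{I}_{\bm{\theta}}(a) &\approx \text{sign}(a)\,\frac{a^2}{2\,\sigma(\ell_{\bm{\theta}})^2},\\
s = \mathcal{I}_{\bm{\theta}}(a) \;&\Longrightarrow\;
  a = \mathcal{I}^{-1}_{\bm{\theta}}(s) \approx \text{sign}(s)\sqrt{2|s|}\,\sigma(\ell_{\bm{\theta}}).
\end{align}
```

The first line uses \(J'_{\bm{\theta}}(0) = \mathbb{E}_\nu[L(\bm{\theta}) - \ell(\mathbf{y},\mathbf{x},\bm{\theta})] = 0\) and \(J''_{\bm{\theta}}(0) = \sigma(\ell_{\bm{\theta}})^2\), which follow directly from the definition of the cumulant of the centered loss.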
The Abnormality of the Generalization Error
A large gap between \(\hat{L}(D, \bm{\theta})\) and \(L(\bm{\theta})\) is considered highly unlikely, or abnormal, when the distribution of the empirical loss is tightly concentrated around its mean \(L(\bm{\theta})\), in which case \(\hat{L}(D, \bm{\theta})\) must originate from the tails of this distribution. In contrast, the same gap may not be abnormal for a model whose empirical loss is only weakly concentrated.
Using the approximation in Equation \(\eqref{eq:inverse_rate_variance}\), the following approximation for \(\alpha(D, \bm{\theta})\) can be derived: \begin{equation} \label{eq:rate_var_approx} \alpha(D, \bm{\theta}) \approx \frac{1}{2}\,\text{sign}\Big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})\Big)\left(\frac{L(\bm{\theta}) - \hat{L}(D, \bm{\theta})}{\sigma(\ell_{\bm{\theta}})}\right)^2\,. \end{equation} From this approximation, it is clear that large values of \(\alpha(D, \bm{\theta})\) correspond to situations where \(\hat{L}(D, \bm{\theta})\) lies abnormally far from its mean \(L(\bm{\theta})\). In general, Cramér’s Theorem (Theorem 5.16) shows that \(\alpha(D, \bm{\theta})\) asymptotically equals the negative normalized log-probability of observing a generalization error at least as large as \(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})\): \begin{equation} \label{eq:abnormality} \alpha(D, \bm{\theta}) \asymp -\frac{1}{n}\ln \mathbb{P}_{S \sim \nu^n}\Big(L(\bm{\theta})-\hat L(S,\bm{\theta})\geq L(\bm{\theta})-\hat{L}(D, \bm{\theta})\Big)\,. \end{equation} Given a fixed dataset \(D\), the abnormality rate \(\alpha(D, \bm{\theta})\) thus quantifies, on a negative log scale normalized by \(n\), the probability that, for another dataset \(S\), the generalization error is at least as large as the one observed with \(D\). When comparing two models, \(\bm{\theta}\) and \(\bm{\theta}'\), if \(\alpha(D, \bm{\theta}) \geq \alpha(D, \bm{\theta}')\), then the likelihood of observing a generalization error at least as large as that of \(D\), for another dataset \(S \sim \nu^n\), is lower for \(\bm{\theta}\) than for \(\bm{\theta}'\). In this case, the observed generalization error for \(D\) was more abnormal under \(\bm{\theta}\) than under \(\bm{\theta}'\).
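To make the second-order approximation of \(\alpha(D, \bm{\theta})\) concrete, the following Python sketch evaluates it for one set of made-up numbers and checks that plugging the result into the matching approximation of \(\mathcal{I}^{-1}_{\bm{\theta}}\) recovers the empirical loss, as Proposition 5.45 requires. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math

def approx_abnormality(expected_loss, empirical_loss, sigma):
    """Second-order approximation: 0.5 * sign(gap) * (gap / sigma)**2."""
    gap = expected_loss - empirical_loss
    return 0.5 * math.copysign(1.0, gap) * (gap / sigma) ** 2

def approx_inverse_rate(s, sigma):
    """Matching approximation of the inverse signed rate: sign(s) * sqrt(2|s|) * sigma."""
    return math.copysign(1.0, s) * math.sqrt(2.0 * abs(s)) * sigma

# Hypothetical values: expected loss 1.0, empirical loss 0.8, loss std 0.5.
alpha = approx_abnormality(1.0, 0.8, 0.5)
reconstructed = 1.0 - approx_inverse_rate(alpha, 0.5)   # recovers the empirical loss
```

With these values the gap is \(0.2\), so \(\alpha \approx \tfrac{1}{2}(0.2/0.5)^2 = 0.08\); swapping the two losses flips the sign of \(\alpha\).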
The following result shows that \(\alpha(D, \bm{\theta})\), as a random variable over \(D \sim \nu^n\), is closely related to an exponential distribution with rate \(n\):
Theorem 5.46. For any \(\bm{\theta} \in \bm{\Theta}\), \(n>0\), and \(D \sim \nu^n\), the cumulative distribution of \(\alpha(D, \bm{\theta})\) satisfies \begin{align} &\forall s > 0 \quad \mathbb{P}_{D\sim \nu^n}\big(\alpha(D, \bm{\theta}) \geq s\big)\leq e^{-n |s|}\,,\\ &\forall s < 0 \quad \mathbb{P}_{D\sim \nu^n}\big(\alpha(D, \bm{\theta}) \leq s\big)\leq e^{-n |s|}\,, \end{align} and both inequalities are asymptotically tight, \begin{align} &\forall s > 0 \quad \mathbb{P}_{D\sim \nu^n}(\alpha(D, \bm{\theta})\geq s) \asymp e^{- n |s|}\,,\\ &\forall s < 0 \quad \mathbb{P}_{D\sim \nu^n}(\alpha(D, \bm{\theta})\leq s) \asymp e^{- n|s|}\,. \end{align}
Proof
From Chernoff’s bound, it holds that \begin{align} &\forall a \geq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq a\big)\leq e^{-n |\mathcal{I}_{\bm{\theta}}(a)|},\\ &\forall a \leq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \leq a\big)\leq e^{-n |\mathcal{I}_{\bm{\theta}}(a)|}\,. \end{align} As a result, for any value of \(s \in \mathbb{R}\), taking \(a = \mathcal{I}^{-1}_{\bm{\theta}}(s)\), we obtain \begin{align} &\forall s \geq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(s)\big)\leq e^{-n |s|},\\ &\forall s \leq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \leq \mathcal{I}^{-1}_{\bm{\theta}}(s)\big)\leq e^{-n |s|}\,. \end{align} As \(\mathcal{I}_{\bm{\theta}}(\cdot)\) is a strictly increasing function, it can be applied to both sides of the inequality inside the probability, which gives \begin{align} &\forall s \geq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(\mathcal{I}_{\bm{\theta}}(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})) \geq s\big)\leq e^{-n |s|},\\ &\forall s \leq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(\mathcal{I}_{\bm{\theta}}(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})) \leq s\big)\leq e^{-n |s|}\,. \end{align} The asymptotic equalities can be obtained by applying the same reasoning to Equation \(\eqref{eq:asympoticEquality}\), which is a direct consequence of Cramér’s Theorem. □
Theorem 5.46 shows that the tails of the distribution of \(\alpha(D, \bm{\theta})\) are never heavier than those of an exponential distribution with rate \(n\), denoted by \(\mathrm{Exp}(n)\). Importantly, this property holds regardless of the model or the data-generating distribution. This insight makes it possible to quantify the degree of abnormality in a model’s generalization error by locating the corresponding \(\alpha(D, \bm{\theta})\) value within the tail of an exponential distribution. For example, for a dataset of size \(50\,000\), the event \(\alpha(D, \bm{\theta}) \geq \tfrac{1}{50\,000} \ln \tfrac{1}{0.01} \approx 9.2 \times 10^{-5}\) has probability at most \(1\%\). This provides a universal cut-off, valid for any model and any data-generating distribution.
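The universal cut-off can be computed for any dataset size and tail probability by solving \(e^{-ns} = p\) for \(s\). The following Python sketch does this for the example in the text; the function name is an assumption made for illustration.

```python
import math

def abnormality_cutoff(n, tail_probability):
    """Smallest s with exp(-n * s) <= tail_probability, i.e. s = ln(1/p) / n."""
    return math.log(1.0 / tail_probability) / n

cutoff = abnormality_cutoff(50_000, 0.01)   # about 9.2e-5
```

Because the bound in Theorem 5.46 does not depend on the model, the same cut-off can be reused across architectures trained on datasets of the same size.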
The second result in Theorem 5.46 shows that, for large datasets, \(\alpha(D, \bm{\theta})\) approximately follows a zero-centered double-exponential distribution, or Laplace distribution, regardless of the model or the data-generating distribution. This indicates that for large datasets, the stochasticity associated with \(\hat{L}(D, \bm{\theta})\) can be effectively represented by a Laplace distribution, independently of the model family or the underlying data-generating process. Figure 5.20 (right) illustrates this point with surprising accuracy. The figure shows how the empirical distribution of \(\alpha(D, \bm{\theta})\) for three very different InceptionV3 models trained on CIFAR-10, where \(D \sim \nu^{50}\), closely resembles a double-exponential or Laplace distribution, even with such a small \(n\) value. In conclusion, the distribution of empirical loss for large \(n\) values can be expressed as: \begin{equation} \hat{L}(D, \bm{\theta}) \approx L(\bm{\theta}) - \mathcal{I}^{-1}_{\bm{\theta}}(s),\quad s\sim \text{Laplace}(0,n^{-1})\,, \end{equation} where the Laplace distribution is parameterized using its location \(0\) and scale \(n^{-1}\). The above equation resembles the reparameterization of a Gaussian distribution, particularly when considering the approximation given in Equation \(\eqref{eq:inverse_rate_variance}\). This perspective highlights a novel asymptotic approximation of the generalization error offered by LDT (Ellis, 2012), which, at first sight, differs from the one provided by the Central Limit Theorem.
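The tail bound of Theorem 5.46 can be checked numerically in a toy setting. The Python sketch below draws many synthetic datasets of Gaussian per-example losses, for which the signed rate function is exactly \(\text{sign}(a)\,a^2/(2\sigma^2)\), and compares the empirical tail of \(\alpha(D, \bm{\theta})\) with \(e^{-ns}\). The Gaussian losses, the sample sizes, and the thresholds are assumptions chosen for this illustration and are unrelated to the InceptionV3 experiments; Gaussian losses can also be negative, unlike the positive losses assumed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_datasets = 50, 20_000
mu, sigma = 2.0, 1.0                      # hypothetical expected loss and loss std

# Each row plays the role of one dataset D ~ nu^n of per-example losses.
losses = rng.normal(mu, sigma, size=(num_datasets, n))
gap = mu - losses.mean(axis=1)            # L(theta) - L_hat(D, theta)
alpha = np.sign(gap) * gap**2 / (2.0 * sigma**2)   # exact signed rate for Gaussian losses

thresholds = np.array([0.02, 0.05, 0.10])
empirical_tail = np.array([(alpha >= s).mean() for s in thresholds])
chernoff_bound = np.exp(-n * thresholds)
```

In this setting every entry of `empirical_tail` stays below the corresponding entry of `chernoff_bound`, as the non-asymptotic part of the theorem predicts.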
Analysis of the Implicit Biases of Gradient Descent
Proposition 5.45 shows that the empirical loss \(\hat{L}(D, \bm{\theta})\) can be expressed as \(\hat{L}(D, \bm{\theta}) = L(\bm{\theta}) - \mathcal{I}^{-1}_{\bm{\theta}}(\alpha(D, \bm{\theta}))\). Minimizing \(\hat{L}(D, \bm{\theta})\) requires balancing two competing objectives. The first is reducing \(L(\bm{\theta})\), since models with lower expected loss naturally tend to achieve lower empirical loss. The second is increasing \(\mathcal{I}^{-1}_{\bm{\theta}}(\alpha(D, \bm{\theta}))\), which, due to the strict monotonicity of \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\), involves two aspects. One is maximizing \(\alpha(D, \bm{\theta})\), which promotes more abnormal left-tail realizations, leading to empirical losses that deviate more significantly from their mean. The other is maximizing \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\) for a given \(\alpha(D, \bm{\theta})\), which implicitly favors models whose empirical loss distributions are less concentrated around \(L(\bm{\theta})\)—a larger \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\) implies a lower \(\mathcal{I}_{\bm{\theta}}(\cdot)\) which corresponds to a less concentrated distribution.
Although it could be expected that an empirical-loss minimizer focuses solely on achieving a small \(L(\bm{\theta})\), Proposition 5.45 highlights a competing incentive: a model whose empirical loss distribution is more spread out can “luck into” a dataset \(D\) in the left tail, yielding a small \(\hat{L}(D, \bm{\theta})\). In practice, minimizing \(\hat{L}(D, \bm{\theta})\) by using GD will lean toward models that (i) reduce \(L(\bm{\theta})\) and (ii) inflate \(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})\). Due to the second effect, the models minimizing \(\hat{L}(D, \bm{\theta})\) often exhibit significant generalization error.
Figure 5.21. Panels: (top left) \(\hat{L}(D, \bm{\theta})\) and \(L(\bm{\theta})\); (top right) \(\sigma(\ell_{\bm{\theta}})\); (bottom) \(\alpha(D, \bm{\theta})\).
Figure 5.21 illustrates these dynamics, where GD is approximated by using SGD with an exceptionally large batch size of \(5\,000\). In Figure 5.21 (top left), \(\hat{L}(D, \bm{\theta})\) decreases monotonically, while \(L(\bm{\theta})\) initially decreases but later rises slightly. Figure 5.21 (top right) shows the standard deviation \(\sigma(\ell_{\bm{\theta}})\) of the loss, which governs how concentrated \(\hat{L}(D, \bm{\theta})\) is around \(L(\bm{\theta})\). Figure 5.21 (bottom) presents the abnormality rate \(\alpha(D, \bm{\theta})\), which steadily grows. Notably, GD leads to models where \(\hat{L}(D, \bm{\theta})\) deviates significantly from \(L(\bm{\theta})\), and a deviation of this size is a highly unlikely event. Using Theorem 5.46, the probability of observing \(\alpha(D, \bm{\theta}) \geq 0.7\) with \(n = 50\,000\) is at most \(e^{-50\,000 \cdot 0.7} \approx 10^{-15\,200}\), an astronomically small value. This underscores how GD explores a vast space of model realizations, some exhibiting extreme deviations from \(L(\bm{\theta})\).
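The size of the bound from Theorem 5.46 for \(\alpha(D, \bm{\theta}) = 0.7\) and \(n = 50\,000\) can be evaluated in log space, which avoids floating-point underflow since \(e^{-35\,000}\) is far below the smallest representable double. The following Python sketch is a minimal illustration; the function name is hypothetical.

```python
import math

def log10_tail_bound(n, s):
    """Base-10 logarithm of the bound exp(-n * |s|) from Theorem 5.46."""
    return -n * abs(s) / math.log(10.0)

exponent = log10_tail_bound(50_000, 0.7)   # roughly -15200
```

The bound therefore has on the order of fifteen thousand leading zeros after the decimal point.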
Figure 5.21 also depicts the dynamics of SGD for small mini-batches, revealing a distinct optimization behavior. In contrast to full-batch GD, SGD converges to models that exhibit a different trade-off: the concentration of the distribution of \(\hat{L}(D, \bm{\theta})\) is higher, while the abnormality rate \(\alpha(D, \bm{\theta})\) remains lower. This results in models with better generalization error, as the empirical loss aligns more closely with the expected loss.
Stochastic Gradient Descent
SGD operates by selecting mini-batches \(B \subseteq D\) of size \(m\) from the training dataset without replacement. Informally, this process is denoted as \(B \sim D\), representing the distribution over mini-batches of size \(m\) sampled without replacement from \(D\).
The full empirical loss \(\hat{L}(D, \bm{\theta})\) can be expressed as the expected empirical loss over all mini-batches \(B\) sampled from \(D\), given by: \begin{equation} \label{eq:LhatMiniBatches} \hat{L}(D, \bm{\theta}) = \mathbb{E}_{B \sim D} \bigl[\hat{L}(B, \bm{\theta})\bigr], \end{equation} where \(\hat{L}(B, \bm{\theta})\) denotes the empirical loss computed only on the samples within the batch \(B\).
SGD minimizes this expected empirical loss \(\mathbb{E}_{B \sim D} \bigl[\hat{L}(B, \bm{\theta})\bigr]\) by iteratively updating parameters using noisy gradient estimates. These estimates are unbiased approximations of the true gradient of \(\mathbb{E}_{B \sim D} \bigl[\hat{L}(B, \bm{\theta})\bigr]\). Since this expectation is equal to \(\hat{L}(D, \bm{\theta})\), SGD effectively minimizes the same empirical loss as full-batch GD.
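The identity between the full empirical loss and the average of the mini-batch losses can be verified on synthetic data. The Python sketch below splits one shuffled epoch into disjoint batches, which reproduces the identity exactly when \(m\) divides \(n\); the sizes and the exponentially distributed losses are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1_000, 50                             # hypothetical dataset and batch sizes
losses = rng.exponential(scale=1.0, size=n)  # hypothetical positive per-example losses

# One epoch: shuffle, then split into n // m disjoint mini-batches.
batches = rng.permutation(losses).reshape(n // m, m)
batch_losses = batches.mean(axis=1)          # L_hat(B, theta) for each batch

full_loss = losses.mean()                    # L_hat(D, theta)
mean_of_batches = batch_losses.mean()        # average of L_hat(B, theta) over the epoch
```

The individual entries of `batch_losses` fluctuate around `full_loss`; this spread is the source of the noise in the gradient estimates used by SGD.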
By Proposition 5.45, each batch loss \(\hat{L}(B, \bm{\theta})\) also admits a decomposition in terms of its expected loss \(L(\bm{\theta})\), the inverse rate function, and an “abnormality” term: \begin{equation} \label{eq:MiniBatchDecomp} \hat{L}(B, \bm{\theta}) \;=\; L(\bm{\theta}) \;-\; \mathcal{I}^{-1}_{\bm{\theta}}(\alpha\bigl(B,\bm{\theta}\bigr)). \end{equation} As discussed in Section 5.4.2.2, \(\alpha\bigl(B,\bm{\theta}\bigr)\) again measures the degree to which the empirical loss on \(B\) abnormally deviates from \(L(\bm{\theta})\). Taking expectations over mini-batches by combining Equations \(\eqref{eq:LhatMiniBatches}\) and \(\eqref{eq:MiniBatchDecomp}\) yields \begin{align} \hat{L}(D, \bm{\theta}) \;=\; L(\bm{\theta}) \;-\; \mathbb{E}_{B \sim D}\Bigl[\mathcal{I}^{-1}_{\bm{\theta}}(\alpha\bigl(B,\bm{\theta}\bigr))\Bigr]\label{eq:batch_decomposition}. \end{align} To gain further insight, let \(Q\) denote the distribution over the \(\alpha\) values induced by mini-batches, where \(\alpha_{\bm{\theta}}^B\) compactly denotes \(\alpha\bigl(B,\bm{\theta}\bigr)\): \begin{equation} \label{eq:taylorApprox} \alpha_{\bm{\theta}}^B \sim Q\bigl(\alpha_{\bm{\theta}}^B | D, \bm{\theta}\bigr)\,. \end{equation} In other words, \(Q(\alpha_{\bm{\theta}}^B | D, \bm{\theta})\) represents the distribution of the abnormality scores \(\alpha\) that arises when sampling mini-batches \(B\) from \(D\). With this notation, using Equation \(\eqref{eq:batch_decomposition}\), we can rewrite \(\hat{L}(D, \bm{\theta})\) as \begin{equation} \label{eq:LhatReparam} \hat{L}(D, \bm{\theta}) = L(\bm{\theta}) - \mathbb{E}_{\alpha_{\bm{\theta}}^B \sim Q(\cdot \,|\, D,\bm{\theta})} \Bigl[\mathcal{I}^{-1}_{\bm{\theta}}(\alpha_{\bm{\theta}}^B)\Bigr]. \end{equation} Equation \(\eqref{eq:LhatReparam}\) starts to reveal how the learning objective in SGD differs from the full-batch GD objective.
While both objectives aim to minimize \(\hat{L}(D, \bm{\theta})\), the dynamics of SGD are shaped by the distribution \(Q(\alpha^B_{\bm{\theta}} | D, \bm{\theta})\), which is directly influenced by the variability in batch sampling. Extending the trade-offs discussed in Section 5.4.2.3 to this case, minimizing Equation \(\eqref{eq:LhatReparam}\) involves (i) reducing the expected loss \(L(\bm{\theta})\), (ii) maximizing \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\), and (iii) favoring models where \(Q(\alpha^B_{\bm{\theta}} | D, \bm{\theta})\) is concentrated around large \(\alpha_{\bm{\theta}}^B\) values. Points (i) and (ii) were already illustrated in Figure 5.21, and Figure 5.22 (top left) illustrates point (iii) by plotting the evolution of the mean of \(Q(\alpha^B_{\bm{\theta}} | D, \bm{\theta})\), denoted \(\mu\bigl(\alpha_{\bm{\theta}}^B\bigr):=\mathbb{E}_{Q(\cdot | D,\bm{\theta})}[\alpha_{\bm{\theta}}^B]\). Importantly, the distribution \(Q(\alpha^B_{\bm{\theta}} | D, \bm{\theta})\) depends on the batch size \(m\), which, as shown later, introduces different biases in the optimization dynamics.
Figure 5.22. Panels: (top left) \(\mu(\alpha^B_{\bm{\theta}})\); (top right) \(\mathbb{V}\bigl(\alpha_{\bm{\theta}}^B\bigr)\); (bottom) \(\phi(\alpha_{\bm{\theta}}^B)\).
To gain deeper insight into these new biases in the optimization dynamics of SGD, \(\mathcal{I}_{\bm{\theta}}^{-1}(\alpha_{\bm{\theta}}^B)\) is approximated by employing a second-order Taylor expansion of \(\mathcal{I}^{-1}_{\bm{\theta}}(\alpha)\) around the mean \(\mu(\alpha_{\bm{\theta}}^B)\). This yields: \begin{equation} \label{eq:IinvTaylor} \begin{aligned} &\mathbb{E}_{\alpha_{\bm{\theta}}^B\sim Q(\cdot \,|\, D, \bm{\theta})} \bigl[\mathcal{I}_{\bm{\theta}}^{-1}(\alpha_{\bm{\theta}}^B)\bigr] \\ &\quad\approx\; \mathcal{I}_{\bm{\theta}}^{-1}\bigl(\mu(\alpha_{\bm{\theta}}^B)\bigr) \;+\; \tfrac{1}{2} \,\nabla_{\alpha}^2 \,\mathcal{I}_{\bm{\theta}}^{-1}(\alpha) \,\bigl\rvert_{\alpha = \mu(\alpha_{\bm{\theta}}^B)} \,\mathbb{V}\bigl(\alpha_{\bm{\theta}}^B\bigr). \end{aligned} \end{equation} where \(\mathbb{V}(\alpha_{\bm{\theta}}^B)\) denotes the variance of \(\alpha_{\bm{\theta}}^B\), defined as \(\mathbb{V}(\alpha_{\bm{\theta}}^B):=\mathbb{E}_{B\sim D}[(\alpha_{\bm{\theta}}^B - \mu(\alpha_{\bm{\theta}}^B))^2]\).
From standard properties of the rate function (Rockafellar, 1970), \(\mathcal{I}_{\bm{\theta}}(a)\) is convex for \(a \geq 0\). As a result, its inverse, \(\mathcal{I}^{-1}_{\bm{\theta}}(\alpha)\), is concave for \(\alpha \geq 0\), which implies \(\nabla_\alpha^2 \mathcal{I}^{-1}_{\bm{\theta}}(\alpha) \leq0\) for \(\alpha > 0\). Using the approximation in Equation \(\eqref{eq:inverse_rate_variance}\), when \(\mu(\alpha_{\bm{\theta}}^B) > 0\): \begin{equation} \label{eq:SecondDerivativeApprox} \nabla_\alpha^2 \mathcal{I}^{-1}_{\bm{\theta}}(\alpha)\Bigl\rvert_{\alpha = \mu(\alpha_{\bm{\theta}}^B)} \;\approx\; -\,\bigl(2\,\mu(\alpha_{\bm{\theta}}^B)\bigr)^{-\tfrac{3}{2}} \,\sigma\bigl(\ell_{\bm{\theta}}\bigr), \end{equation} where \(\sigma\bigl(\ell_{\bm{\theta}}\bigr)\) is the standard deviation of the individual losses \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) under the data-generating distribution \(\nu\). Note that the condition \(\mu(\alpha_{\bm{\theta}}^B) > 0\) is not restrictive, as SGD’s bias towards large \(\alpha_{\bm{\theta}}^B\) values ensures this condition after a few iterations, as empirically shown in Figure 5.22 (top left). Substituting Equation \(\eqref{eq:SecondDerivativeApprox}\) into Equation \(\eqref{eq:IinvTaylor}\) and then back into Equation \(\eqref{eq:LhatReparam}\) yields \begin{equation} \label{eq:batch_decomposition:bias} \hat{L}(D, \bm{\theta}) \approx L(\bm{\theta}) \;-\; \mathcal{I}^{-1}_{\bm{\theta}}(\mu\bigl(\alpha_{\bm{\theta}}^B\bigr))\;+\; \frac{0.5\mathbb{V}\bigl(\alpha_{\bm{\theta}}^B\bigr)}{\bigl(2\,\mu(\alpha_{\bm{\theta}}^B)\bigr)^{3/2}} \sigma\bigl(\ell_{\bm{\theta}}\bigr)\,. \end{equation} Equations \(\eqref{eq:LhatReparam}\) and \(\eqref{eq:batch_decomposition:bias}\) further reveal how the SGD objective differs from the full-batch GD objective.
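The effect of the variance correction can be checked numerically. The Python sketch below assumes the approximation \(\mathcal{I}^{-1}_{\bm{\theta}}(\alpha) \approx \sqrt{2\alpha}\,\sigma(\ell_{\bm{\theta}})\) for \(\alpha > 0\) and a hypothetical \(Q\) that is uniform on \([0.4, 0.6]\). It compares the exact expectation of the inverse rate with the first-order value \(\mathcal{I}^{-1}_{\bm{\theta}}(\mu)\) and with the second-order value that includes the variance term. All names and numbers are illustrative assumptions.

```python
import numpy as np

def inv_rate(alpha, sigma):
    """Approximate inverse rate for alpha > 0: sqrt(2 * alpha) * sigma."""
    return np.sqrt(2.0 * alpha) * sigma

sigma = 1.0
# Hypothetical Q: uniform on [0.4, 0.6], represented by a fine grid of midpoints.
edges = np.linspace(0.4, 0.6, 100_001)
alphas = 0.5 * (edges[:-1] + edges[1:])
mu, var = alphas.mean(), alphas.var()

exact = inv_rate(alphas, sigma).mean()                   # E_Q[I^{-1}(alpha)]
first_order = inv_rate(mu, sigma)                        # drops the variance term
second_order = first_order - 0.5 * var * (2.0 * mu) ** -1.5 * sigma

err_first = abs(exact - first_order)
err_second = abs(exact - second_order)
```

In this example `err_second` is much smaller than `err_first`, and `exact` lies below `first_order`, in line with the concavity of the inverse rate for positive arguments.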
When the batch size \(m\) equals the training-set size \(n\), the distribution \(Q(\alpha^B_{\bm{\theta}} | D, \bm{\theta})\) collapses to a Dirac delta at \(\alpha(D, \bm{\theta})\), since all sampled batches \(B\sim D\) coincide with \(D\). Consequently, \(\mathbb{V}(\alpha^B_{\bm{\theta}})\) vanishes. For smaller batch sizes, however, \(\mathbb{V}(\alpha^B_{\bm{\theta}})\) remains nonzero due to the intrinsic randomness of mini-batch sampling. As illustrated in Figure 5.22 (top right), this variance is notably higher for smaller batches.
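The dependence of \(\mathbb{V}(\alpha^B_{\bm{\theta}})\) on the batch size can be reproduced in a small simulation. The Python sketch below uses the second-order approximation of \(\alpha\) from Equation \(\eqref{eq:rate_var_approx}\) on synthetic exponentially distributed losses, whose mean and standard deviation are both \(1\) and are treated as known. The data, the batch sizes, and the function name are assumptions for illustration and do not correspond to the experiments in Figure 5.22.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
losses = rng.exponential(scale=1.0, size=n)   # hypothetical per-example losses
L, sigma = 1.0, 1.0                            # mean and std of Exp(1), assumed known

def alpha_variance(m, num_batches=2_000):
    """Variance over mini-batches of the approximate abnormality alpha(B, theta)."""
    alphas = np.empty(num_batches)
    for i in range(num_batches):
        batch = rng.choice(losses, size=m, replace=False)   # B ~ D, without replacement
        gap = L - batch.mean()
        alphas[i] = 0.5 * np.sign(gap) * (gap / sigma) ** 2
    return alphas.var()

var_small = alpha_variance(m=50)
var_large = alpha_variance(m=500)
var_full = alpha_variance(m=n, num_batches=10)   # every batch equals D
```

The variance shrinks as \(m\) grows and is zero up to floating-point rounding when \(m = n\), matching the collapse of \(Q\) to a Dirac delta.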
Proposition 5.47.
Proof
Analysis of the Implicit Biases in SGD
Prior to analyzing the implicit biases of SGD, Equation \(\eqref{eq:batch_decomposition:bias}\) is expressed via the approximation in Equation \(\eqref{eq:inverse_rate_variance}\), yielding the following, more interpretable, decomposition of \(\hat{L}(D,\bm{\theta})\): \begin{equation} \label{eq:batch_decomposition:bias2} \hat{L}(D,\bm{\theta}) \approx L(\bm{\theta}) - \underbrace{\left( \sqrt{2\,\mu\bigl(\alpha_{\bm{\theta}}^{B}\bigr)} -\frac{0.5\,\mathbb{V}\bigl(\alpha_{\bm{\theta}}^{B}\bigr)} {\bigl(2\,\mu(\alpha_{\bm{\theta}}^{B})\bigr)^{3/2}} \right)}_{\displaystyle \phi(\alpha_{\bm{\theta}}^{B})} \,\sigma\bigl(\ell_{\bm{\theta}}\bigr). \end{equation} Here, \(\phi(\alpha_{\bm{\theta}}^{B})\) aggregates the effects of the mean and variance of \(\alpha_{\bm{\theta}}^{B}\) on the bias term in the empirical–loss decomposition; specifically, \(\phi\) increases monotonically with \(\mu(\alpha_{\bm{\theta}}^{B})\) and decreases with \(\mathbb{V}(\alpha_{\bm{\theta}}^{B})\). Empirical support for the accuracy of Equations \(\eqref{eq:batch_decomposition:bias}\) and \(\eqref{eq:batch_decomposition:bias2}\) is provided in Figure 5.23.
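The behavior of \(\phi(\alpha_{\bm{\theta}}^{B})\) can be explored with a few lines of Python. The sketch below implements the expression under the brace in Equation \(\eqref{eq:batch_decomposition:bias2}\) for \(\mu(\alpha_{\bm{\theta}}^{B}) > 0\); the function names and the numerical inputs are hypothetical and chosen only to show the direction of each effect.

```python
import math

def phi(mu, var):
    """Weight phi from the decomposition: sqrt(2 mu) - 0.5 * var / (2 mu)**1.5, for mu > 0."""
    return math.sqrt(2.0 * mu) - 0.5 * var / (2.0 * mu) ** 1.5

def approx_empirical_loss(expected_loss, mu, var, sigma):
    """Approximate L_hat(D, theta) as L(theta) - phi * sigma."""
    return expected_loss - phi(mu, var) * sigma

# Hypothetical values chosen only to show the direction of each effect.
phi_full_batch = phi(mu=0.5, var=0.0)     # variance vanishes when m = n
phi_small_batch = phi(mu=0.5, var=0.2)    # larger variance lowers phi
phi_higher_mean = phi(mu=0.8, var=0.2)    # larger mean raises phi
```

With zero variance the expression reduces to \(\sqrt{2\mu}\), which recovers the full-batch decomposition under the approximation in Equation \(\eqref{eq:inverse_rate_variance}\).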
Figure 5.23. Panels: \(\hat{L}(D, \bm{\theta})\) vs Equation \(\eqref{eq:batch_decomposition:bias}\); \(\hat{L}(D, \bm{\theta})\) vs Equation \(\eqref{eq:batch_decomposition:bias2}\).
Consider the minimization of composite objectives of the form \begin{equation} \label{eq:biasedoptimization} f(\bm{\theta}) \;=\; g(\bm{\theta}) \;+\; \gamma\,h(\bm{\theta}), \end{equation} with \(\gamma>0\). Optimization entails a trade-off between the primary term \(g(\bm{\theta})\) and the bias term \(h(\bm{\theta})\) (Boyd & Vandenberghe, 2004). Smaller values of \(\gamma\) emphasize \(g\), whereas larger values increase the influence of \(h\), shifting solutions toward minimizers of \(h\). The weight \(\gamma\) may depend on \(\bm{\theta}\).
Equation \(\eqref{eq:batch_decomposition:bias2}\) can be mapped to Equation \(\eqref{eq:biasedoptimization}\) by identifying \(g(\bm{\theta})=L(\bm{\theta})\), \(h(\bm{\theta})=-\sigma(\ell_{\bm{\theta}})\), and \(\gamma=\phi(\alpha_{\bm{\theta}}^{B})\), where \(\gamma\) depends on \(\bm{\theta}\) and on the batch size \(m\). Because \(\sigma(\ell_{\bm{\theta}})\) enters the objective with a negative sign, a larger value of \(\phi(\alpha_{\bm{\theta}}^{B})\) strengthens the incentive to inflate \(\sigma(\ell_{\bm{\theta}})\), whereas a smaller value weakens it. Mini-batch stochasticity increases \(\mathbb{V}(\alpha_{\bm{\theta}}^{B})\) for smaller \(m\) (Figure 5.22, top right); since \(\phi\) decreases with \(\mathbb{V}(\alpha_{\bm{\theta}}^{B})\), smaller batches yield smaller \(\phi(\alpha_{\bm{\theta}}^{B})\) and hence a weaker pull toward poorly concentrated models. This is consistent with Figure 5.21 (top right), where smaller batch sizes are associated with models exhibiting lower \(\sigma(\ell_{\bm{\theta}})\), and with Figure 5.22 (bottom), which shows \(\phi(\alpha_{\bm{\theta}}^{B})\) decreasing as \(m\) decreases. The analysis therefore indicates a bias of SGD toward models whose empirical losses are more concentrated.
Equation \(\eqref{eq:batch_decomposition:bias2}\) can also be cast into the template in Equation \(\eqref{eq:biasedoptimization}\) by taking \(g(\bm{\theta})=L(\bm{\theta})\), \(h(\bm{\theta})=-\phi(\alpha_{\bm{\theta}}^{B})\), and \(\gamma=\sigma(\ell_{\bm{\theta}})\). In this view, larger values of \(\mu(\alpha_{\bm{\theta}}^{B})\) increase \(\phi(\alpha_{\bm{\theta}}^{B})\) and thus lower the objective, thereby favoring models for which the distribution \(Q(\alpha_{\bm{\theta}}^{B}| D,\bm{\theta})\) is centered at higher \(\alpha_{\bm{\theta}}^{B}\). However, smaller batch sizes are associated with smaller \(\sigma(\ell_{\bm{\theta}})\) (Figure 5.21, top right), which reduces the effective weight \(\gamma\) and weakens this preference. Consistently, Figure 5.22 (top left) shows that decreasing the batch size shifts the optimization toward models with lower \(\mu(\alpha_{\bm{\theta}}^{B})\).
The preceding arguments indicate that SGD exhibits an implicit bias toward solutions with smaller generalization error, driven by two effects: (i) a preference for models whose empirical loss is more concentrated around its expectation (smaller \(\sigma(\ell_{\bm{\theta}})\)); and (ii) an aversion to models exhibiting highly abnormal left-tail deviations (smaller \(\mu(\alpha_{\bm{\theta}}^{B})\)). Both effects strengthen as the mini-batch size decreases, so that small-batch SGD acts as an implicit regularizer that discourages convergence to poorly generalizing solutions. The empirical trends in Figures 5.21 and 5.22 accord with this interpretation.
SGD and \(\ell_2\) Regularization
To assess the theoretical claims from the preceding section, an ablation study is conducted on the effect of introducing \(\ell_2\) regularization into SGD. The \(\ell_2\) penalty (weight decay) adds a term proportional to the squared Euclidean norm of the parameters to the objective, yielding the surrogate loss \(\hat{L}(D,\bm{\theta})+\gamma\|\bm{\theta}\|_2^2\) with regularization strength \(\gamma>0\). By constraining the parameter norm, \(\ell_2\) regularization alters the optimization trajectory of SGD and systematically biases it toward models of smaller norm, which are typically associated with improved generalization.
A theoretical justification for this behavior is provided in Section 5.3.6, where an upper bound on the inverse rate function is given in terms of the \(\ell_2\) norm of the parameters. Informally, for \(s>0\), \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s)\;\le\; \sqrt{2Ms\,\|\bm{\theta}\|_2^2}, \end{equation} for a model–dependent constant \(M>0\). Hence, smaller parameter norms entail a smaller inverse rate, i.e., greater concentration of the empirical loss.
Under this analysis, \(\ell_2\) regularization is expected to reinforce the SGD bias toward models with more concentrated losses. This explicit bias is independent of the implicit mini-batch–induced bias discussed earlier, which arises because smaller batches increase \(\mathbb{V}(\alpha_{\bm{\theta}}^{B})\) and thereby modify the trade-off term in Equation \(\eqref{eq:batch_decomposition:bias2}\). Consistent with this separation, Figure 5.22 (top right) shows that \(\mathbb{V}(\alpha_{\bm{\theta}}^{B})\) is essentially unchanged when comparing plain SGD at \(m=500\) with \(\ell_2\)-regularized SGD at the same batch size. Consequently, the two effects compound: \(\ell_2\) regularization further pushes the optimizer toward models with higher loss concentration (lower variance). This pattern is observed in Figure 5.21 (top right): at \(m=500\), \(\ell_2\)-regularized SGD explores models with greater concentration than plain SGD; moreover, the concentration level resembles that achieved by plain SGD with a smaller batch size \(m=250\), which inherently induces stronger implicit regularization.
A second consequence concerns the bias toward models with fewer abnormal deviations. Because \(\ell_2\) regularization reduces the standard deviation \(\sigma(\ell_{\bm{\theta}})\), the effective trade-off parameter in the mapping \(g(\bm{\theta})=L(\bm{\theta})\), \(h(\bm{\theta})=\phi(\alpha_{\bm{\theta}}^{B})\), \(\gamma=\sigma(\ell_{\bm{\theta}})\) of Section 5.4.3.1 decreases, which weakens the tendency to favor models with large \(\mu(\alpha_{\bm{\theta}}^{B})\). Note that this \(\gamma\) denotes the trade-off parameter of that mapping, not the regularization strength introduced above. Empirically, Figure 5.22 (left) shows that \(\ell_2\)-regularized SGD with \(m=500\) visits models with smaller abnormality levels (lower \(\mu(\alpha_{\bm{\theta}}^{B})\)) than plain SGD at the same batch size, and achieves abnormality levels comparable to those of plain SGD with \(m=250\). Thus, \(\ell_2\) regularization simultaneously reduces the dispersion of the empirical loss and mitigates the propensity to select models exhibiting large abnormal deviations, in line with the theoretical predictions.
Related Work
The generalization properties of stochastic gradient descent (SGD) have been investigated from multiple angles. A prominent line of work attributes improved generalization to the tendency of SGD to converge to flat minima in the loss landscape. The notion of flat minima was introduced by Hochreiter & Schmidhuber (1997), who argued that solutions lying in wide, low-curvature regions exhibit better out-of-sample performance. Subsequent empirical and theoretical studies reported that small-batch SGD more frequently identifies flatter minima, whereas large-batch training is prone to sharper minima that correlate with degraded generalization (Keskar et al., 2017). Although the present analysis focuses on concentration properties of the empirical loss rather than the local geometry of the parameter space, the two perspectives are naturally connected: flatter minima are plausibly associated with predictors whose empirical losses display higher concentration and fewer abnormal deviations, thus aligning geometric flatness with distributional stability.
A complementary perspective emphasizes the implicit regularization induced by SGD. Neyshabur et al. (2015c) argued that SGD implicitly biases solutions toward smaller parameter norms, in agreement with capacity-control principles that link norm-based complexity to generalization. Related work formalized spectral- and norm-based measures that correlate with test performance in deep networks (Bartlett et al., 2017). This line of research connects to concentration-based accounts through results showing that models with smaller norms admit tighter distribution-dependent control of generalization error; for instance, Section 5.3 established bounds that tie reduced parameter norms to smaller inverse rate functions, implying greater concentration of the empirical loss.
Another body of literature approaches SGD through concentration inequalities and stability arguments, treating the empirical loss of each hypothesis as a random variable (Bartlett et al., 2017; Golowich et al., 2018; Kawaguchi et al., 2022; Liang et al., 2019; Neyshabur et al., 2017a). While illuminating, many such results rely on uniform upper bounds that are known to be loose or even vacuous in highly over-parameterized regimes (Gastpar et al., 2024; Nagarajan & Kolter, 2019b). In addition, these approaches often do not differentiate the model-specific concentration behavior within the hypothesis class, potentially obscuring performance differences driven by distribution-dependent effects (Casado et al., 2024). By contrast, the analysis developed here employs a decomposition that isolates terms governing concentration and abnormal deviations of the empirical loss for each candidate model, thereby offering a more granular, distribution-dependent account of when SGD favors predictors with lower generalization error.
Discussion and Limitations
A perspective based on LDT has been developed to study the implicit bias of stochastic gradient descent (SGD). The generalization error was decomposed into two distribution-dependent components—loss concentration and abnormality—which together account for the tendency of SGD to favor solutions with smaller test error. In contrast to full-batch gradient descent, which is biased toward poorly concentrated and more abnormal empirical losses, mini-batch SGD implicitly promotes models with tighter concentration and fewer abnormal deviations, thereby acting as an implicit regularizer. Empirical results with deep convolutional networks support these claims: smaller batch sizes are consistently associated with more concentrated empirical-loss distributions and reduced abnormality. The introduction of explicit \(\ell_2\) regularization further strengthens this tendency, indicating that implicit and explicit regularization mechanisms can be complementary.
The present analysis is grounded in standard assumptions (e.g., i.i.d. sampling and smooth losses), while modern deep learning often involves non-convex objectives and non-i.i.d. data. Extending the LDT framework to account for dependence, heavy tails, or non-stationarity constitutes an important direction. Another avenue is to compare the predictive utility of the LDT-based decomposition against traditional generalization bounds and to relate it more directly to geometric and norm-based explanations (e.g., flat minima and spectral complexity).
The analysis presented here is subject to several limitations. First, the Large Deviation Theory (LDT) framework presupposes i.i.d. sampling, whereas real-world datasets may exhibit correlations, distributional shifts, or temporal dependence that violate this assumption. Second, LDT yields asymptotic guarantees as the dataset size grows (\(n\to\infty\)); in finite-sample regimes typical of deep learning, sub-exponential correction terms \(o(n,a)\) may be non-negligible and can affect the tightness of the resulting characterizations. Third, several theoretical insights rely on approximations (e.g., Equations \(\eqref{eq:taylorApprox}\) and \(\eqref{eq:IinvTaylor}\)) that introduce simplifying assumptions; while the empirical evidence reported here indicates that these approximations produce consistent predictions, their validity may weaken in settings with highly non-smooth losses, heavy-tailed noise, or strong dependence, and results should be interpreted accordingly.
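The second limitation can be made concrete with a small numerical sketch. It assumes i.i.d. 0-1 losses following a Bernoulli distribution with mean \(p\), for which the Cramér rate function is the Bernoulli KL divergence and the exact tail probability is available in closed form. The setting, the function names, and the values of `p`, `a`, and `sizes` are illustrative assumptions and are not taken from the experiments of this chapter.

```python
import math


def kl_bernoulli(q, p):
    """KL(Ber(q) || Ber(p)): the Cramer rate function of a Bernoulli sample mean."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def exact_lower_tail(n, p, a):
    """Exact P(empirical loss <= p - a) for n i.i.d. Bernoulli(p) losses."""
    k_max = math.floor(n * (p - a) + 1e-9)  # small slack against float rounding
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(k_max + 1)
    )


def chernoff_bound(n, p, a):
    """Chernoff upper bound exp(-n * I(a)) on the same tail probability."""
    return math.exp(-n * kl_bernoulli(p - a, p))


def compare(p=0.3, a=0.1, sizes=(10, 50, 200, 1000)):
    """Return (n, exact tail, Chernoff bound, finite-sample gap) for each n."""
    rate = kl_bernoulli(p - a, p)
    rows = []
    for n in sizes:
        exact = exact_lower_tail(n, p, a)
        # -log(P)/n approaches the rate only as n grows; the difference is
        # the finite-sample correction that the asymptotic statement hides.
        gap = -math.log(exact) / n - rate
        rows.append((n, exact, chernoff_bound(n, p, a), gap))
    return rows
```

In this toy setting the bound holds for every `n`, but the gap between \(-\tfrac{1}{n}\log P\) and the rate is largest for small samples and shrinks only slowly, which is the kind of correction that may matter in finite-sample regimes.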
Conclusions
This chapter develops a unified theoretical and empirical framework to explain why and how modern neural networks generalize, even in over-parameterized regimes where classical statistical learning theory fails. Across Sections 5.1–5.4, the chapter connects three seemingly distinct mechanisms (diversity, smoothness, and stochasticity) under a common probabilistic foundation rooted in PAC–Bayesian and large-deviation principles.
Diversity and Ensemble Generalization.
This chapter first establishes that diversity among predictors is essential for generalization in ensemble learning. Using a PAC–Bayesian formulation, it introduces a unified diversity measure that decomposes the ensemble generalization error into two interpretable terms:
the average individual model error, and
a diversity-dependent correction that captures correlations among predictors.
This framework rigorously explains the empirical success of deep ensembles: randomization mechanisms such as independent initialization or stochastic optimization encourage functional diversity, improving test performance. The theoretical analysis (Theorem 5.1 and Theorem 5.7) and experiments confirm that ensembles with greater diversity achieve lower generalization error, even when their individual networks operate in the interpolation regime.
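The two-term structure described above can be illustrated with the classical ambiguity decomposition for squared loss (Krogh and Vedelsby), in which the error of a uniformly averaged ensemble equals the average member error minus the spread of the members around the ensemble prediction. This is a simpler, exact identity used here as a stand-in; it is not the PAC–Bayesian diversity measure introduced in this chapter, and the function name is hypothetical.

```python
def ambiguity_decomposition(predictions, target):
    """Decompose the squared error of a uniformly averaged ensemble.

    predictions: member predictions for a single input.
    target: the true value for that input.
    Returns (ensemble_error, avg_member_error, diversity), which satisfy
    ensemble_error == avg_member_error - diversity up to rounding.
    """
    k = len(predictions)
    ensemble = sum(predictions) / k
    avg_member_error = sum((f - target) ** 2 for f in predictions) / k
    # Diversity: mean squared deviation of members from the ensemble mean.
    diversity = sum((f - ensemble) ** 2 for f in predictions) / k
    ensemble_error = (ensemble - target) ** 2
    return ensemble_error, avg_member_error, diversity
```

For a fixed average member error, a larger diversity term yields a smaller ensemble error, which mirrors the qualitative message of the chapter's decomposition under this simplified loss.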
Generalization Error and PAC–Chernoff Bounds.
The next section extends classical generalization bounds by deriving distribution-dependent PAC–Chernoff bounds that remain meaningful at interpolation. By expressing the generalization error in terms of a rate function from large-deviation theory, the analysis captures how model smoothness, regularization, and invariances determine the tightness of the bounds. This approach:
explains the double-descent phenomenon through the shape of the rate function,
shows that explicit regularizers (e.g., weight decay, distance-from-initialization, or gradient penalties) improve generalization by enlarging the rate function, and
unifies regularization, data augmentation, and architectural invariances under a single smoothness-based framework.
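The role of the rate function in these points can be sketched numerically. The code below assumes the standard Cramér construction: a cumulant \(J(\lambda)=\log\mathbb{E}[\exp(\lambda(L-\ell))]\) of the centered loss, the rate as its Legendre transform, and the inverse rate as \(\inf_{\lambda>0}(J(\lambda)+s)/\lambda\). These definitions, the plug-in estimate from a finite sample of losses, and the grid over \(\lambda\) are assumptions made for illustration and may differ in detail from the definitions used in the chapter.

```python
import math


def cumulant(losses, lam):
    """Plug-in estimate of J(lam) = log E[exp(lam * (L - loss))], L = mean loss."""
    mean = sum(losses) / len(losses)
    return math.log(sum(math.exp(lam * (mean - l)) for l in losses) / len(losses))


def rate(losses, a, lambdas):
    """Grid approximation of the rate I(a) = sup_{lam>0} lam * a - J(lam)."""
    return max(lam * a - cumulant(losses, lam) for lam in lambdas)


def inverse_rate(losses, s, lambdas):
    """Grid approximation of I^{-1}(s) = inf_{lam>0} (J(lam) + s) / lam."""
    return min((cumulant(losses, lam) + s) / lam for lam in lambdas)


# Illustrative grid over lambda (an assumption; finer grids tighten the estimate).
LAMBDAS = [0.1 * k for k in range(1, 501)]

# Two toy loss samples with the same mean but different dispersion.
wide = [0.0, 1.0] * 50      # losses spread far from their mean
narrow = [0.4, 0.6] * 50    # losses concentrated around their mean
```

Calling `inverse_rate(narrow, s, LAMBDAS)` returns a smaller value than `inverse_rate(wide, s, LAMBDAS)` for the same `s`, and `rate(narrow, a, LAMBDAS)` is correspondingly larger: a more concentrated loss has a larger rate function and therefore a tighter bound of this type.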
Implicit Bias of Stochastic Gradient Descent.
The final section analyzes SGD as a stochastic process using the same large-deviation perspective. It argues that mini-batch stochasticity introduces an implicit regularization bias favoring models with higher loss concentration and fewer abnormal deviations, properties that are plausibly associated with flatter minima, and hence better generalization. A decomposition of the empirical loss in terms of the expected loss, rate function, and an “abnormality” term clarifies why full-batch gradient descent tends to overfit while mini-batch SGD tends to select solutions with smaller test error.
Overall Synthesis
Together, these sections present a coherent picture of generalization in modern deep learning:
Diversity explains how ensembles reduce variance through error de-correlation.
Smoothness and rate functions explain why certain interpolating models generalize better than others.
Stochastic dynamics explain how optimization implicitly regularizes without explicit penalties.
In essence, generalization arises from the interplay of diversity, smoothness, and concentration, all governed by probabilistic laws derived from PAC–Bayesian and large-deviation theory. This synthesis transcends classical VC and uniform-convergence analyses, offering a distribution-dependent understanding of deep learning generalization that bridges theory and practice.