Conclusions and Future Work

This dissertation investigates the intersection of Bayesian inference and modern deep learning, with the central goal of improving our understanding of how neural networks generalize and how their uncertainty can be quantified more accurately.

Conclusions

Deep Variational Implicit Processes.

Chapter 3 developed the Deep Variational Implicit Process (DVIP), a scalable Bayesian model that extends the idea of implicit processes into deep architectures. The model is built upon the notion of implicit processes, which define distributions over functions that are easy to sample from but lack tractable likelihoods. By composing these processes hierarchically, the DVIP framework produces non-Gaussian predictive distributions that are both expressive and amenable to variational inference in function space. Experiments across regression and classification tasks showed that DVIPs perform on par with, or better than, deep Gaussian processes (DGPs) while being considerably more efficient. The experiments further illustrated how hierarchical structure, prior adaptation, and domain-specific priors (e.g., convolutional priors) contribute to improved expressivity and calibration. Overall, DVIP offers a principled and practical route toward uncertainty-aware deep learning.

Post-hoc Uncertainty Estimation.

Whereas DVIP integrates uncertainty estimation during training, Chapter 4 focused on methods to equip pre-trained deterministic networks with Bayesian uncertainty estimates after training. Two distinct approaches were introduced: the Variational Linearized Laplace Approximation (VaLLA) and the Fixed-Mean Gaussian Process (FMGP). Both derive from a sparse GP formulation that allows fixing the predictive mean to that of a pre-trained network, effectively transforming it into a calibrated probabilistic predictor without retraining. VaLLA extends the classical Laplace approximation to function space, offering scalable uncertainty estimates whose computational cost is independent of dataset size. FMGP, in contrast, treats the deterministic model as a fixed-mean GP, learning only the covariance structure—resulting in minimal overhead and broad architectural compatibility. Experiments demonstrated that both methods provide well-calibrated predictions on large-scale benchmarks, including ImageNet and molecular property datasets for FMGPs, illustrating practical ways to incorporate Bayesian uncertainty into existing models.

Generalization in Neural Networks.

The final methodological component, developed in Chapter 5, turned to one of the most fundamental open problems in modern deep learning: understanding why over-parameterized models generalize. The chapter proposed a unified theoretical framework linking three key factors—diversity, smoothness, and stochasticity—under a probabilistic formulation inspired by PAC–Bayes and large-deviation theory.

Section 5 demonstrated that functional diversity among ensemble members is crucial for good generalization. A decomposition of the generalization error in terms of predictor diversity explained why mechanisms such as random initialization or stochastic training often lead to better test performance.
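As a concrete illustration of how diversity enters an error decomposition, the sketch below uses the classical ambiguity decomposition for squared loss: the error of the averaged predictor equals the average member error minus the average spread of the members around the ensemble mean. This is a stand-in chosen for simplicity and is not necessarily the decomposition derived in Chapter 5; the function name, the synthetic data, and the noise scale are all hypothetical.

```python
import numpy as np

def ambiguity_decomposition(preds, y):
    """Squared-loss ambiguity decomposition for an averaging ensemble.

    preds: array of shape (M, N), predictions of M members on N points.
    y:     array of shape (N,), regression targets.
    Returns (ensemble_error, avg_member_error, diversity), which satisfy
    ensemble_error = avg_member_error - diversity.
    """
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    ensemble = preds.mean(axis=0)                    # averaged predictor
    ensemble_error = np.mean((ensemble - y) ** 2)    # error of the ensemble
    avg_member_error = np.mean((preds - y) ** 2)     # mean error of the members
    diversity = np.mean((preds - ensemble) ** 2)     # spread around the ensemble
    return ensemble_error, avg_member_error, diversity

# Synthetic example (assumed data): five members with independent noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
preds = y + rng.normal(scale=0.5, size=(5, 200))

ens_err, avg_err, div = ambiguity_decomposition(preds, y)
print(f"ensemble error   : {ens_err:.4f}")
print(f"avg member error : {avg_err:.4f}")
print(f"diversity        : {div:.4f}")
```

Because the diversity term is non-negative, the ensemble can never do worse than its average member under squared loss, and the gap grows as the members disagree more. This is the sense in which mechanisms that decorrelate members, such as random initialization, can help.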

Section 5.3 extended these results through PAC–Chernoff bounds that remain informative even in interpolation regimes. These bounds express generalization error through a rate function describing concentration of empirical loss, offering a probabilistic explanation for phenomena like double descent. Classical regularization methods—including weight decay and distance-from-initialization constraints—emerged naturally within this framework as means of enlarging the rate function, thereby improving smoothness and generalization.
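To make the notion of a rate function concrete, the following sketch assumes the standard Cramér form: I(a) = sup over lam > 0 of [lam * a - J(lam)], where J(lam) = ln E[exp(lam * (L - loss))] is the cumulant generating function of the centered loss and L is the expected loss. This notation is an assumption and may differ from that of Chapter 5. For a 0-1 loss that is Bernoulli with mean p, the supremum has a known closed form, KL(p - a || p), which the numerical grid search below should reproduce. All function names and parameter values are hypothetical.

```python
import math

def cumulant(lam, p):
    """J(lam) = ln E[exp(lam * (L - loss))] for loss ~ Bernoulli(p), L = p."""
    return lam * p + math.log(1.0 - p + p * math.exp(-lam))

def rate_function(a, p, lam_max=20.0, steps=20_000):
    """Approximate I(a) = sup_{lam > 0} [lam * a - J(lam)] by grid search."""
    best = 0.0  # the objective tends to 0 as lam -> 0, so 0 is a valid floor
    for i in range(1, steps + 1):
        lam = lam_max * i / steps
        best = max(best, lam * a - cumulant(lam, p))
    return best

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), with 0 < q, p < 1."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

p = 0.3   # assumed expected 0-1 loss of the model
a = 0.1   # assumed gap between expected and empirical loss
numeric = rate_function(a, p)
closed_form = kl_bernoulli(p - a, p)
print(f"grid search : {numeric:.6f}")
print(f"closed form : {closed_form:.6f}")
```

A larger rate function at a given gap a means that such a gap between expected and empirical loss is exponentially less likely, which is the mechanism through which the bounds above relate concentration of the empirical loss to generalization.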

Section 5.4 examined stochastic gradient descent (SGD) through the same lens, showing that stochasticity itself functions as an implicit regularizer. By decomposing the empirical loss into its expected value, a rate function, and an abnormality term, the analysis clarified how mini-batch SGD tends to prefer flatter minima and more concentrated loss distributions, whereas full-batch optimization is more prone to overfitting. Collectively, these insights established a probabilistic, distribution-dependent view of generalization that ties together theory, optimization dynamics, and empirical findings.

In summary, this dissertation shows that probabilistic reasoning—when extended beyond its traditional scope—offers both practical tools for scalable uncertainty quantification and theoretical insight into generalization in deep learning. By bringing together Bayesian inference, large-deviation theory, and function-space modeling, this work contributes to a unified probabilistic perspective on learning that bridges theory, algorithms, and empirical behavior.

Future Work

Several natural extensions stem from this work.

Advances in Deep Variational Implicit Processes.

Future research on DVIP could explore alternative approximate inference techniques that do not rely on Gaussian or GP-based approximations. For instance, normalizing-flow or diffusion-based variational families could be used to model richer posterior dependencies in function space. Furthermore, richer base models beyond Bayesian linear approximations could be explored; sparse implicit processes (SIPs) offer a promising direction for improving posterior approximations. Beyond inference, DVIPs could benefit from more structured priors, such as recurrent or attention-based priors for sequence modeling, or convolutional priors tailored to spatial data. Integrating DVIP into autoencoder or generative architectures also represents a compelling opportunity to unify representation learning and uncertainty quantification within a single probabilistic framework.

Extensions of Post-hoc Uncertainty Methods.

For post-hoc methods such as VaLLA and FMGP, several improvements are conceivable. In VaLLA, more flexible variational approximations could replace the Gaussian assumption on the linearized posterior, potentially improving calibration in highly non-linear regions. For FMGP, future work could focus on designing task-specific kernels—particularly for images or spatio-temporal data—where structure-aware kernels can capture meaningful correlations between feature activations. Another interesting direction is the approximation of the Jacobian kernel, which could provide computationally efficient uncertainty estimates consistent with the local geometry of the network. Together, these developments would broaden the applicability of post-hoc Bayesianization methods to a wider range of architectures and modalities.

Future Theoretical Directions.

The theoretical framework in Chapter 5 also opens several avenues for further investigation. One direction involves relaxing some of the simplifying assumptions used in the large-deviation analysis, such as independence between samples, to better capture the behavior of modern training setups. Extending the analysis to structured data distributions, correlated noise, or non-i.i.d. regimes could provide a more realistic description of generalization in deep networks. Another promising line of work is connecting the PAC–Chernoff framework with complementary approaches, such as information-theoretic generalization bounds or algorithmic stability theory. Finally, the rate-function formalism could be applied to study other phenomena in deep learning—such as transfer learning, continual learning, or the behavior of large language models—where distribution-dependent generalization remains poorly understood.