Publications | Luis A. Ortega

2026

Scalable Linearized Laplace Approximation via Surrogate Neural Kernel

Luis A. Ortega, Simon Rodriguez-Santana, Daniel Hernandez-Lobato

ESANN 2026 · Spotlight talk

Learns a surrogate neural kernel to avoid large Jacobians while estimating uncertainty for pre-trained networks.

LLA NTK Uncertainty Code

Abstract

We introduce a scalable method to approximate the kernel of the Linearized Laplace Approximation (LLA). For this, we use a surrogate deep neural network (DNN) that learns a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK). This avoids the need to compute large Jacobians. Training relies solely on efficient Jacobian-vector products, which makes it possible to compute predictive uncertainty on large-scale pre-trained DNNs. Experimental results show similar or improved uncertainty estimation and calibration compared to existing LLA approximations. Moreover, biasing the learned kernel significantly enhances out-of-distribution detection. This highlights the benefits of the proposed method for finding better kernels than the NTK in the context of LLA to compute prediction uncertainty given a pre-trained DNN.
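
As a rough illustration of the kernel replacement described in the abstract, the relation can be written as follows for a scalar-output network. The symbols (P parameters, M surrogate features, surrogate parameters ω) are notation assumed here for illustration and are not taken from the paper.

```latex
% NTK of a pre-trained network f_\theta at its trained parameters \hat\theta (scalar output assumed)
k_{\mathrm{NTK}}(x, x') = J_{\hat\theta}(x)\, J_{\hat\theta}(x')^{\top},
\qquad
J_{\hat\theta}(x) = \nabla_{\theta} f_{\theta}(x)\big|_{\theta = \hat\theta} \in \mathbb{R}^{1 \times P}

% Surrogate features \phi_\omega whose inner product stands in for the NTK
k_{\mathrm{NTK}}(x, x') \approx \phi_{\omega}(x)^{\top} \phi_{\omega}(x'),
\qquad
\phi_{\omega}(x) \in \mathbb{R}^{M}, \quad M \ll P
```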

Code

2026

Improving the Linearized Laplace Approximation via Quadratic Approximations

Pedro Jimenez, Luis A. Ortega, Pablo Morales-Alvarez, Daniel Hernandez-Lobato

ESANN 2026

Rank-one quadratic factors improve fidelity to the full Laplace posterior while keeping prediction linearized.

QLA Laplace Regression

Abstract

Deep neural networks (DNNs) often produce overconfident out-of-distribution predictions, motivating Bayesian uncertainty quantification. The Linearized Laplace Approximation (LLA) achieves this by linearizing the DNN and applying Laplace inference to the resulting model. Importantly, the linear model is also used for prediction. We argue this linearization in the posterior may degrade fidelity to the true Laplace approximation. To alleviate this problem, without significantly increasing the computational cost, we propose the Quadratic Laplace Approximation (QLA). QLA approximates each second-order factor in the approximate Laplace log-posterior using a rank-one factor obtained via efficient power iterations. QLA is expected to yield a posterior precision closer to that of the full Laplace without forming the full Hessian, which is typically intractable. For prediction, QLA also uses the linearized model. Empirically, QLA yields modest yet consistent uncertainty estimation improvements over LLA on five regression datasets.
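
The rank-one factors mentioned in the abstract rely on power iteration, which can be sketched generically as below. This is not the authors' implementation: the function name `top_eigenpair`, the toy matrix `A`, and the iteration count are assumptions made for illustration. The routine only needs matrix-vector products, which is why the same idea applies when the full matrix is never formed.

```python
import numpy as np

def top_eigenpair(matvec, dim, num_iters=200, seed=0):
    """Estimate the dominant eigenpair of a symmetric PSD operator.

    `matvec` is a hypothetical callable that returns A @ v for a vector v.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = matvec(v)
        norm = np.linalg.norm(w)
        if norm == 0.0:  # v lies in the null space; nothing to amplify
            break
        v = w / norm
    eigval = float(v @ matvec(v))  # Rayleigh quotient at the final iterate
    return eigval, v

# Toy symmetric PSD matrix standing in for a second-order factor (assumption).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 0.5]])

eigval, v = top_eigenpair(lambda x: A @ x, dim=3)

# Rank-one approximation built from the dominant eigenpair.
rank_one = eigval * np.outer(v, v)
```

Here `rank_one` keeps only the dominant curvature direction of `A`; how such factors are combined into a posterior precision is specific to the paper and is not reproduced here.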

2026

A Large Deviation Theory Analysis on the Implicit Bias of SGD

Luis A. Ortega, Andres R. Masegosa

Neurocomputing 2026

Uses large-deviation theory to explain why mini-batch SGD may prefer better-concentrated solutions.

SGD LDT Generalization Code

Abstract

Stochastic Gradient Descent (SGD) is the primary optimization method used in deep learning, yet the reasons behind its ability to select models that generalize effectively remain unclear. This paper develops a new perspective based on Large Deviation Theory (LDT). We show that the generalization error can be decomposed into three terms: the expected loss, a component that reflects the concentration of the empirical loss around its mean, and a component that captures the abnormality of deviations arising from stochastic sampling. This decomposition highlights a key difference between optimization methods: while full-batch Gradient Descent tends to exploit poorly concentrated and abnormal fluctuations, often leading to overfitting, mini-batch SGD naturally biases the search towards models with tighter concentration and fewer abnormal deviations. The analysis relies on standard assumptions such as i.i.d. data and smooth loss functions. Experiments with deep convolutional networks support the theoretical findings, showing that smaller batch sizes and l2 regularization reinforce the preference for models with smaller generalization error. These results position LDT as a useful tool for understanding implicit regularization in SGD and suggest directions for extending this perspective to broader machine learning settings.

Code

2025

PAC-Chernoff Bounds: Understanding Generalization in the Interpolation Regime

Andres R. Masegosa, Luis A. Ortega

JAIR · ECAI 2025 Spotlight

A distribution-dependent PAC-Chernoff bound and smoothness framework for over-parameterized interpolators.

PAC-Chernoff Interpolation Bounds Code

Abstract

This paper introduces a distribution-dependent PAC-Chernoff bound that exhibits perfect tightness for interpolators, even within over-parameterized model classes. This bound, which relies on basic principles of Large Deviation Theory, defines a natural measure of the smoothness of a model, characterized by simple real-valued functions. Building upon this bound and the new concept of smoothness, we present a unified theoretical framework revealing why certain interpolators show exceptional generalization, while others falter. We theoretically show how a wide spectrum of modern learning methodologies, encompassing techniques such as l2-norm, distance-from-initialization and input-gradient regularization, in combination with data augmentation, invariant architectures, and over-parameterization, collectively guide the optimizer toward smoother interpolators, which, according to our theoretical framework, are the ones exhibiting superior generalization performance. This study shows that distribution-dependent bounds serve as a powerful tool to understand the complex dynamics behind the generalization capabilities of over-parameterized interpolators.

PDF Code

2024

PAC-Bayes-Chernoff Bounds for Unbounded Losses

Ioar Casado, Luis A. Ortega, Aritz Perez, Andres R. Masegosa

NeurIPS 2024

Extends Cramér-Chernoff-style bounds to PAC-Bayesian settings with unbounded losses.

PAC-Bayes Bounds Unbounded loss

Abstract

We introduce a new PAC-Bayes oracle bound for unbounded losses that extends Cramér-Chernoff bounds to the PAC-Bayesian setting. The proof technique relies on controlling the tails of certain random variables involving the Cramér transform of the loss. Our approach naturally leverages properties of Cramér-Chernoff bounds, such as exact optimization of the free parameter in many PAC-Bayes bounds. We highlight several applications of the main theorem. Firstly, we show that our bound recovers and generalizes previous results. Additionally, our approach allows working with richer assumptions that result in more informative and potentially tighter bounds. In this direction, we provide a general bound under a new model-dependent assumption from which we obtain bounds based on parameter norms and log-Sobolev inequalities. Notably, many of these bounds can be minimized to obtain distributions beyond the Gibbs posterior and provide novel theoretical coverage to existing regularization techniques.
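
To make the Cramér transform concrete, the sketch below evaluates it numerically for a single fixed model with a Bernoulli loss, a toy setting chosen here for illustration and not taken from the paper. The function names, the grid search over the free parameter, and the values of `p`, `a`, and `n` are all assumptions. In this special case the transform is known to coincide with a Bernoulli KL divergence, which gives an independent check.

```python
import math

def cgf(lam, p):
    """Log moment generating function of (p - loss) for a Bernoulli(p) loss."""
    return lam * p + math.log(1.0 - p + p * math.exp(-lam))

def cramer_transform(a, p, lam_max=5.0, steps=50_000):
    """Grid approximation of the supremum over lam in [0, lam_max] of lam * a - cgf(lam)."""
    grid = (lam_max * i / steps for i in range(steps + 1))
    return max(lam * a - cgf(lam, p) for lam in grid)

def bernoulli_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

p, a, n = 0.3, 0.1, 200           # expected loss, deviation, sample size (toy values)
rate = cramer_transform(a, p)     # optimized over the free parameter lam
chernoff_bound = math.exp(-n * rate)  # bounds P(expected loss - empirical loss >= a)
```

The grid search stands in for the exact optimization of the free parameter; for this toy loss, `rate` should agree with `bernoulli_kl(p - a, p)` up to grid resolution.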

PDF

2024

Variational Linearized Laplace Approximation for Bayesian Deep Learning

Luis A. Ortega, Simon Rodriguez-Santana, Daniel Hernandez-Lobato

ICML 2024

Approximates LLA via sparse variational Gaussian processes with sub-linear training costs.

Bayesian DL Variational GP LLA Code

Abstract

The Linearized Laplace Approximation (LLA) has been recently used to perform uncertainty estimation on the predictions of pre-trained deep neural networks (DNNs). However, its widespread application is hindered by significant computational costs, particularly in scenarios with a large number of training points or DNN parameters. Consequently, additional approximations of LLA, such as Kronecker-factored or diagonal approximate GGN matrices, are utilized, potentially compromising the model's performance. To address these challenges, we propose a new method for approximating LLA using a variational sparse Gaussian Process (GP). Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN. Furthermore, it allows for efficient stochastic optimization, which results in sub-linear training time in the size of the training dataset. Specifically, its training cost is independent of the number of training points. We compare our proposed method against accelerated LLA (ELLA), which relies on the Nyström approximation, as well as other LLA variants employing the sample-then-optimize principle. Experimental results, both on regression and classification datasets, show that our method outperforms these already existing efficient variants of LLA, both in terms of the quality of the predictive distribution and in terms of total computational time.

PDF Code

2024

The Cold Posterior Effect Indicates Underfitting

Yijie Zhang, Yi-Shan Wu, Luis A. Ortega, Andres R. Masegosa

TMLR 2024

Reframes the cold posterior effect as evidence of underfitting in misspecified Bayesian posteriors.

Bayesian DL Posterior Underfitting Code

Abstract

The cold posterior effect (CPE) (Wenzel et al., 2020) in Bayesian deep learning shows that, for posteriors with a temperature T<1, the resulting posterior predictive could have better performance than the Bayesian posterior (T=1). As the Bayesian posterior is known to be optimal under perfect model specification, many recent works have studied the presence of CPE as a model misspecification problem, arising from the prior and/or from the likelihood. In this work, we provide a more nuanced understanding of the CPE as we show that misspecification leads to CPE only when the resulting Bayesian posterior underfits. In fact, we theoretically show that if there is no underfitting, there is no CPE. Furthermore, we show that these tempered posteriors with T<1 are indeed proper Bayesian posteriors with a different combination of likelihood and prior parameterized by T. This observation validates the adjustment of the temperature hyperparameter T as a straightforward approach to mitigate underfitting in the Bayesian posterior. In essence, we show that by fine-tuning the temperature T we implicitly utilize alternative Bayesian posteriors, albeit with less misspecified likelihood and prior distributions.
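
The claim that a tempered posterior is itself a proper Bayesian posterior can be checked by hand in a conjugate toy model. The sketch below assumes a Gaussian likelihood with known noise variance, a zero-mean Gaussian prior, and tempering applied to the likelihood only with exponent 1/T. These modelling choices, the function name, and the data are illustrative assumptions and not the paper's setup.

```python
import math

def gaussian_posterior(ys, noise_var, prior_var, temperature=1.0):
    """Posterior over a Gaussian mean when the likelihood is raised to 1/temperature.

    Assumes y_i ~ N(theta, noise_var) and a prior theta ~ N(0, prior_var).
    Returns the posterior mean and variance.
    """
    lam = 1.0 / temperature
    precision = 1.0 / prior_var + lam * len(ys) / noise_var
    mean = (lam * sum(ys) / noise_var) / precision
    return mean, 1.0 / precision

ys = [1.2, 0.8, 1.5, 0.9]  # toy observations (assumption)

# Cold posterior: T = 0.5 with unit noise variance.
cold_mean, cold_var = gaussian_posterior(ys, noise_var=1.0, prior_var=4.0, temperature=0.5)

# Ordinary Bayesian posterior (T = 1) under a likelihood with noise variance 1.0 * T.
alt_mean, alt_var = gaussian_posterior(ys, noise_var=0.5, prior_var=4.0, temperature=1.0)

# Reference: ordinary Bayesian posterior with the original likelihood.
bayes_mean, bayes_var = gaussian_posterior(ys, noise_var=1.0, prior_var=4.0)
```

In this toy model the cold posterior matches the untempered posterior of a sharper likelihood, and it is more concentrated than the T=1 posterior, so it fits the data more tightly.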

PDF Code

2023

Deep Variational Implicit Processes

Luis A. Ortega, Simon Rodriguez-Santana, Daniel Hernandez-Lobato

ICLR 2023

A multi-layer implicit-process generalization for flexible function-space inference.

DVIP Implicit Processes Inference Code

Abstract

Implicit processes (IPs) are a generalization of Gaussian processes (GPs). IPs may lack a closed-form expression but are easy to sample from. Examples include, among others, Bayesian neural networks or neural samplers. IPs can be used as priors over functions, resulting in flexible models with well-calibrated prediction uncertainty estimates. Methods based on IPs usually carry out function-space approximate inference, which overcomes some of the difficulties of parameter-space approximate inference. Nevertheless, the approximations employed often limit the expressiveness of the final model, resulting, e.g., in a Gaussian predictive distribution, which can be restrictive. We propose here a multi-layer generalization of IPs called the Deep Variational Implicit Process (DVIP). This generalization is similar to that of deep GPs over GPs, but it is more flexible due to the use of IPs as the prior distribution over the latent functions. We describe a scalable variational inference algorithm for training DVIP and show that it outperforms previous IP-based methods and also deep GPs. We support these claims via extensive regression and classification experiments. We also evaluate DVIP on large datasets with up to several million data instances to illustrate its good scalability and performance.

PDF Code

2022

Diversity and Generalization in Neural Network Ensembles

Luis A. Ortega, Rafael Cabanas, Andres R. Masegosa

AISTATS 2022

Connects ensemble diversity, generalization error, and common model-combination strategies.

Ensembles Diversity Generalization Code

Abstract

Ensembles are widely used in machine learning and, usually, provide state-of-the-art performance in many prediction tasks. From the very beginning, the diversity of an ensemble has been identified as a key factor for the superior performance of these models. But the exact role that diversity plays in ensemble models is poorly understood, especially in the context of neural networks. In this work, we combine and expand previously published results in a theoretically sound framework that describes the relationship between diversity and ensemble performance for a wide range of ensemble methods. More precisely, we provide sound answers to the following questions: how to measure diversity, how diversity relates to the generalization error of an ensemble, and how diversity is promoted by neural network ensemble algorithms. This analysis covers three widely used loss functions, namely, the squared loss, the cross-entropy loss, and the 0-1 loss; and two widely used model combination strategies, namely, model averaging and weighted majority vote. We empirically validate this theoretical analysis with neural network ensembles.
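
For the squared loss with model averaging, the link between diversity and ensemble error can be illustrated with the classical ambiguity decomposition: the loss of the averaged prediction equals the average member loss minus the variance of the member predictions. The sketch below checks this identity on toy numbers. It is a generic illustration, and the diversity measures the paper derives for the cross-entropy and 0-1 losses are not reproduced here. The function name and values are assumptions.

```python
def ambiguity_decomposition(predictions, target):
    """Return (ensemble loss, average member loss, diversity) for the squared loss."""
    m = len(predictions)
    ensemble_pred = sum(predictions) / m  # model averaging
    ensemble_loss = (ensemble_pred - target) ** 2
    average_loss = sum((f - target) ** 2 for f in predictions) / m
    # Diversity: spread of the member predictions around the ensemble prediction.
    diversity = sum((f - ensemble_pred) ** 2 for f in predictions) / m
    return ensemble_loss, average_loss, diversity

# Three hypothetical member predictions for a single input with target 3.0.
ensemble_loss, average_loss, diversity = ambiguity_decomposition([2.0, 3.5, 4.0], target=3.0)
```

Because `diversity` is never negative, the averaged prediction is never worse than the average member under the squared loss, and more spread among members widens that gap.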

PDF Code

2022

Correcting Model Bias with Sparse Implicit Processes

Simon Rodriguez-Santana, Luis A. Ortega, Daniel Hernandez-Lobato, Bryan Zaldivar

ICML Workshop 2022

Shows sparse implicit processes can correct model bias when the assumed mechanism differs from the data.

SIP Implicit Processes Inference Code

Abstract

Model selection in machine learning (ML) is a crucial part of the Bayesian learning procedure. Model choice may impose strong biases on the resulting predictions, which can hinder the performance of methods such as Bayesian neural networks and neural samplers. On the other hand, newly proposed approaches for Bayesian ML exploit features of approximate inference in function space with implicit stochastic processes (a generalization of Gaussian processes). The approach of Sparse Implicit Processes (SIP) is particularly successful in this regard, since it is fully trainable and achieves flexible predictions. Here, we expand on the original experiments to show that SIP is capable of correcting model bias when the data generating mechanism differs strongly from the one implied by the model. We use synthetic datasets to show that SIP is capable of providing predictive distributions that reflect the data better than the exact predictions of the initial, but wrongly assumed, model.

PDF Code

Ongoing

Flow-Transformed Implicit Processes for Function-Space Variational Inference

Luis A. Ortega, Andrés R. Masegosa, Thomas D. Nielsen

arXiv 2026

Uses normalizing flows over finite sampled-function combinations to enrich function-space variational inference with implicit-process priors.

FTIP Implicit Processes Variational inference Preprint

Abstract

Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box α objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.

Preprint

Ongoing

Contrastive Linearized Laplace: Loss Geometry, Valid Repairs, and Reliable Cores

Reviews when linearized Laplace is valid for contrastive objectives and uses repaired posteriors to identify stable downstream cores.

Contrastive LLA Repairs Reliable cores

Abstract

Contrastive learning is based on relations between examples, such as pairs, neighbors, or candidate sets, which makes uncertainty estimation different from standard supervised learning. We study linearized Laplace approximations for contrastive models and show that their validity depends on the curvature of the contrastive loss. InfoNCE and logit-based losses give valid positive semidefinite curvature, while distance-based and margin losses can create invalid negative curvature for negative pairs. We propose simple spectral and radial repairs that make the posterior well defined without retraining the encoder. We also show that embedding variance alone is not a reliable measure of uncertainty for contrastive models. A point may have high or low variance without this reflecting whether its nearest neighbors, cluster assignment, or retrieval decision is stable. These decisions depend on score margins, directions in embedding space, and correlations between competing relations. In this work, we propose to propagate LLA uncertainty to the downstream task by sampling encoders from the repaired posterior. This gives posterior probabilities for relations such as coassignment, neighborhood agreement, or prediction stability. We use these probabilities to select reliable cores: subsets of the downstream solution whose assignments remain stable under posterior perturbations. Experiments on synthetic data, EMNIST, CIFAR-100, genome clustering, and pathology show that these reliable cores give useful accuracy-coverage tradeoffs.
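
One way to picture a spectral repair is as a projection of an indefinite curvature matrix onto the positive semidefinite cone by clipping its negative eigenvalues. The sketch below assumes that form; the abstract does not specify the repair, so the function name `spectral_repair`, the clipping rule, the toy matrix, and the unit prior precision are all illustrative assumptions.

```python
import numpy as np

def spectral_repair(curvature, floor=0.0):
    """Clip eigenvalues of a symmetric matrix from below at `floor`."""
    sym = 0.5 * (curvature + curvature.T)  # guard against small asymmetries
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.maximum(eigvals, floor)
    return (eigvecs * clipped) @ eigvecs.T  # V diag(clipped) V^T

# Toy curvature with one negative direction, as a margin-type loss might produce.
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
prior_precision = 1.0

# Without repair, H + prior precision has a zero eigenvalue, so it cannot
# serve as the precision matrix of a Gaussian posterior.
unrepaired_min_eig = np.linalg.eigvalsh(H + prior_precision * np.eye(2)).min()

H_repaired = spectral_repair(H)
posterior_cov = np.linalg.inv(H_repaired + prior_precision * np.eye(2))
```

After the repair, the negative direction contributes no curvature and the posterior variance along it falls back to the prior variance, while the encoder weights themselves are left untouched.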

Ongoing

Fixed-Mean Gaussian Processes for ad-hoc Bayesian Deep Learning

Converts models into Bayesian predictors by constructing a Gaussian process with a fixed predictive mean.

Preprint

Ongoing

Regularization as Estimation, A PAC-Bayes-Chernoff Approach

A prescriptive framework that reframes regularization as a statistical estimation problem.

Ongoing

Revisiting the Marginal Likelihood through a PAC-Bayesian Lens

Generalization in Bayesian models depends on factors beyond marginal likelihood alone.

Published work and active research threads.
