D

BayesiPy: Post-hoc Bayesian Inference for Pre-trained Neural Networks

Introduction

In contemporary deep learning, deterministic neural networks often achieve remarkable accuracy but remain overconfident in their predictions, especially under distributional shift. Estimating epistemic uncertainty—the uncertainty due to limited knowledge of model parameters—has therefore become a key research focus. Traditional Bayesian Neural Networks (BNNs) model a posterior over parameters \(P(\bm{\theta}| \mathcal{D})\), yet this is computationally prohibitive for large-scale architectures.

BayesiPy is a Python library (available at www.github.com/Ludvins/BayesiPy) providing a unified framework for post-hoc uncertainty estimation. Rather than retraining or modifying the backbone network, BayesiPy wraps a pre-trained model \(f(\mathbf{x};\bm{\theta}_{\mathrm{MAP}})\) with a Bayesian posterior approximation. This converts a deterministic predictor into a calibrated probabilistic model whose output includes both a predictive mean and an uncertainty estimate.

BayesiPy consolidates several state-of-the-art post-hoc Bayesian approximations, each offering distinct trade-offs between fidelity and scalability:

  1. Linearized Laplace Approximations (LLA) – including full, layerwise, subnetwork, and last-layer variants with different curvature approximations.

  2. Accelerated Linearized Laplace Approximation (ELLA) – a scalable, low-rank extension of LLA (Deng et al., 2022).

  3. Variational Linearized Laplace Approximation (VaLLA) – a variational Gaussian-process reinterpretation of LLA (Ortega et al., 2024a).

  4. Mean-Field Variational Inference (MFVI) – a classic weight-space mean-field variational approximation.

  5. Spectral-Normalized Gaussian Process (SNGP) – a deterministic yet distance-aware GP head for OOD robustness (Bartlett et al., 2017).

  6. Fixed-Mean Gaussian Process (FMGP) – a sparse variational GP with the DNN output as fixed mean (Ortega et al., 2024b).

Together, these methods span the main axis of post-hoc Bayesian deep learning: from local linearizations in weight-space (Laplace family) to function-space Gaussian processes (VaLLA, FMGP, SNGP).

Considered Approaches

In this section, the different methodologies implemented in BayesiPy are reviewed. Please refer to the original works for a better understanding of implementation matters. Table D.1 includes a summarised comparison of the methods included in the library.

Linearized Laplace Approximations (LLA)

The Linearized Laplace Approximation (LLA) (Immer et al., 2021) linearizes the neural network around its maximum-a-posteriori (MAP) parameters \(\bm \theta_{\mathrm{MAP}}\). Let \(f(\mathbf x;\bm \theta)\) denote the network output and \(\ell(\bm \theta) = -\log P(\mathcal{D}| \bm \theta) - \log P(\bm \theta)\) its negative log-posterior. A second-order Taylor expansion yields a local Gaussian posterior:

\begin{equation} P(\bm \theta | \mathcal{D}) \approx \mathcal{N}\big(\bm \theta_{\mathrm{MAP}},\, H^{-1}\big), \quad H = \nabla^2_{\bm \theta} \ell(\bm \theta)\big|_{\bm \theta=\bm \theta_{\mathrm{MAP}}}, \end{equation} where \(H\) is the Hessian or a curvature surrogate (e.g., Gauss–Newton). Predictive uncertainty for an unseen input \(\mathbf x_\ast\) follows a Gaussian distribution: \begin{equation} f(\mathbf x_\ast;\bm \theta) \sim \mathcal{N}(f(\mathbf x_\ast;\bm \theta), J_{\bm \theta_{\mathrm{MAP}}}(\mathbf x_\ast) H^{-1} J_{\bm \theta_{\mathrm{MAP}}}(\mathbf x_\ast)^{\top}). \end{equation}

Accelerated Linearized Laplace Approximation (ELLA)

ELLA (Deng et al., 2022) accelerates inference by approximating the neural tangent kernel (NTK) with a low-rank Nyström decomposition. Taking a subset of size \(M\) of the training data, ELLA constructs approximated features of size \(K\), much lower dimensional than the true Jacobian. BayesiPy implements this procedure, enabling near-real-time uncertainty calibration for large architectures.

Variational Linearized Laplace Approximation (VaLLA)

VaLLA (Ortega et al., 2024a) recasts LLA in function space using sparse Gaussian-process (GP) inference. Rather than storing or inverting \(H\), VaLLA uses the GP equivalence of LLA: \begin{equation} f(x) \sim \mathcal{GP}\big(f(x;\bm \theta_{\mathrm{MAP}}),\, J_{\bm \theta_{\mathrm{MAP}}}(x)\,H^{-1}\,J_{\bm \theta_{\mathrm{MAP}}}(x)^{\top}\big) \end{equation} and performs a sparse variational approximation to the GP using inducing variables \(u=f(Z)\) only for the covariance of the GP. From a theoretical point of view, two different sets of inducing locations are used, where the first set is used to fix the GP posterior mean to the pretrained model. The variational posterior \(Q(\mathbf u)=\mathcal{N}(\mathbf m,\mathbf S)\) minimizes \begin{equation} \mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{Q(f)}[\log P(y|f)] - \mathrm{KL}(Q(\mathbf u)|P(\mathbf u)). \end{equation} This yields calibrated uncertainties with sub-linear dependence on data size, while retaining interpretability. The pre-trained model is kept as the posterior GP mean using the dual formulation of GPs in Hilbert spaces; as a result, \(\mathbf m\) is fixed and not optimized.

Given a test input \(x_\ast\), the predictive distribution under VaLLA follows directly from the sparse variational GP formulation: \begin{align} Q(f_\ast) &= \mathcal{N}\big(f(\mathbf x_\ast; \bm \theta_{\mathrm{MAP}}),\, \sigma_\ast^2\big), \\[3pt] \sigma_\ast^2 &= K_{\mathrm{NTK}}(\mathbf x_\ast, \mathbf x_\ast) + K_{\mathrm{NTK}}(\mathbf x_\ast, \mathbf Z) K_{\mathrm{NTK}}(\mathbf Z, \mathbf Z) ^{-1} K_{\mathrm{NTK}}(\mathbf Z, \mathbf x_\ast)\\[3pt] &- K_{\mathrm{NTK}}(\mathbf x_\ast, \mathbf Z) K_{\mathrm{NTK}}(\mathbf Z, \mathbf Z) ^{-1}\mathbf S K_{\mathrm{NTK}}(\mathbf Z, \mathbf Z) ^{-1}K_{\mathrm{NTK}}(\mathbf Z, \mathbf x_\ast), \end{align} where \(K_{\mathrm{NTK}}\) denotes the NTK kernel \(K_{\mathrm{NTK}}(\mathbf x, \mathbf x') = J_{\bm \theta_{\mathrm{MAP}}}(\mathbf x)\, J_{\bm \theta_{\mathrm{MAP}}}(\mathbf x')^{\top},\) and \(K_{ZZ}\) its Gram matrix over the inducing inputs \(Z\). This expression provides both the mean correction relative to the deterministic network and a closed-form posterior variance, making VaLLA a scalable Bayesianization of neural networks with predictive uncertainty that is consistent with the local curvature of the loss landscape.

Fixed-Mean Gaussian Process (FMGP)

The FMGP (Ortega et al., 2024b) uses the dual formulation of GPs in Hilbert spaces to generalize VaLLA to any kernel function (given some caveats). As a result, only kernel hyper-parameters and variational parameters of a sparse GP are optimized. Because the pre-trained model is kept as the mean of the posterior GP approximation, the base network’s accuracy is preserved while uncertainty is learned post-hoc.

Mean-Field Variational Inference (MFVI)

Given a pre-trained neural network with parameters \(\bm{\mu} = \{\mu_1, \dots, \mu_P\}\), MFVI (Deng & Zhu, 2023) approximates the posterior over weights by a fully factorized Gaussian: \begin{equation} P(\bm \theta|\mathcal{D}) \approx Q(\bm \theta) := \prod_i \mathcal{N}(\theta_i| \mu_i, \sigma_i^2), \end{equation} optimizing the evidence lower bound (ELBO): \begin{equation} \mathcal{L} = \mathbb{E}_{Q(\bm \theta)}[\log P(\mathcal{D}|\bm \theta)] - \mathrm{KL}(Q(\bm \theta)|P(\bm \theta)). \end{equation} Although it underestimates posterior covariance, MFVI provides a cheap Bayesian correction and serves as a baseline for comparison. The set of mean parameters \(\bm{\mu}\) can be either fixed or finetuned. The Flip-out trick is employed for variance-reduced gradient estimation (Wen et al., 2018).

The predictive distribution under MFVI is therefore given by \begin{equation} P(y_\ast | \mathbf x_\ast, \mathcal{D}) = \int P(y_\ast | \mathbf x_\ast, \bm{\theta}) Q(\bm{\theta})\, d\bm{\theta}, \end{equation} which is intractable in closed form for general neural networks. In practice, it is approximated by Monte Carlo integration: \begin{equation} P(y_\ast | x_\ast, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^S P\left(y_\ast | \mathbf x_\ast, \bm{\theta}^{(s)}\right), \quad \bm{\theta}^{(s)} \sim Q(\bm{\theta}). \end{equation} This corresponds to averaging the model outputs over multiple stochastic forward passes, each with independently sampled weights from the variational posterior. For regression with Gaussian likelihood, the predictive mean and variance can be estimated as \begin{align} \mathbb{E}[f_\ast] &\approx \frac{1}{S}\sum_{s=1}^S f(\mathbf x_\ast; \bm{\theta}^{(s)}), \\ \mathrm{Var}[f_\ast] &\approx \frac{1}{S}\sum_{s=1}^S \left(f(\mathbf x_\ast; \bm{\theta}^{(s)}) - \mathbb{E}[f_\ast]\right)^{\!2}. \end{align} These expressions yield a practical, sampling-based approximation to Bayesian model averaging. Despite its strong independence assumptions, MFVI provides an efficient and widely used baseline for uncertainty estimation in neural networks.

Spectral-Normalized Gaussian Process (SNGP)

SNGP (Liu et al., 2023) augments a deterministic neural network with a distance-aware Gaussian process (GP) output layer, enabling calibrated uncertainty estimation and robust out-of-distribution (OOD) detection at deterministic inference cost. The method combines two key components: spectral normalization to enforce Lipschitz continuity of the feature extractor, and a random-feature approximation of a GP head that models predictive uncertainty.

Let \(h(\mathbf x)\) denote the hidden representation from the penultimate layer. SNGP replaces the final linear layer with a GP approximated by random Fourier features: \begin{equation} \Phi(\mathbf x) = \sqrt{\tfrac{2\sigma^2}{D_L}}\cos(-\bm W_L h(\mathbf x) + \bm b_L), \end{equation} where \(\bm W_L \sim \mathcal{N}(0,\tau \bm{I})\) and \(\bm b_L \sim \mathcal{U}(0,2\pi)\). This random-feature expansion approximates a radial basis function (RBF) kernel in feature space. The predictive model is a Bayesian linear regression over \(\Phi(\mathbf x)\): \begin{equation} P(y|\mathbf x) = \mathcal{N}\big(\Phi(\mathbf x)^\top \bm \beta,\, \Phi(\mathbf x)^\top \bm \Sigma_\beta \Phi(x)\big), \end{equation} with prior \(\bm \beta \sim \mathcal{N}(0,I)\) and posterior \(P(\bm \beta | \mathcal{D})\) obtained via a Laplace approximation. Thus \begin{equation} P(\bm \beta | \mathcal{D}) \approx \mathcal{N}(\bm{\beta}|\hat{\bm{\beta}}, \hat{\bm{\Sigma}}) \end{equation} where \(\hat{\bm{\beta}} = \argmax_{\bm{\beta}} P(\bm \beta | \mathcal{D})\), and \(\hat{\bm{\Sigma}}_{(i,j)} = - \frac{\partial^2}{\partial \beta_i \partial \beta_j}\log P(\bm \beta|\mathcal{D})_{|\bm{\beta} = \hat{\bm{\beta}}}\).

During training, spectral normalization is applied to each weight matrix \(\{\bm W_\ell\}_{\ell=1}^{L-1}\) in the model to constrain the feature extractor’s Lipschitz constant and preserve distance information in \(h(\mathbf x)\). Given minibatches \(\{(\mathbf x_m, y_m)\}\), the parameters of both the feature extractor and the random-feature GP head are updated using standard SGD steps.

For a test input \(\mathbf x_\ast\), SNGP computes: \begin{align} \Phi_\ast &= \sqrt{\tfrac{2\sigma^2}{D_L}}\cos(\bm W_L h(\mathbf x_\ast) + \bm b_L), \\[3pt] \mu_\ast &= \Phi_\ast^\top \hat{\bm \beta}, \\[3pt] \sigma_\ast^2 &= \Phi_\ast^\top \hat{\bm \Sigma}_\beta \Phi_\ast, \end{align} and outputs the predictive distribution \begin{equation} p(y_\ast|\mathbf x_\ast) = \int \mathrm{softmax}(m)\, \mathcal{N}\big(m\,\big|\,\mu_\ast,\,\sigma_\ast^2\big)\,\mathrm{d}m\,, \end{equation} for classification problems and the corresponding Gaussian distribution for regression problems.

Example Usage

Below is an example of applying BayesiPy’s FMGP wrapper to a pre-trained PyTorch model. The used kernel is the squared exponential, with inducing locations initialized using K-Means.

import copy
import torch
import numpy as np
from bayesipy.fmgp import FMGP

# Suppose f is your pre-trained PyTorch model
fmgp = FMGP(
    model=copy.deepcopy(f),
    likelihood="regression",
    kernel="RBF",
    inducing_locations="kmeans",
    num_inducing=50,
)

fmgp.fit(iterations=3000, lr=1e-3, train_loader=train_loader)
mean, var = fmgp.predict(torch.tensor(X_test, dtype=torch.float32))

For Linearized Laplace methods, the library allows importing methods from the Laplace Library (Immer et al., 2021), ELLA and VaLLA.

from bayesipy.laplace import Laplace, ELLA, VaLLA

lap = Laplace(model=f,
              approximation="kron",
              subset_of_weights="last_layer",
              likelihood="classification")

lap.fit(train_loader)
pred_mean, pred_covariance = lap.predict(test_loader)
Table D.1: Comparison of techniques. Insights adapted from the Laplace approximation library (Immer et al., 2021) and recent works such as (Ortega et al., 2024b).
Method Pros Cons Use Case
Full Laplace Most faithful local Gaussian approximation to MAP parameters; captures correlations in all parameters. Extremely memory- and computation-heavy; not feasible for large networks. Good for small-to-medium networks or when maximizing fidelity is paramount.
Layerwise / Subnet Balances coverage (whole network or a subset) with computational feasibility; can approximate key layers. More complex to configure (which layers to include? which blocks?); some approximation needed for the Hessian. For medium-to-large networks if you need more coverage than last-layer but cannot afford full Laplace.
Last-Layer Laplace Very fast post-hoc correction; no retraining, just a Hessian-based Gaussian around the final layer. Focuses only on last-layer uncertainty; can miss uncertainties originating in earlier layers. A quick Bayesian “upgrade” that often fixes overconfidence on moderate tasks.
ELLA Scalable variant of Laplace using low-rank or Nyström approximations. Hyperparameters for kernel approximation need tuning. Large-scale scenarios where even standard LLA is too costly.
VaLLA Excellent calibration (function-space GP perspective) with sub-linear complexity in data size. Requires iterative variational optimization; can be slower to converge; more complicated inference step. High-fidelity uncertainty for large datasets or critical applications.
MFVI Classic fully Bayesian approach over weights; easy to implement (Bayes by Backprop). Mean-field assumption often underestimates uncertainty; can be very slow or memory-heavy for large networks. For those wanting a “full BNN” approach or partial Bayesian layers with factorized posteriors.
SNGP Distance-aware; single forward pass at inference; strong OOD detection. Typically requires training from scratch or at least heavy fine-tuning with spectral normalization; not purely post-hoc. Production-friendly if you can integrate spectral norms and a GP head early on.
FMGP High-quality calibration; scalable to large data; easy to wrap any pre-trained model (fixed mean). Extra training to fit the GP’s variational parameters; number of inducing points is a hyperparameter that can affect memory usage. Post-hoc method offering advanced Bayesian-quality uncertainty for large-scale tasks.