3

Deep Variational Implicit Processes

Building on the theoretical background introduced in the previous chapter—particularly the Bayesian treatment of learning and the role of Gaussian process priors—this chapter explores a practical and scalable realization of function-space inference: the Deep Variational Implicit Process (DVIP). This framework extends the idea of implicit stochastic processes to hierarchical, deep architectures, allowing one to perform variational inference directly in function space without requiring explicit parametric densities. The result is a principled Bayesian approach that combines the flexibility of deep networks with the calibration and uncertainty-awareness of Gaussian processes.

Introduction

The Bayesian approach has become popular for capturing the uncertainty associated with the predictions made by models that otherwise provide point-wise estimates, such as NNs (Gal, 2016; Gelman et al., 2013; Murphy, 2012). However, when carrying out Bayesian inference, obtaining the posterior distribution in the space of parameters can become a limiting factor since it is often intractable. Symmetries and strong dependencies between parameters make the approximate inference problem much more complex. This is precisely the case in large deep NNs. Nevertheless, all these issues can be alleviated by carrying out approximate inference in the space of functions, which presents certain advantages due to the simplified problem. This makes the approximations obtained in this space more precise than those obtained in parameter space, as shown in the literature (Ma et al., 2019; Ma & Hernández-Lobato, 2021; Santana et al., 2021; Sun et al., 2019).

A recent method for function-space approximate inference is the Variational Implicit Process (VIP) (Ma et al., 2019). VIP considers an Implicit Process (IP) as the prior distribution over the target function. IPs constitute a very flexible family of priors over functions that generalize Gaussian processes (Ma et al., 2019). Specifically, IPs are processes that may lack a closed-form expression but that are easy to sample from. Examples include Bayesian neural networks (BNN), neural samplers, and warped GPs, among others (Santana et al., 2021). Nevertheless, the posterior process of an IP is intractable most of the time (except in particular cases as in GPs). VIP addresses this issue by approximating the posterior using the posterior of a GP with the same mean and covariances as the prior IP. Thus, the approximation used in VIP results in a Gaussian predictive distribution, which may be too restrictive.

Recently, the concatenation of stochastic processes has been used to produce models of increased flexibility. An example is DGPs in which a GP is used as the input of another GP, systematically (Damianou & Lawrence, 2013). Based on the success of DGPs, it is natural to consider the concatenation of IPs to extend their capabilities in a similar fashion to DGPs. Therefore, DVIP is introduced, a multi-layer extension of VIP that provides increased expressive power, enables more accurate predictions, gives better calibrated uncertainty estimates, and captures more complex patterns in the data. Each layer contains several IPs that are approximated using VIPs. Importantly, the flexibility of the IP-based prior formulation enables numerous models as the prior over functions, leveraging the benefits of e.g. convolutional NNs, that increase the performance on image datasets. Critically, DVIPs can adapt the prior IPs to the observed data, resulting in improved performance. When GP priors are considered, DVIPs are equivalent to a DGP. Thus, it can be seen as a generalization of DGPs.

Approximate inference in DVIPs is done via VI; where computational scalability is achieved in each unit using a linear approximation of the GP that approximates the prior IP, as in VIP (Ma et al., 2019). The predictive distribution of a VIP is Gaussian. However, since the inputs in the second and following layers are random in DVIPs, the final predictive distribution is non-Gaussian. This predictive distribution is intractable. Nevertheless, one can easily sample from it by propagating samples through the IP network. This also enables a Monte Carlo approximation of the VI objective which can be optimized using stochastic techniques, as in DGPs (Salimbeni & Deisenroth, 2017). Generating the required samples is straightforward given that the variational posterior depends only on the output of the previous layers. This results in an iterative sampling procedure that can be conducted in a scalable manner. Importantly, the direct evaluation of covariances is not needed in DVIPs, further reducing its cost compared to that of DGPs. The predictive distribution is a mixture of Gaussians (non-Gaussian), more flexible than that of VIP.

DVIPs are evaluated in several experiments, both in regression and classification. They show that DVIPs outperform a single-layer VIP with a more complex IP prior. DVIPs give results similar to and often better than those of DGPs (Salimbeni & Deisenroth, 2017), while having a lower cost and improved flexibility (due to the more general IP prior). The conducted experiments also show that adding more layers in DVIPs does not overfit and often improves results.

Implicit Processes

The Gaussian assumption underlying classical GPs is arguably restrictive in modern machine–learning settings that involve highly non-linear, strongly heteroscedastic, or structured outputs. Implicit Processes (IPs) proposed by Ma et al. (2019) constitute a broad non-parametric family whose sample paths are obtained by transforming a base stochastic process through a measurable mapping. Formally,

Definition 3.1 Implicit process. Let \((\Omega,\mathcal F,P)\) be a probability space and \(g:\Omega\times\mathcal X\to\mathbb R\) a measurable map that is jointly continuous in the second argument. The collection of random variables \(\bigl\{f(\mathbf x)=g(\omega,\mathbf x):\mathbf x\in\mathcal X\bigr\}\) is an implicit process, written as \(f \sim \operatorname{IP}(g, P)\).

Typical examples of implicit processes include neural networks and warped GPs. Figure 3.1 shows graphical representations of these two cases.

  1. Neural Networks: where \(g\) is the function that represents the neural network, \(\omega\) is the set of weights and \(\omega\sim P\) a distribution over them.

  2. Warped GPs: where \(g(\omega,\mathbf x)=\tau(f_{0}(\mathbf x))\) with \(f_{0}\sim\mathcal{GP}(0,K)\) and \(\tau:\mathbb R\to\mathbb R\) a non-linear warping. Where the stochasticity of the process usually comes solely from the GP.

image image

Figure 3.1: Graphical fragments of two prototypical implicit‐process priors. Left: A fully-connected neural-network. Circles (e.g. \(H_{1}\)\(H_{3},f\)) are latent random variables; squares are deterministic inputs or biases. Black arrows indicate conditional mappings, and a tiny green bell curve is super-imposed on each edge to emphasize the underlying Gaussian component of the stochastic mapping. Right: A warped Gaussian process. The latent GP \(f_{0}\sim\mathrm{GP}(0,\kappa)\) receives the input \(\mathbf x\) and is deterministically transformed by a non-linear warping \(\tau:\mathbb R\to\mathbb R\), yielding the non-Gaussian output \(f=\tau(f_{0}(\mathbf x))\). The label “GP’’ highlights that only the first stage is Gaussian; the composite process inherits the flexibility of the warp.

Unlike GPs, an IP generally lacks closed-form finite-dimensional distributions, yet Kolmogorov’s extension theorem (Kolmogorov, 1956) guarantees the existence of a unique probability measure on the path space provided the consistency axioms hold. In practice, IPs are manipulated through samples; the absence of an explicit density requires approximate inference.

Theorem 3.2 Kolmogorov Extension Theorem (Kolmogorov, 1956). Let \(\mathcal X\) be an arbitrary index set. For every finite subset \(\mathcal J\subset\mathcal X\) let \(P_{\mathcal J}\) be a probability measure on \(\bigl(\mathbb R^{\mathcal J},\,\mathcal B(\mathbb R)^{\otimes\mathcal J}\bigr)\) such that for all finite \(\mathcal I\subset\mathcal J\subset\mathcal X\), \begin{equation} \pi_{\mathcal I\mathcal J\,\sharp}\,P_{\mathcal J}=P_{\mathcal I}, \qquad \pi_{\mathcal I\mathcal J}:\mathbb R^{\mathcal J}\longrightarrow\mathbb R^{\mathcal I} \text{ the natural coordinate projection.} \end{equation} Then there exists a unique probability measure \(P\) on \(\bigl(\mathbb R^{\mathcal X},\,\mathcal B(\mathbb R)^{\otimes\mathcal X}\bigr)\) whose finite-dimensional marginals coincide with the given family \(\{P_{\mathcal J}\}_{\mathcal J\subset\mathcal X}\).

Recall the definition of an implicit process: \(f(\mathbf x)=g(\omega,\mathbf x)\) with \(\omega\sim P\) and measurable \(g:\Omega\times\mathcal X\to\mathbb R\). For any finite \(\mathcal J=\{\mathbf x_{1},\dots,\mathbf x_{k}\}\) define the random vector \begin{equation} \mathbf f_{\mathcal J}(\omega) :=\bigl(g(\omega,\mathbf x_{1}),\dots,g(\omega,\mathbf x_{k})\bigr) \in\mathbb R^{\mathcal J}. \end{equation} Its distribution is \(P_{\mathcal J}:=P\circ\mathbf f_{\mathcal J}^{\, -1}\), well-defined because \(g\) is measurable.

Lemma 3.3. The family \(\{P_{\mathcal J}\}_{\mathcal J\subset\mathcal X}\) satisfies the Kolmogorov Extension Theorem; that is, for every pair of finite index sets \(\mathcal I\subset\mathcal J\) it verifies that \(\pi_{\mathcal I\mathcal J\,\sharp}P_{\mathcal J}=P_{\mathcal I}\).

Proof

For a given finite set \(\mathcal J=\{\mathbf x_{1},\dots,\mathbf x_{k}\}\subset\mathcal X\), recall that \begin{equation} \mathbf f_{\mathcal J}:\Omega\longrightarrow\mathbb R^{\mathcal J}, \qquad \mathbf f_{\mathcal J}(\omega) =\bigl(g(\omega,\mathbf x_{1}),\dots,g(\omega,\mathbf x_{k})\bigr). \end{equation} By construction \(g\) is measurable in \(\omega\), so \(\mathbf f_{\mathcal J}\) is a measurable map between \((\Omega,\mathcal F)\) and \(\bigl(\mathbb R^{\mathcal J},\mathcal B(\mathbb R)^{\otimes\mathcal J}\bigr)\). Set \(P_{\mathcal J}:=P\, \circ\, \mathbf f_{\mathcal J}^{-1}\). For a fixed finite set \(\mathcal I\subset\mathcal J\), the canonical projection \(\pi_{\mathcal I\mathcal J}:\mathbb R^{\mathcal J}\to\mathbb R^{\mathcal I}\) verifies that for any \(y=(y_x)_{x\in\mathcal J}\in\mathbb R^{\mathcal J}\), \begin{equation} \pi_{\mathcal I\mathcal J}(y) :=(y_x)_{x\in\mathcal I}\;\in\mathbb R^{\mathcal I}. \end{equation} Observe that applying \(\pi_{\mathcal I\mathcal J}\) to \(\mathbf f_{\mathcal J}(\omega)\) simply discards the components with indices in \(\mathcal J\setminus\mathcal I\): \begin{equation} \pi_{\mathcal I\mathcal J}\bigl(\mathbf f_{\mathcal J}(\omega)\bigr) =\bigl(g(\omega,\mathbf x)\bigr)_{\,\mathbf x\in\mathcal I} =\mathbf f_{\mathcal I}(\omega), \qquad\forall\,\omega\in\Omega. \end{equation} Hence, as pointwise maps, \begin{equation} \pi_{\mathcal I\mathcal J}\circ\mathbf f_{\mathcal J} =\mathbf f_{\mathcal I}. \end{equation} Using the definition of push-forward measures (\(f_\sharp Q:=Q\circ f^{-1}\)), it verifies that \begin{equation} \pi_{\mathcal I\mathcal J\,\sharp}P_{\mathcal J} =\pi_{\mathcal I\mathcal J\,\sharp}\bigl(P\circ\mathbf f_{\mathcal J}^{-1}\bigr) =P\circ\bigl(\mathbf f_{\mathcal J}\bigr)^{-1}\circ \pi_{\mathcal I\mathcal J}^{-1} =P\circ\bigl(\pi_{\mathcal I\mathcal J}\circ\mathbf f_{\mathcal J}\bigr)^{-1} =P\circ\bigl(\mathbf f_{\mathcal I}\bigr)^{-1} =P_{\mathcal I},. \end{equation} which is exactly the desired consistency identity. Since the argument holds for every finite \(\mathcal I\subset\mathcal J\subset\mathcal X\), the family \(\{P_{\mathcal J}\}\) is consistent, completing the proof.

There exists a unique probability measure \(P_{\, \mathrm{IP}}\) on \(\mathbb R^{\mathcal X}\) such that the finite-dimensional distributions of the implicit process \(f(\cdot)\) coincide with \(\{P_{\mathcal J}\}\). Consequently, \(f\) is a stochastic process.

Variational Implicit Processes

VIPs build a GP prior distribution based on samples taken from the implicit process formulation. The resulting construction relies on a set of linear regression coefficients that encode how the IP samples are combined. More precisely, given \(S\) i.i.d. draws \(\{f_{s}\}_{s=1}^{S}\sim\mathrm{IP}(g, P)\), empirical features can be created as: \begin{equation} \hat{\boldsymbol\phi}(\mathbf x)=\frac{1}{\sqrt{S}}\bigl(f_{0}(\mathbf x) -\hat m(\mathbf x), \dots , f_{S}(\mathbf x) -\hat m(\mathbf x)\bigr)^T \end{equation} with empirical mean \(\hat m(\mathbf x):=S^{-1}\sum_{s}f_{s}(\mathbf x)\). Define a Bayesian linear model using those centered features as \(f(\mathbf x)=\hat m(\mathbf x)+\hat{\boldsymbol\phi}(\mathbf x)^{\mathsf T}\mathbf a\), \(\mathbf a\sim\mathcal N(\mathbf0,\mathbf I)\). The resulting underlying prior model is then a GP with mean and covariance functions defined as \begin{equation} m^\star(\mathbf{x}) = \frac 1 S \sum_{s=1}^{S} f_s(\mathbf{x})\,, \quad \quad \mathcal{K}^\star(\mathbf{x}_1, \mathbf{x}_2) = \bm \phi(\mathbf{x}_1)^\text{T} \bm \phi(\mathbf{x}_2)\,. \end{equation} The true posterior \(P(\bm a | \cal{D})\) is intractable in classification problems and has a computational cost of \(\mathcal{O}(N S^2)\) for regression. However, VI can be used to optimize a variational posterior \(Q(\bm a)\) by maximizing the ELBO. \begin{equation} \mathcal L_{\mathrm{VIP}} =\mathbb E_{Q(\mathbf f)}[\log P(\mathbf y|\mathbf f)] -\mathrm{KL}[Q(\mathbf a)|P(\mathbf a)], \qquad Q(\mathbf a)=\mathcal N(\boldsymbol\mu_{\mathbf a},\mathbf S_{\mathbf a})\,, \label{eq:vip_elbo} \end{equation} by stochastic gradient ascent using mini-batches of size \(B\); each iteration scales as \(\mathcal O(BS^{2})\). The optimal \((\boldsymbol\mu_{\mathbf a},\mathbf S_{\mathbf a})\) minimizes the KL divergence between the surrogate and the intractable IP posterior. Empirically, VIP delivers calibrated predictive uncertainty on regression and Bayesian optimization benchmarks with \(S\approx 100\) features (Ma et al., 2019).

Critically, the samples \(f_s(\mathbf{x})\) keep the dependence w.r.t. the IP prior parameters \(\bm \omega\) (e.g. the BNN parameters), which enables prior adaptation to the observed data in VIP (Ma et al., 2019). Given \(Q(\bm a)\), it is straightforward to make predictions. A limitation is, however, that the predictive distribution for \(y\) is Gaussian in regression. More precisely, given a test point \(\mathbf{x}^\star\), the induced distribution over function evaluations \(Q(\mathbf f)\) is Gaussian given by \begin{equation} Q(\mathbf{f}^\star) = \mathcal{N}\big(m(\mathbf{x}^\star) + \phi(\mathbf{x}^\star)^T \bm m\,, \bm \phi(\mathbf{x}^\star) ^T \bm S \bm \phi(\mathbf{x}^\star)\big)\,, \end{equation} which is straightforward given the linear approximation \(f(\mathbf{x}) = \bm \phi(\mathbf{x})^\text{T} \bm{a} + m^\star(\mathbf{x})\) and the variational distribution \(Q(\bm{a}) \sim \mathcal{N}(\bm m, \bm S)\). Using a Gaussian likelihood \(P(y | \mathbf{f}) = \mathcal{N}(0, \sigma^2)\), it leads to the predictive distribution \begin{equation} \begin{aligned} P(y^\star|\mathbf{x}^\star) &= \int P(y^\star|\mathbf{f}^\star) Q(\mathbf{f}^\star) \ \mathrm d \mathbf f ^\star\\ &= \mathcal{N}\big(y^\star; m(\mathbf{x}^\star) + \phi(\mathbf{x}^\star)^T \bm m\,, \bm \phi(\mathbf{x}^\star) ^T \bm S \bm \phi(\mathbf{x}^\star) + \sigma^2 \big)\,. \end{aligned} \end{equation} Although a Gaussian predictive distribution has the advantage of being tractable in closed form, the main drawback is that Gaussian distributions may not be flexible enough to explain complex data patterns.

Computational Complexity

If not for the linear regression approximation, the computational complexity of VIPs would match that of standard GPs, being cubic on training and quadratic on prediction. However, this approach is inapplicable to datasets with thousands of points.

The usage of the linear approximation simplifies the training complexity using stochastic optimization and mini-batches to be \(\mathcal{O}(BS^2D + BSC)\) where \(B\) denotes the size of the mini-batch, \(S\) the number of prior samples, \(D\) the data targets dimensionality, and \(C\) the cost of taking a prior sample. In most cases, the second term (linear on \(S\)) would be much smaller than the quadratic term. However, given that in practice \(S\) is relatively small, using more complex prior functions would lead to the linear term being the most significant.

Deep Variational Implicit Processes

Although sparse Gaussian process approximations alleviate the \(\mathcal O(N^{3})\) computational burden, a single kernel with a fixed similarity metric rarely captures the heterogeneous structure present in real–world data. One remedy is to compose multiple GPs in a feed-forward hierarchy, yielding a deep Gaussian process (DGP) (Damianou & Lawrence, 2013). Such compositions adapt the notion of similarity at each depth, but introduce strong nonlinearities and destroy analytic tractability—the predictive distribution need no longer be Gaussian. Approximate inference therefore relies on (i) expectation propagation with sparse inducing variables (Bui et al., 2016) or (ii) doubly-stochastic variational Bayes, which couples Monte–Carlo sampling with stochastic optimization (Salimbeni & Deisenroth, 2017).

Figure 3.2: Fully-connected Deep Variational Implicit Process (DVIP). An input vector \(\mathbf x\) is successively transformed by \(L\) hidden layers, each comprising \(H_l\) independent implicit-process (IP) units. Within a layer, every unit receives the entire output of the previous layer, so the composite mapping \(\mathbf x \mapsto \mathbf{f}^L(\mathbf x)\) forms a hierarchical, non-parametric prior whose representation capacity grows with depth.

The IP prior of Ma et al. (2019) can be elevated to a deep setting. A deep implicit process (DIP) concatenates \(L\) independent IP layers, thereby defining a hierarchical, non-parametric prior over latent functions; the resulting model is termed a deep variational implicit process (DVIP) once variational inference is applied. Figure 3.2 depicts the fully connected architecture considered in this work.

Let \(H_{l}\) be the width of layer \(l\). Denote by \(\mathbf f^{\,l}_{\cdot,n} \in \mathbb R^{H_{l}}\) the layer-\(l\) activation for input \(\mathbf x_{n}\), and write \(f^{\,l}_{h}(\cdot)\) for the \(h\)-th unit at that layer.

Definition 3.4 Deep implicit process. Fix \(L\in\mathbb N\) and widths \((H_{1},\dots,H_{L})\in(\mathbb N^{L}\). A deep implicit process is the collection of random variables \(\{f^{\,l}_{h,n} :\, l=1,\dots,L,\; h=1,\dots,H_{l},\; n=1,\dots,N\}\) defined recursively by \begin{equation} \mathbf f^{\,0}_{\cdot,n}=\mathbf x_{n},\qquad f^{\,l}_{h,n}=f^{\,l}_{h}(\mathbf f^{\,l-1}_{\,\cdot,n})\,, \end{equation} where each latent function \(f^{\,l}_{h}(\cdot) \sim \mathcal{IP}(g^{\,l}_{h}, P^{\,l}_{h})\) is an independent implicit process driven by generator \(g^{\,l}_{h}\) and latent distribution \(P^{\,l}_{h}\).

As in VIPs, GP approximations are considered for all the IPs in the deep IP prior defined above. This provides an expression for \(f^{l}_{h,n}\) given the previous layer’s output \(\mathbf{f}^{\, l-1}_{\cdot,n}\) and \(\bm{a}^l_h\), the coefficients of the linear model for the unit \(h\) at layer \(l\). Namely, \begin{equation} f^l_{h,n} = \bm \phi_h^l(\mathbf{f}^{\, l-1}_{\cdot,n})^T \bm{a}^l_h + m_{h,l}^\star(\mathbf{f}^{\, l-1}_{\cdot,n})\,, \end{equation} where \(\bm \phi_h^l(\cdot)\) and \(m_{h,l}^\star(\cdot)\) depend on the prior IP parameters \(\bm{\theta}_h^l\). To increase the flexibility of the model, latent Gaussian noise is added around each inner \(f^l_{h,n}\) with variance \(\sigma^2_{l, h}\) for \(l \in \{1,\dots,L-1\}\), that is, all inner layers. From now on \(\bm \sigma^2_l\) will denote the set of unit noises at layer \(l\), thus, \(\bm \sigma^2_l = \{\sigma^2_{l, h}\}_{h=1}^{H_l}\). As a result of these approaches, \(P(f^l_{h,n}|\mathbf{f}^{\, l-1}_{\cdot,n},\bm{a}^l_h)\) is Gaussian with a linear mean in terms of \(\bm{a}_h^l\): More precisely, \(\forall l \in \{1,\dots,L-1\}\), \begin{equation} P(f^l_{h,n}| \mathbf{f}^{\, l-1}_{\cdot,n},\bm{a}^l_h) = \mathcal{N}\left(f^{\, l}_{h,n}|\, m_{h,l}^\star(\mathbf{f}^{\, l-1}_{\cdot,n}) + \bm{\phi}_h^l(\mathbf{f}^{\, l-1}_{\cdot,n})^\text{T} \bm{a}^l_h\, , \sigma^2_{l,h} \right) \end{equation} and linearly deterministic for the last layer \begin{equation} f^L_{h,n} = m_{h,L}^\star(\mathbf{f}^{\, L-1}_{\cdot,n}) + \bm{\phi}_h^L(\mathbf{f}^{\, L-1}_{\cdot,n})^\text{T} \bm{a}^L_h\,. \end{equation} Let \(\bm{a}^l = \{\bm{a}_1^l,\ldots,\bm{a}^l_{H_l}\}\) and \(\mathbf{f}^{\, l} = \{\mathbf{f}^{\, l}_{\cdot,1},\ldots,\mathbf{f}_{\cdot,N}^{\, l}\}\). The joint distribution of all the variables (observed and latent) in DVIP is \begin{equation} \label{eq:dvip_model} P(\mathbf{y}, \{\mathbf{f}^{\, l}, \bm{a}^l\}_{l=1}^L) = \underbrace{\prod_{n=1}^N P(y_n | \mathbf{f}_{\cdot,n}^{\, L})}_{\text{Likelihood}} \underbrace{\prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} P(f^{\, l}_{h,n} | \mathbf{f}^{\, l-1}_{\cdot,n}, \bm{a}^l_h)P(\bm{a}^l_h)}_{\text{DVIP prior}}\,, \end{equation} where the prior for the linear coefficients is set to a standard Gaussian, i.e, \(P(\bm{a}^l_h) = \mathcal{N}(\bm{a}^l_h|\mathbf{0},\mathbf{I})\). It may seem that the prior assumes independence across points. However, dependencies arise by marginalizing out each \(\bm{a}_h^l\), which is tractable since the model is linear in \(\bm{a}_h^l\) (this marginalization can be seen as a convolution of Gaussians).

The posterior over the latent variables \(P\left(\{\mathbf{f}^{\, l}, \bm{a}^l\}_{l=1}^L|\mathbf{y} \right)\) is approximated using a similar approximation to that of deep GPs (Salimbeni & Deisenroth, 2017). This approximation has a fixed part and a tunable part, simplifying dependencies among layer units, but maintaining dependencies between layers: \begin{align} Q(\{\mathbf{f}^{\,l}, \bm{a}^l\}_{l=1}^L) = \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} P(f^l_{h,n} | \mathbf{f}^{\, l-1}_n, \bm{a}^l_h)Q(\bm{a}^l_h)\,,\quad Q (\bm{a}^l_h) = \mathcal{N}(\bm{a}^l_h| \bm{m}^l_h,\bm{S}_h^l)\,, \label{eq:approx_q_dvip} \end{align} where the factors \(P(f^l_{h,n} | \mathbf{f}^{\, l-1}_n, \bm{a}^l_h)\) are fixed to be the same factors as those in Equation \(\eqref{eq:dvip_model}\), and the factors \(Q(\bm{a}^l_h)\) are the ones being specifically tuned. Computing \(Q(\mathbf{f}^{\,L}_{\cdot,n})\), as specified by Equation \(\eqref{eq:approx_q_dvip}\) is intractable. However, one can easily sample from \(Q(\mathbf{f}^{\,L}_{\cdot,n})\) by propagating samples through the network.

Proposition 3.5. Let \(\bm \Omega = \{\bm{m}_h^l,\bm{S}_h^l: l=1,\ldots,L\,, h=1,\ldots,H_l\}\) denote the parameters of \(Q\) and \(\bm \Theta = \{\bm \theta_h^l: l=1,\ldots,L\,, h=1,\ldots,H_l\}\) the deep IP prior parameters. Then, a variational lower bound can be derived; at whose maximum the KL divergence between \(Q(\{\mathbf{f}^{\,l}, \bm{a}^l\}_{l=1}^L)\) and \(P\left(\{\mathbf{f}^{\,l}, \bm{a}^l\}_{l=1}^L|\cal{D} \right)\) is minimized \begin{equation} \mathcal{L}\left(\bm \Omega,\bm \Theta, \{\bm \sigma_l^2\}_{l=1}^{L-1}\right) = \sum_{n=1}^N \mathbb{E}_{Q(\mathbf{f}_{\cdot,n}^{\, L})} \big[ \log P\big(y_n| \mathbf{f}^{\, L}_{\cdot,n}\big)\big] -\sum_{l=1}^L \sum_{h=1}^{H_l} \text{KL}\big(Q(\bm{a}_h^l) | P(\bm{a}_h^l)\big)\,. \end{equation}

Proof

Consider the general definition of the ELBO in variational inference, i.e. \begin{equation} \mathcal{L}\left(\bm \Omega,\bm \Theta, \{\bm \sigma_l^2\}_{l=1}^{L-1}\right) = \mathbb{E}_{Q(\{\mathbf F^{\,l}, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{P\left(\mathbf y, \{\mathbf F^{\, l}, \mathbf{a}^l\}_{l=1}^L \right)}{Q(\{\mathbf F^{\, l}, \mathbf{a}^l\}_{l=1}^L)} \right]\,, \end{equation} using DVIPs model specification: \begin{align} P\big(\mathbf y, \{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L \big) &= \prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} P(f^l_{h,n} |\mathbf{a}^l_h)P(\mathbf a^l_h)\,,\\ Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L) &= \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} P(f^l_{h,n} |\mathbf{a}^l_h)Q(\mathbf a^l_h)\,, \end{align} the variational lower bound takes the following form: \begin{align} \mathcal{L} &= \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{P\left(\mathbf y, \{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L \right)}{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \right]\\ &= \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{\prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} \cancel{P(f^l_{h,n} |\mathbf{a}^l_h)}P(\mathbf a^l_h)}{\prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} \cancel{P(f^l_{h,n} |\mathbf{a}^l_h)}Q(\mathbf a^l_h)} \right]\\ &= \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{\prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \prod_{l=1}^L \prod_{h=1}^{H_l}P(\mathbf a^l_h)}{\prod_{l=1}^L \prod_{h=1}^{H_l} Q(\mathbf a^l_h)} \right]\,. \end{align} From this expression, the expectation can be split into two terms \begin{equation} \mathcal{L} = \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \right] + \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{ \prod_{l=1}^L \prod_{h=1}^{H_l}P(\mathbf a^l_h)}{\prod_{l=1}^L \prod_{h=1}^{H_l} Q(\mathbf a^l_h)} \right]\,. \end{equation} The logarithm in the first term does not depend on the regression coefficients \(\{\bm A^l \}_{l=1}^L\) and neither on \(\{\bm F^l \}_{l=1}^{L-1}\). On the other hand, the logarithm in the second term does not depend on \(\{\bm F^l \}_{l=1}^{L}\) . Thus, \begin{align} \mathcal{L} &= \mathbb{E}_{Q(\mathbf F^L)} \left[ \log \prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \right] + \mathbb{E}_{Q( \{\mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{ \prod_{l=1}^L \prod_{h=1}^{H_l}P(\mathbf a^l_h)}{\prod_{l=1}^L \prod_{h=1}^{H_l} Q(\mathbf a^l_h)} \right]\\ &=\sum_{n=1}^N \mathbb{E}_q \big[ \log P\big(y_n|\mathbf{f}^L_{\cdot,n}\big)\big] - \sum_{l=1}^L \sum_{h=1}^{H_l} \text{KL}\big(Q(\mathbf{a}_h^l) \big \rvert P(\mathbf{a}_h^l)\big)\,. \end{align}

In the above proposition, \(\mathrm{KL}(Q(\bm{a}_h^l)|P(\bm{a}_h^l))\) involves Gaussian distributions and can be evaluated analytically. The expectations \(\mathbb{E}_{Q(\mathbf{f}_{\cdot,n}^L)} \left[ \log P(y_n|\mathbf{f}^{\, L}_{\cdot,n})\right]\) are intractable. However, they can be approximated via Monte Carlo, using the reparameterization trick (Kingma & Welling, 2014). Moreover, \(\sum_{n=1}^N \mathbb{E}_{Q(\mathbf{f}_{\cdot,n}^{\, L})} \left[ \log P(y_n|\mathbf{f}^{\, L}_{\cdot,n})\right]\) can be approximated using mini-batches of size \(B\): \begin{equation} \label{eq:dvip_elbo_mini} \mathcal{L}\left(\bm \Omega,\bm \Theta, \{\bm \sigma_l^2\}_{l=1}^{L-1}\right) \approx \frac{N}{B}\sum_{b=1}^B \mathbb{E}_{Q(\mathbf{f}_{\cdot,b}^{\, L})} \big[ \log P\big(y_b| \mathbf{f}^{\, L}_{\cdot,b}\big)\big] -\sum_{l=1}^L \sum_{h=1}^{H_l} \text{KL}\big(Q(\bm{a}_h^l) | P(\bm{a}_h^l)\big)\,. \end{equation} The consequence is that Equation \(\eqref{eq:dvip_elbo_mini}\) can be maximized w.r.t. \(\bm \Omega\), \(\bm \Theta\) and \(\{\bm \sigma_l^2\}_{l=1}^{L-1}\), using stochastic optimization methods. Maximization w.r.t. \(\bm \Theta\) allows for prior adaptation to the observed data, which is a key factor when considering IP priors.

Sampling from the Marginal Posterior Approximation

The approximation of the variational ELBO in Equation \(\eqref{eq:dvip_elbo_mini}\) requires samples from \(Q(\mathbf{f}^L_{\cdot,n})\) for all the instances in a mini-batch. This marginal only depends on the variables of the inner layers and units \(f_{h,n}^l\) corresponding to the \(n^{th}\) instance. Therefore, samples from \(Q(\mathbf{f}^L_{\cdot,n})\) can be drawn by recursively propagating samples from the first to the last layer, using \(\mathbf{x}_n\) as the network input. Specifically, \(Q(f^l_{h,n}|\mathbf{f}^{\, l-1}_{\cdot,n})\) is Gaussian as \begin{equation} \begin{aligned} Q(f^l_{h,n}| \mathbf{f}^{\, l-1}_{\cdot,n}) &= \int P(f^l_{h,n}| \mathbf{f}^{\, l-1}_{\cdot,n}, \bm{a}^l_h) Q(\bm{a}_h^l) \, \mathrm{d} \bm{a}_h^l \\ &= \mathcal{N}\left(f^l_{h,n}|\, \hat{m}^l_{h,n}(\mathbf{f}^{\, l-1}_{\cdot,n}), \hat{v}^l_{h,n}(\mathbf{f}^{\, l-1}_{\cdot,n}) \right)\,. \end{aligned} \end{equation} with mean and variance: \begin{equation} \label{eq:pred_gauss_layer} \begin{aligned} \hat{m}^l_{h,n}(\mathbf{f}^{\, l-1}_{\cdot,n}) &= \bm{\phi}_h^l(\mathbf{f}^{\, l-1}_{\cdot,n})^\text{T} \bm{m}^l_h + m_{h,l}^\star(\mathbf{f}^{\, l-1}_{\cdot,n})\,,\\ \hat{v}^l_{h,n}(\mathbf{f}^{\, l-1}_{\cdot,n}) &= \bm{\phi}_h^l(\mathbf{f}^{\, l-1}_{\cdot,n})^\text{T} \bm{S}^l_h \bm{\phi}_h^l(\mathbf{f}^{\, l-1}_{\cdot,n}) + \sigma^2_{l,h}\mathbb{I}[l \neq L]\,, \end{aligned} \end{equation} where \(\bm{m}^l_h\) and \(\bm{S}^l_h\) are the parameters of \(Q(\bm{a}_h^l)\). This decomposition can be easily shown as the joint variational posterior is \begin{equation} \begin{aligned} Q(\{\mathbf F^l\}_{l=1}^L) &=\prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_L} Q(f^l_{h,n}|\mathbf{f}^{\, l-1}_{\cdot,n})\\ &= \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_L} \int P(f^l_{h,n}|\mathbf{f}^{\, l-1}_{\cdot,n}, \mathbf{a}^l_h) Q(\mathbf{a}_h^l) \, \mathrm d \mathbf{a}_h^l\\ &= \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_L} \mathcal{N}\left(f_{h,n}^l|\hat{m}_{h,n}^l(\mathbf{f}_{\cdot,n}^{l-1}), \hat{v}_{h,n}^l(\mathbf{f}_{\cdot,n}^{l-1})\right)\,. \end{aligned} \end{equation} As a result, the \(n^{th}\) marginal of the final layer depends only on the \(n^{th}\) marginals of the other layers. That is, \begin{equation} Q( f_n^L) = \int \prod_{l=1}^L \prod_{h=1}^{H_L} \mathcal{N}\left(f_{h,n}^l|\hat{m}_{h,n}^l(\mathbf{f}_{\cdot,n}^{\, l-1}), \hat{v}_{h,n}^l(\mathbf{f}_{\cdot,n}^{\, l-1})\right) \mathrm d \mathbf{f}_{\cdot,n}^{\, 1}, \dots, \mathrm d \mathbf{f}_{\cdot,n}^{\, L-1}\,. \end{equation} Initially, consider \(l=1\). Setting \(\hat{\mathbf{f}}^{\, 0}_{\cdot,n}=\mathbf{x}_n\) and generating a sample from \(Q(f^l_{h,n}|\hat{\mathbf{f}}^{\, 0}_{\cdot,n})\) for \(h=1,\ldots,H_l\). Let \(\hat{\mathbf{f}}^{\, l}_{\cdot,n}\) be that sample. Then, use \(\hat{\mathbf{f}}^{\, l}_{\cdot,n}\) as the input for the next layer. This process repeats for \(l=2,\ldots,L\), until obtaining \(\hat{\mathbf{f}}^{\, L}_{\cdot,n} \sim Q(\mathbf{f}^{\,L}_{\cdot,n})\). Figure 3.3 showcases the employed sampling procedure.

Figure 3.3: Sampling from DVIPs marginal posterior. Starting with an input \(\mathbf X_{n}\), compute the variational posterior of the first layer \(Q(\mathbf f^{\, 1}_{\cdot,n})\), and draw a latent function \(\mathbf f^{\, 1}_{\cdot,n}\). That sample is propagated through the next implicit layer, yielding a new posterior \(Q(\mathbf f^{\, 2}_{\cdot,n})\) and, ultimately, the predictive distribution over the observation \(\mathbf y_{n}\).

Making Predictions for New Instances

Let \(\mathbf{f}_{\cdot,\star}^L\) be the values at the last layer for the new instance \(\mathbf{x}_\star\). The distribution \(Q(\mathbf{f}_{\cdot,\star}^L)\) is approximated by propagating \(R\) Monte Carlo samples through the network. Then, \begin{align} \label{eq:dvip_pred} Q(\mathbf{f}_{\cdot,\star}^L) & \approx \frac{1}{R} \sum_{r=1}^R \prod_{h=1}^{H_L} \mathcal{N}\left(f_{h,\star}^L| \hat{m}_{h,\star}^L(\hat{\mathbf{f}}_{\cdot,\star}^{L-1,r}), \hat{v}_{h,\star}^L(\hat{\mathbf{f}}_{\cdot,\star}^{L-1,r})\right)\,, \end{align} where \(\hat{\mathbf{f}}_{\cdot,\star}^{L-1,r}\) is the \(r^{th}\) sample arriving at layer \(L-1\) and \(\hat{m}_{h,\star}^L(\cdot)\) and \(\hat{v}_{h,\star}^L(\cdot)\) are given by Equation \(\eqref{eq:pred_gauss_layer}\). Notice that Equation \(\eqref{eq:dvip_pred}\) is a mixture of \(R\) Gaussians, which is expected to be more flexible than the predictive distribution of VIPs. In the conducted experiments \(R\) is set to \(100\) for testing and to \(1\) for training. Computing, \(P(y_\star) = \mathbb{E}_Q[P(y_\star|\mathbf{f}_{\cdot,\star}^L)]\) is tractable in regression, and can be approximated using \(1\)-dimensional quadrature in binary and multi-class classification, as in the case of DGPs (Salimbeni & Deisenroth, 2017). Assume a Gaussian likelihood for regression problems, \(P(y_\star|\mathbf{f}_{\cdot,\star}^L) = \mathcal{N}(y_\star|\mathbf{f}_{\cdot,\star}^L\,, \bm \sigma^2_L \bm I)\), where \(\bm \sigma^2_L = (\sigma^2_{L,1},\dots,\sigma^2_{L,H_L})^T\), the expectation takes the form: \begin{equation} \begin{aligned} P(y_\star|\mathbf{x}_\star) &= \mathbb{E}_{Q(\mathbf{f}_{\cdot,\star}^L)}[P(y_\star| \mathbf{f}_{\cdot,\star}^L)] = \int Q(\mathbf{f}_{\cdot,\star}^L)P(y_\star| \mathbf{f}_{\cdot,\star}^L) \, d\mathbf{f}_{\cdot,\star}^L\\ &= \frac{1}{R} \sum_{r=1}^R \prod_{h=1}^{H_L} \mathcal{N}\Big(f_{h,\star}^L| \hat{m}_{h,\star}^L(\hat{\mathbf{f}}_{\cdot,\star}^{L-1,r})\,, \hat{v}_{h,\star}^L(\hat{\mathbf{f}}_{\cdot,\star}^{L-1,r}) + \sigma^2_{L,h}\Big)\,. \end{aligned} \end{equation}

Remark. Using a Gaussian Likelihood is equivalent to considering Gaussian noise at the final layer similarly to the noise employed in the inner layers. The noise on the Gaussian likelihood is denoted as \(\bm \sigma^2_L\) to highlight this fact.

Input Propagation

Inspired by the skip layer approach of deep convolutional networks, e.g. ResNet (He et al., 2016), and the addition of the original input to each layer on DGPs (Duvenaud et al., 2014; Salimbeni & Deisenroth, 2017), the same approach is implemented on DVIPs. For this, the previous input is added to the mean in Equation \(\eqref{eq:pred_gauss_layer}\) if the input and output dimensions of the layer are the same, except in the last layer where the mean does not change. In short, \begin{equation} \hat{m}^l_{h,n}(\mathbf{f}^{\, l-1}_{\cdot,n}) = \bm{\phi}_h^l(\mathbf{f}^{\, l-1}_{\cdot,n})^\text{T} \bm{m}^l_h + m_{h,l}^\star(\mathbf{f}^{\, l-1}_{\cdot,n}) + f_{h,n}^{l-1}\mathbb{I}[l \neq L] \quad \forall l = 1,\dots,L\,. \end{equation}

Computational Complexity

The computational cost at layer \(l\) in DVIPs is \(\mathcal{O}(BS^2H_l + BSC_l)\), where \(S\) is the number of samples from the prior IPs, \(B\) is the size of the mini-batch, \(H_l\) is the output dimension of the layer, and \(C_l\) is the cost of sampling from the prior. Notice the same number of prior samples is used for all the layers for simplicity. Moreover, only the diagonal of the covariance matrices is being computed, as in VIPs.

The first term of the cost is similar to that of a DGPs, which have a squared cost in terms of the number of inducing points (Bui et al., 2016; Hensman et al., 2013; Salimbeni & Deisenroth, 2017), notice that in that case, only the diagonal of the covariance matrix is computed; otherwise, the cost would be cubic. However, the number of prior IP samples \(S\) is smaller than the typical number of inducing points in DGPs. In fact, the experiments carried out \(S = 20\), as suggested for VIPs (Ma et al., 2019). Considering a DVIP with \(L\) layers, the total cost is in \(\mathcal{O}(BS^2(H_1 + \dots + H_L) + BS(C_1 + \dots + C_L))\). Where the second term is negligible unless heavy models are used as prior. Experiments show that, despite the generation of the prior IP samples using BNNs, DVIPs are faster compared to DGPs, and the gap between the two becomes bigger as \(L\) increases.

The relationship between GPs and IPs, such as BNNs, has been extensively studied recently. A single-layer BNN with cosine activations and infinite width is equivalent to a GP with RBF kernel (Hensman et al., 2017). A deep BNN is equivalent to a GP with a compositional kernel (Cho & Saul, 2009), as shown in (Lee et al., 2018). These methods make it possible to create expressive kernels for GPs. An inverse reasoning is used in (Flam-Shepherd et al., 2017) where the properties of a GP prior are encoded into the weight priors of a BNN.

As described in Section 3.2.1, VIPs (Ma et al., 2019) arise from the treatment of BNNs as instances of IPs. For this, an approximate GP is used to assist inference. Specifically, a prior GP is built with mean and covariance function given by the prior IP, a BNN. VIPs can make use of the more flexible IP prior, whose parameters can be inferred from the data, resulting in improved results over GPs (Ma et al., 2019). However, it is limited by the fact that VIP’s predictive distribution is Gaussian, as a consequence of the GP approximation. Deep variational implicit processes (DVIPs) overcome this problem by providing a non-Gaussian predictive distribution. It is also expected to lead to a more flexible model with better calibrated uncertainty estimates.

Besides VIP there are other methods that have tried to make inference using IPs. In (Sun et al., 2019) functional Bayesian neural networks (fBNN) are introduced, where a second IP is used to approximate the posterior of the first IP. This is a more flexible approximation than that of VIP. However, because both the prior and the posterior are implicit, the noisy gradient of the variational ELBO is intractable and has to be approximated. For this, a spectral gradient estimator is used (Shi et al., 2018). To ensure that the posterior IP resembles the prior IP in data-free regions, fBNN relies on uniformly covering the input space. In high-dimensional spaces this can lead to poor results. As a consequence of the spectral gradient estimator, fBNN cannot tune the prior IP parameters to the data. In the particular case of a GP prior, fBNN simply maximizes the marginal likelihood of the GP w.r.t. the prior parameters. However, a GP prior implies a GP posterior. This questions using a second IP for posterior approximation.

Sparse implicit processes (SIPs) consider an inducing point based approach for approximate inference in the context of IPs (Santana et al., 2021). SIP does not have the limitations of either VIP or fBNN, given that it produces flexible predictive distributions (Gaussian mixtures) and it can adjust the prior parameters to the data. SIP, however, relies on a classifier to estimate the KL-term in the variational ELBO, which adds computational cost. SIP’s improvements over VIP are orthogonal to those of DVIP over VIP and, in principle, SIP could also be used as the building blocks of DVIP, leading to even better results.

Functional variational inference (FVI, (Ma & Hernández-Lobato, 2021)) minimizes the KL-divergence between stochastic processes for approximate inference using IPs. Specifically, between the model’s IP posterior and a second IP, such as fBNN. This is done efficiently by approximating first the IP prior using a stochastic process generator (SPG). Then, a second SPG is used to efficiently approximate the posterior of the previous SPG.Both SPGs share key features that make this an easy task. However, FVI is also limited, like fBNN, in the sense that it cannot adjust the prior to the data, which questions its practical utility.

As shown in (Santana et al., 2021), adjusting the prior IP to the observed data is key for accurate predictions. This questions using fBNN and FVI as the building blocks of a model using deep IP priors on the target function. Moreover, these methods do not consider deep architectures such as the one shown in Figure 3.2. For this reason, comparative results focus on comparing with VIP, as DVIP generalizes VIP.

The concatenation of IPs with the goal of describing priors over functions has not been studied previously in the literature. However, the concatenation of GPs, following a multi-layer architecture and resulting in deep GPs (DGPs), has received a lot of attention from the community (Bui et al., 2016; Cutajar et al., 2017; Havasi et al., 2018; Lawrence & Moore, 2007; Salimbeni & Deisenroth, 2017; Yu et al., 2019). In principle, DVIP is a generalization of DGPs in the same way as IPs generalize GPs. Namely, the IP prior of each layer’s unit can simply be a GP. Samples from such a GP prior can be efficiently obtained using, e.g., a 1-layer BNN with cosine activation functions that is wide enough (Cutajar et al., 2017; Rahimi & Recht, 2007). DGPs are hence restricted to GP priors, while DVIP has the extra flexibility of considering a wider range of IP priors that need not be GPs. As a matter of fact, in the conducted experiments, DVIP significantly outperforms DGPs in image-related datasets, where using specific IP priors based on convolutional neural networks, can give a significant advantage. Experiments also show that DVIP is faster to train than a DGP, and the difference becomes larger with the number of layers \(L\).

Conducted Experiments

DVIPs have been evaluated on several tasks, including time series interpolation, regression, and classification problems. Unless stated otherwise, the number of linear regression coefficients is set to \(S = 20\) and a BNN as the IP prior for each unit for all the layers. These BNNs have two hidden layers of \(10\) units each with tanh as the activation function, as in (Ma et al., 2019). The prior mean and variance of each weight and bias in the BNN have been tuned. As no regularizer is used for the prior parameters, the prior mean and variances are constrained to be the same in a layer of the BNN. This configuration avoids overfitting and leads to improved results. See Section 3.5.6.1 for further details on the matter.

To speed up computations, at each layer \(l\) the generative function that defines the IP prior is shared across units. That is, the function \(g_h^l\) is the same for every unit \(h\) in that layer. As a consequence, the prior IP samples only need to be generated once per layer, as in (Ma et al., 2019). This assumption is similar to that of DGPs, which assume shared GP prior parameters for each layer (Salimbeni & Deisenroth, 2017). Under these assumptions and simplifications, each VIP layer contains \(12\) tunable parameters in its prior, which are the mean and variance of each weight and bias of the three layers in its BNN.

To evaluate the performance of DVIPs, the model is compared with VIPs (Ma et al., 2019) and DGPs, closely following (Salimbeni & Deisenroth, 2017). In DGPs \(100\), shared inducing points in each layer are considered. The employed optimization algorithm is ADAM (Kingma & Ba, 2015) with a learning rate \(10^{-3}\), as in (Ma et al., 2019). In DGPs, the learning rate is set to \(10^{-2}\), as in (Salimbeni & Deisenroth, 2017).

Following (Salimbeni & Deisenroth, 2017), unless indicated otherwise, in DVIPs and DGPs the input dimension is used as the layer dimensionality for all the inner layers, i.e \(H_l = M\), for \(l=1,\dots,L-1\). In DGPs the kernel employed is RBF with ARD (Rasmussen & Williams, 2006). The batch size is \(100\). All methods are trained for \(150\ 000\) iterations unless indicated otherwise.

The marginal likelihood regularizer described in (Ma et al., 2019) is not employed in VIPs nor VIP layers since the authors indicated that they did not use it in practice. Black-box \(\alpha\)-divergences are used for VIPs with \(\alpha = 0.5\), as suggested in (Ma et al., 2019).

Results are not compared with fBNNs nor FVIs, described in Section 3.4, because these methods cannot tune the prior IP parameters to the data nor do they consider deep architectures such as the one shown in Figure 3.2 (right). The source code can be accessed at https://github.com/Ludvins/2023-ICLR-DVIP.

Regression UCI benchmarks

DVIPs are compared with DGPs and VIPs on 8 different regression datasets from the UCI Repository (Dua & Graff, 2019). Following common practice, the performance is validated on each dataset using 20 different train-test partitions of the data with \(10\%\) test size (Hernández-Lobato & Adams, 2015; Salimbeni & Deisenroth, 2017).

Both DVIPs and DGPs are tested using 2, 3, 4, and 5 layers. Moreover, VIPs, which are equivalent to DVIPs with \(L=1\), are tested with the same prior as DVIPs and using a bigger BNN of 200 units per layer. Furthermore, results are compared with a single sparse GPs, which is equivalent to DGPs for \(L=1\). Figure 3.4 shows the results obtained in terms of the negative test log-likelihood. Exact results are found in Table tab:full_uci.

Results show that DVIPs with at least 3 layers perform best on 4 out of the 8 datasets (Boston, Energy, Concrete and Power), having comparable results on Winered and Naval (all methods have zero RMSE on this dataset). DGPs perform best on 2 datasets (Protein and Kin8nm), but the differences are small. Adding more layers in DVIPs does not lead to overfitting and it gives similar and often better results (more notably on larger datasets: Naval, Protein, Power and Kin8nm). DVIPs also perform better than VIPs most of the time. By contrast, using a more flexible BNN prior in VIPs (i.e., 200 units) does not improve results. Figure 3.5 shows the training time in seconds of each method. These execution times show that DVIPs are faster than DGPs and faster than VIPs with the 200 units BNN prior. Summing up, DVIPs achieve similar results to those of DGPs, but at a smaller cost.

Figure 3.4: Negative test log–likelihood (NLL) on eight UCI regression benchmarks. Each panel corresponds to one dataset (Boston, Energy, Concrete, WineRed, Power, Naval, Protein, and Kin8nm). Dots mark the mean NLL across 20 random train–test splits; horizontal bars indicate one standard error. Colors distinguish single–layer models (blue), the proposed method (green), and deep GP baselines (orange). The dashed vertical line in each panel shows VIP as baseline for reference. Lower values (further to the left) indicate better predictive performance.
Figure 3.5: CPU training time on eight UCI regression benchmarks. Each panel corresponds to one dataset (Boston, Energy, Concrete, WineRed, Power, Naval, Protein, and Kin8nm). Dots mark the mean NLL across 20 random train–test splits; horizontal bars indicate one standard error. Colors distinguish single–layer models (blue), the proposed method (green), and deep GP baselines (orange). The dashed vertical line in each panel shows VIP as baseline for reference. Lower values (further to the left) indicate lower training time.

Interpolation Results

As part of the empirical evaluation, experiments on the CO2 time-series dataset (https://scrippsco2.ucsd.edu) are carried out. This dataset consists of \(\text{CO}_2\) measurements from the Mauna Loa Observatory, Hawaii, in 1978. The dataset is split into five consecutive and equal parts, with the 2nd and 4th parts used as missing testing data. All models are trained for \(100\ 000\) iterations. Figure 3.6 shows the predictive distribution of DVIPs and DGPs with \(L=2\) on the data. Results show that DVIPs can capture the data trend in the missing gaps. For DVIPs, samples from the learned prior are shown, which are very smooth. By contrast, a DGP with RBF kernels fails to capture the data trend, leading to mean reversion and over-estimation of the prediction uncertainty. Thus, the BNN prior considered by DVIP could be a better choice here. This issue of DGPs can be overcome using compositional kernels (Duvenaud et al., 2014), but that requires using kernel search algorithms.

image
image

Figure 3.6: Missing–value interpolation on the Mauna Loa CO\(_2\) time series. Monthly CO\(_2\) concentrations (1958–2022) are divided into five consecutive, equal–length segments; the 2\(^\text{nd}\) and 4\(^\text{th}\) segments (black crosses) are withheld as test gaps. Both models are trained for 100 000 iterations and use \(L=2\) layers. Upper: the proposed DVIP. The dashed red curve is the predictive mean, the orange band shows \(\pm2\sigma\), and faint red lines are samples from the learned (smooth) prior. Lower: a DGP with RBF kernels. The model suffers from mean-reversion in the gaps between the training data.

Large Scale Regression

Each method is evaluated on 3 large regression datasets. First, the Year dataset (UCI) (Bertin-Mahieux, 2011) with \(515\ 345\) instances and \(90\) features, where the original train/test splits are used. Second, the US flight delay (Airline) dataset (Dutordoir et al., 2020; Hensman et al., 2017), where following (Salimbeni & Deisenroth, 2017) the first \(700\ 000\) instances are used for training and the next \(100\ 000\) for testing. A total of \(8\) features are considered: month, day of month, day of week, plane age, air time, distance, arrival time and departure time. Lastly, data recorded in January 2015 from the Taxi dataset is used. In this dataset \(10\) attributes are considered: time of day, day of week, day of month, month, pickup latitude, pickup longitude, drop-off longitude, drop-off latitude, trip distance and trip duration. Trips with a duration lower than \(10\) seconds and larger than \(5\) hours are removed as in (Salimbeni & Deisenroth, 2017), leaving \(12\ 695\ 289\) instances. Results are averaged over \(20\) train/test splits with \(90\) % and \(10\) % of the data. Here, each method is trained for \(500\ 000\) iterations. The results obtained are shown in Table 3.1. The last column shows the best result by DGPs, which is achieved for \(L=3\) on each dataset.

It can be observed that DVIPs outperform VIPs on all datasets, and on Airline and Taxi, the best methods are DVIPs. In Taxi, SGPs and DGPs give similar results, while DVIPs improve over VIPs. The best method on Year, however, is DGPs. Since the difference between DVIPs and DGPs is found in the prior, one could hypothesize that GPs with an RBF kernel may be a better prior on this dataset than the BNN IP prior considered by DVIPs. Therefore, since DVIPs generalize DGPs, DVIPs using GP priors should give similar results to those of DGPs on Year. This is further analyzed in Section 3.5.6.4.

Table 3.1: Large–scale regression: Year, Airline, and Taxi datasets. Three error metrics—root mean-squared error (RMSE), negative log-likelihood (NLL), and continuous ranked probability score (CRPS)— are reported for single-layer baselines (SGP and VIP), deep variational implicit processes (DVIP) of depth \(L=2\text{--}5\), and the best deep GP \(L = 3\). Lower values indicate better performance; The best metric is highlighted in purple; the second‐best in teal.
RMSE Single-layer Ours (Salimbeni & Deisenroth, 2017)
SGP VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5 Best DGP
Year \(\bm{9.15} \pm \bm{0.01}\) \(10.27 \pm 0.01\) \(9.61 \pm 0.03\) \(9.34 \pm 0.02\) \(9.30 \pm 0.03\) \(9.27 \pm 0.03\) \(\bm{8.94} \pm \bm{0.03}\)
Airline \(38.61 \pm 0.05\) \(38.90 \pm 0.06\) \(37.96 \pm 0.03\) \(37.91 \pm 0.05\) \(\bm{37.83} \pm \bm{0.03}\) \(\bm{37.80} \pm \bm{0.05}\) \(37.95 \pm 0.04\)
Taxi \(554.22 \pm 0.32\) \(554.60 \pm 0.19\) \(549.28 \pm 0.59\) \(\bm{531.42} \pm \bm{1.59}\) \(547.33 \pm 1.03\) \(\bm{538.94} \pm \bm{2.23}\) \(552.90 \pm 0.33\)
NLL Single-layer Ours (Salimbeni & Deisenroth, 2017)
SGP VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5 Best DGP
Year \(\bm{3.62} \pm \bm{0.00}\) \(3.74 \pm 0.00\) \(3.68 \pm 0.00\) \(3.64 \pm 0.00\) \(3.64 \pm 0.00\) \(3.63 \pm 0.00\) \(\bm{3.59 \pm 0.00}\)
Airline \(5.10 \pm 0.00\) \(5.11 \pm 0.00\) \(5.08 \pm 0.00\) \(\bm{5.07} \pm \bm{0.00}\) \(\bm{5.07} \pm \bm{0.00}\) \(\bm{5.06 \pm 0.00}\) \(\bm{5.07} \pm \bm{0.00}\)
Taxi \(7.73 \pm 0.00\) \(7.73 \pm 0.00\) \(7.72 \pm 0.00\) \(\bm{7.69 \pm 0.00}\) \(7.72 \pm 0.00\) \(\bm{7.70} \pm \bm{0.00}\) \(7.73 \pm 0.00\)
CRPS Single-layer Ours (Salimbeni & Deisenroth, 2017)
SGP VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5 Best DGP
Year \(\bm{4.83} \pm \bm{0.01}\) \(5.45 \pm 0.01\) \(5.13 \pm 0.02\) \(4.96 \pm 0.01\) \(4.93 \pm 0.01\) \(4.91 \pm 0.02\) \(\bm{4.68 \pm 0.01}\)
Airline \(17.90 \pm 0.05\) \(17.93 \pm 0.04\) \(17.53 \pm 0.05\) \(17.54 \pm 0.07\) \(17.51 \pm 0.05\) \(\bm{17.45 \pm 0.04}\) \(\bm{17.47 \pm 0.03}\)
Taxi \(283.79 \pm 0.19\) \(284.22 \pm 0.20\) \(282.09 \pm 0.32\) \(\bm{274.65 \pm 0.68}\) \(281.28 \pm 0.44\) \(\bm{277.60} \pm \bm{0.90}\) \(282.99 \pm 0.21\)

Image Classification

Binary classification experiments were carried out using the Rectangles dataset (Salimbeni & Deisenroth, 2017) and the multi-class dataset MNIST (Deng, 2012). Each dataset has \(28 \times 28\) pixel images. The Rectangles dataset has \(12\ 000\) images of a (non-square) rectangle. The task is to determine if the height is larger than the width. For this dataset, a probit likelihood is employed for all the models. The MNIST dataset has \(70\ 000\) images of handwritten digits. The labels correspond with each digit. In this case, the robust-max multi-class likelihood (Hernández-Lobato et al., 2011) is used. Given the size of Rectangles, \(20\ 000\) iterations are enough to ensure convergence. Results on this dataset are averaged over \(10\) different random seeds, but the standard errors are omitted as they are always lower than \(0.1\). The original train-test splits are used for both datasets. Examples of images from these datasets are shown in Figure 3.7.

image image
Figure 3.7: Example inputs from the image–classification benchmarks. Left: a \(28\times 28\) grayscale MNIST digit belonging to class “3”. Right: a \(28\times 28\) sample from the Rectangles dataset, where the inner dark rectangle is taller than it is wide. Pixel intensities are rendered in grayscale for clarity.

Critically, DVIP’s capability to use more flexible priors is exploited. Specifically, in the first layer, a convolutional NN (CNN) prior with two layers of 4 and 8 channels, respectively, is employed. The inner dimensions of DVIPs and DGPs are fixed to \(30\), as in (Salimbeni & Deisenroth, 2017). No input propagation is used in the first layer. The results obtained are shown in Table 3.2. These show that DVIPs obtain much better results than those of DGPs and VIPs. In particular, DVIPs increase the accuracy by \(11\%\) on Rectangles compared to DGPs, probably as a consequence of the CNN prior considered in the first layer of the network being more suited for image-based datasets.

Table 3.2: Image–classification benchmarks: MNIST and Rectangles. Test accuracy (in %), average log–likelihood (higher is better, i.e. less negative), and area under the ROC curve (AUC, Rectangles only) are reported for single–layer baselines (SGP and VIP), deep variational implicit processes with \(L=2\text{--}5\) layers (DVIP \(k\)), and deep Gaussian processes (DGP 2–3) from Salimbeni & Deisenroth (2017). Each entry shows the mean \(\pm\) one standard error. The best metric is highlighted in purple, the second best in teal.
MNIST Single-layer Ours (Salimbeni & Deisenroth, 2017)
SGP VIP DVIP 2 DVIP 3 DGP 2 DGP 3
Acc. (%) \(96.25 \pm 0.04\) \(97.99 \pm 0.03\) \(\bm{98.39} \pm \bm{0.05}\) \(\bm{98.36} \pm \bm{0.04}\) \(97.75 \pm 0.04\) \(97.86 \pm 0.05\)
NLL \(0.146 \pm 0.00\) \(0.144 \pm 0.00\) \(\bm{0.074} \pm \bm{0.00}\) \(0.080 \pm 0.00\) \(0.082 \pm 0.00\) \(\bm{0.072} \pm \bm{0.00}\)
Rectangles Single-layer Ours (Salimbeni & Deisenroth, 2017)
SGP VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5 DGP 3
Acc. (%) \(72.54 \pm 0.14\) \(85.63 \pm 0.18\) \(87.84 \pm 0.20\) \(\bm{88.21} \pm \bm{0.12}\) \(\bm{87.43} \pm \bm{0.20}\) \(86.49 \pm 0.17\) \(75.16 \pm 0.16\)
NLL \(0.518 \pm 0.00\) \(0.348 \pm 0.00\) \(\bm{0.306} \pm \bm{0.00}\) \(\bm{0.295} \pm \bm{0.00}\) \(0.309 \pm 0.00\) \(0.320 \pm 0.00\) \(0.470 \pm 0.00\)
AUC \(0.828 \pm 0.00\) \(0.930 \pm 0.00\) \(\bm{0.950} \pm \bm{0.00}\) \(\bm{0.953} \pm \bm{0.00}\) \(0.947 \pm 0.00\) \(0.939 \pm 0.00\) \(0.858 \pm 0.00\)

Large Scale Classification

Each method is evaluated on two massive binary datasets: SUSY and HIGGS, with \(5.5\) million and \(10\) million instances, respectively. These datasets contain Monte Carlo physics simulations to detect the presence of the Higgs boson and super-symmetry (Baldi et al., 2014). The original train/test splits of the data are used, and trained for \(500\ 000\) iterations. The AUC metric is reported for comparison with (Baldi et al., 2014) and (Salimbeni & Deisenroth, 2017). Results are shown in Table 3.3, averaged over 10 different random seed initializations. In the case of DGPs, the best results are shown, which correspond to \(L=4\) and \(L=5\), respectively. DVIP achieves the highest performance on SUSY (AUC of \(0.8756\)) which is comparable to that of DGPs (\(0.8751\)) and to the best reported results in (Baldi et al., 2014). Namely, shallow NNs (NN, \(0.875\)), deep NN (DNN, \(0.876\)) and boosted decision trees (BDT, \(0.863\)). On HIGGS, despite seeing a steady improvement over VIP by using additional layers, the performance is worse than that of DGP (AUC \(0.8324\)). Again, GPs with an RBF kernel may be a better prior here, and DVIP using inducing points and a GP prior should give similar results to those of DGP. However, the high computational cost of approximately sampling from the GP prior will make this too expensive.

Table 3.3: Large–scale binary classification: SUSY and HIGGS. Single–layer baselines (SGP and VIP), DVIP with \(L=2\text{--}5\) layers (DVIP \(k\)), and the best deep Gaussian process from Salimbeni & Deisenroth (2017) (DGP 4 for SUSY and DGP 5 for HIGGS). Metrics are test accuracy (in %), negative log–likelihood (NLL; lower is better), and area under the ROC curve (AUC). Entries show the mean \(\pm\) one standard error. Best is purple; second best is teal.
SUSY Single-layer Ours (Salimbeni & Deisenroth, 2017)
SGP VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5 DGP 4
Acc. (%) \(79.75 \pm 0.02\) \(78.68 \pm 0.02\) \(80.11 \pm 0.03\) \(80.13 \pm 0.01\) \(\bm{80.22} \pm \bm{0.01}\) \(\bm{80.24} \pm \bm{0.02}\) \(80.06 \pm 0.01\)
NLL \(0.436 \pm 0.00\) \(0.456 \pm 0.00\) \(\bm{0.429} \pm \bm{0.00}\) \(\bm{0.429} \pm \bm{0.00}\) \(\bm{0.427} \pm \bm{0.00}\) \(\bm{0.427} \pm \bm{0.00}\) \(0.432 \pm 0.00\)
AUC \(0.872 \pm 0.00\) \(0.857 \pm 0.00\) \(0.874 \pm 0.00\) \(0.874 \pm 0.00\) \(\bm{0.875} \pm \bm{0.00}\) \(\bm{0.875} \pm \bm{0.00}\) \(0.875 \pm 0.00\)
SUSY Single-layer Ours (Salimbeni & Deisenroth, 2017)
SGP VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5 DGP 5
Acc. (%) \(69.95 \pm 0.03\) \(57.42 \pm 0.03\) \(66.09 \pm 0.02\) \(69.85 \pm 0.02\) \(70.43 \pm 0.01\) \(\bm{72.01} \pm \bm{0.02}\) \(\bm{74.92} \pm \bm{0.01}\)
NLL \(0.573 \pm 0.00\) \(0.672 \pm 0.00\) \(0.611 \pm 0.00\) \(0.575 \pm 0.00\) \(0.565 \pm 0.00\) \(\bm{0.542} \pm \bm{0.00}\) \(\bm{0.501} \pm \bm{0.00}\)
AUC \(0.769 \pm 0.00\) \(0.624 \pm 0.00\) \(0.719 \pm 0.00\) \(0.770 \pm 0.00\) \(0.778 \pm 0.00\) \(\bm{0.796} \pm \bm{0.00}\) \(\bm{0.832} \pm \bm{0.00}\)

Ablation Studies

The results presented in the previous sections highlight the overall performance and calibration properties of Deep Variational Implicit Processes across multiple benchmarks. To better understand the internal behavior of the model, a closer analysis of its components is conducted through a set of targeted ablation experiments. These experiments quantify the individual contribution of each architectural and inference choice to the final predictive performance, shedding light on which design aspects are most critical for stability and accuracy.

Impact of the Constrained Prior

Figure 3.8 shows, on a toy problem, the obtained predictive distributions and learned prior samples of VIP, for \(\alpha = 0\) (see Section for further information about \(\alpha\)-divergences). This corresponds to a 1-layer particular case of DVIP. Two cases are considered: (1) using a full unconstrained prior BNN, in which the mean and variance of each weight and bias can be independently tuned (shown above), and (2) the considered constrained BNN in which prior means and variances are shared across layers (shown below). The second approach considerably reduces the number of parameters in the prior, and it generates smoother prior functions. The predictive distribution is also smoother than when the prior is unconstrained. However, despite providing better results by avoiding overfitting, there might be datasets where using the full unconstrained parameterization of the BNN leads to improved results. For example, in problems where a more flexible prior may be beneficial to obtain good generalization properties on unseen data.

image
image

Figure 3.8: Toy regression: effect of the prior on predictive behavior. Training inputs (blue dots) are fitted with a two–layer BNN. Top: model trained with a full (unconstrained) weight prior. Bottom: model trained with a constrained weight prior that encourages smoother functions. In each panel the dashed orange curve is the predictive mean, the shaded band shows \(\pm2\) predictive standard deviations, and faint red curves are draws from the prior. The constrained prior yields a markedly smoother posterior and more moderate uncertainty outside the data range compared with the full prior.

Consider the Boston and the Energy datasets. To highlight that prior over-fitting is a dataset-specific matter, Table 3.4 shows the obtained results using an unconstrained prior and a constrained prior for VIP on the aforementioned datasets. As one may see, significantly over-fitting is taking place on Boston. More precisely, the training error improves a lot when using the unconstrained prior. By contrast, test error and other performance metrics deteriorate on the test set. In Energy, however, a more flexible prior results in better test RMSE and CRPS, but worse test log-likelihood.

Table 3.4: Effect of the BNN prior on VIP performance (Boston and Energy). Unconstrained and constrained Bayesian‐NN priors are compared within the VIP framework (\(\alpha=0\)). Shown are training and test values for root mean‐squared error (RMSE), negative log‐likelihood (NLL), and continuous ranked probability score (CRPS). The unconstrained prior fits the training data extremely well but generalizes poorly—most notably on Boston, where the test NLL explodes. Imposing the constraint regularizes the prior, yielding lower test RMSE, NLL, and CRPS on both datasets, at the cost of higher training error. Numbers are means \(\pm\) one standard error over 20 random splits.
Unconstrained RMSE Train RMSE Test NLL Train NLL Test CRPS Train CRPS Test
Boston \(0.05 \pm 0.00\) \(5.85 \pm 0.14\) \(-1.21 \pm 0.19\) \(5126.07 \pm 274.79\) \(0.03 \pm 0.00\) \(4.31 \pm 0.08\)
Energy \(0.14 \pm 0.00\) \(0.57 \pm 0.01\) \(-0.51 \pm 0.01\) \(6.52 \pm 0.42\) \(0.079 \pm 0.00\) \(0.36 \pm 0.01\)
Constrained RMSE Train RMSE Test NLL Train NLL Test CRPS Train CRPS Test
Boston \(3.90 \pm 0.02\) \(4.73 \pm 0.24\) \(2.77 \pm 0.00\) \(23.03 \pm 0.07\) \(2.06 \pm 0.01\) \(2.40 \pm 0.08\)
Energy \(2.35 \pm 0.03\) \(2.57 \pm 0.08\) \(2.27 \pm 0.01\) \(2.07 \pm 0.02\) \(1.28 \pm 0.01\) \(1.27 \pm 0.04\)

Impact of the Number of Prior Samples

The impact of the number of prior samples \(S\) has on the performance and training time of the proposed method is explored. Table 3.5 shows the results obtained using DVIP with 3 layers on Protein and Power datasets (UCI). These results show how increasing the number of prior samples produces better results in terms of performance compared to using lower values of \(S\). However, it is important to consider that this value scales quadratically the computational complexity of evaluating the ELBO, heavily influencing the training cost of the model.

Table 3.5: Impact of the number of prior samples \(S\) on three–layer DVIP. For the UCI Power and Protein datasets test root mean–squared error (RMSE), negative log–likelihood (NLL), continuous ranked probability score (CRPS), and total training time (CPU seconds) as the number of prior samples varies from \(S=10\) to \(S=50\) are reported. Increasing \(S\) yields modest yet consistent gains in predictive metrics, at the cost of an approximately linear rise in computation. Entries are means \(\pm\) one standard error over 20 random splits.
Power \(S = 10\) \(S = 20\) \(S = 30\) \(S = 40\) \(S = 50\)
RMSE \(4.03 \pm 0.04\) \(4.01 \pm 0.04\) \(3.94 \pm 0.04\) \(3.92 \pm 0.04\) \(3.94 \pm 0.04\)
NLL \(2.81 \pm 0.01\) \(2.81 \pm 0.00\) \(2.79 \pm 0.01\) \(2.78 \pm 0.01\) \(2.79 \pm 0.01\)
CRPS \(2.19 \pm 0.01\) \(2.18 \pm 0.01\) \(2.14 \pm 0.01\) \(2.11 \pm 0.01\) \(2.13 \pm 0.01\)
CPU Time (s) \(2693 \pm 19\) \(2806 \pm 22\) \(3152 \pm 20\) \(3451 \pm 63\) \(3742 \pm 55\)
Protein \(S = 10\) \(S = 20\) \(S = 30\) \(S = 40\) \(S = 50\)
RMSE \(4.53 \pm 0.01\) \(4.40 \pm 0.01\) \(4.28 \pm 0.01\) \(4.27 \pm 0.01\) \(4.21 \pm 0.01\)
NLL \(2.92 \pm 0.00\) \(2.90 \pm 0.00\) \(2.87 \pm 0.00\) \(2.86 \pm 0.00\) \(2.85 \pm 0.00\)
CRPS \(2.52 \pm 0.00\) \(2.43 \pm 0.00\) \(2.36 \pm 0.00\) \(2.34 \pm 0.00\) \(2.31 \pm 0.00\)
CPU Time (s) \(2334 \pm 28\) \(2734 \pm 19\) \(3616 \pm 11\) \(3727 \pm 35\) \(4330 \pm 52\)

Robustness over the Prior BNN Architecture

In this section, the impact that changing the prior BNN structure has over the performance of the proposed method is studied. Table 3.6 shows the results obtained using DVIP with 2, 3, and 4 layers on Protein and Power datasets (UCI) with two different BNN structures, 2 hidden layers with 20 units (20-20) and 3 hidden layers with 10 units (10-10-10). These results (that are to be compared with the original ones obtained using 2 hidden layers and 10 units on Table tab:full_uci) show that changing the structure of the BNN does not heavily affect the obtained results, given that it is capable of learning similar function distributions with the different BNN architectures.

Table 3.6: Influence of the prior BNN architecture on DVIP performance. Two priors—a deeper but narrower three–layer network (10–10–10) and a wider two–layer network (20–20)—within DVIPs of depth \(L=2\text{--}4\) are compared. Test root mean–squared error (RMSE), negative log–likelihood (NLL), and continuous ranked probability score (CRPS) are reported for the UCI Power and Protein datasets. Performance differences across priors are modest: the wider 20–20 prior slightly improves several metrics on Power, while results on Protein remain largely unchanged. Entries are means \(\pm\) one standard error over 20 random splits.
BNN 10-10-10 BNN 20-20
Power DVIP \(2\) DVIP \(3\) DVIP \(4\) DVIP \(2\) DVIP \(3\) DVIP \(4\)
RMSE \(4.02 \pm 0.04\) \(3.94 \pm 0.04\) \(3.97 \pm 0.04\) \(4.02 \pm 0.04\) \(3.99 \pm 0.04\) \(3.93 \pm 0.04\)
NLL \(2.81 \pm 0.01\) \(2.79 \pm 0.00\) \(2.80 \pm 0.01\) \(2.81 \pm 0.01\) \(2.80 \pm 0.01\) \(2.79 \pm 0.01\)
CRPS \(2.18 \pm 0.01\) \(2.13 \pm 0.01\) \(2.15 \pm 0.01\) \(2.18 \pm 0.01\) \(2.16 \pm 0.01\) \(2.13 \pm 0.01\)
Protein DVIP \(2\) DVIP \(3\) DVIP \(4\) DVIP \(2\) DVIP \(3\) DVIP \(4\)
RMSE \(4.57 \pm 0.01\) \(4.37 \pm 0.01\) \(4.29 \pm 0.01\) \(4.55 \pm 0.01\) \(4.43 \pm 0.01\) \(4.31 \pm 0.01\)
NLL \(2.93\pm 0.00\) \(2.89 \pm 0.00\) \(2.87 \pm 0.00\) \(2.93 \pm 0.00\) \(2.90 \pm 0.00\) \(2.87 \pm 0.00\)
CRPS \(2.55 \pm 0.00\) \(2.42 \pm 0.00\) \(2.36 \pm 0.00\) \(2.54 \pm 0.00\) \(2.45 \pm 0.00\) \(2.37 \pm 0.00\)

Using a GP as Prior in DVIP and VIP

In this section, the use of a GP prior in DVIP to approximate GPs and hence DGPs is explored. From a theoretical point of view, GP samples could be used in VIPs prior, ensuring that VIP does converge to a GP when the number of prior samples (and linear regression coefficients) is large enough. As a result, DVIP can converge to a DGP. However, DVIP needs to evaluate covariances among the process values at the training points. This requires taking continuous samples from a GP, something that is not possible in practice unless one samples the process values at all the training points, something that is intractable for big datasets. To surpass this limitation, a BNN with a single layer of cosine activation functions can be used to approximate a GP with RBF kernel (Rahimi & Recht, 2007). Generating continuous samples from this BNN is easy. One only has to sample the weights from their corresponding prior distribution. However, in order to correctly approximate the GP, a large number of hidden units are needed in the BNN, increasing the computational cost of taking such samples.

Furthermore, in many situations, the predictive distribution of a sparse GP is very different from that of a full GP. Meaning that even when using an approximate GP prior in VIP, by means of a BNN with enough hidden units, it may not be enough to accurately approximate a sparse GP. Specifically, this is the case in the Year dataset. In this dataset, the difference in performance between DVIP and DGP is not only a consequence of the different prior, but also the posterior approximation. More precisely, DVIP uses a linear regression approximation to the GP, while DGP uses an inducing points based approximation. To show this, an inducing points approximation is proposed on VIP. For this, the required covariance matrices are estimated using a large number of prior samples. The obtained results can be seen in Table 3.7. There, the results of VIP using an approximated GP prior, using both the linear regression approximation and an approximation based on inducing points. Results of the sparse GP and the average training time of each method in seconds are also reported. The table shows that VIP can be used with inducing points to approximate a GP. Specifically, the inducing points approximation of VIP gives very similar results to those of a sparse GP. However, this is achieved at a prohibitive training time. The computational bottlenecks are the GP approximation using a wide single layer BNN (\(2\ 000\) units in the hidden layer), and the generation of a large number of prior samples from the BNN to approximate the covariance matrix (\(2\ 000\) samples). Given the high computational cost of training a VIP model on this dataset, considering DVIP with more than 1 layer is too expensive.

Table 3.7: VIP versus GP–based variants on the Year dataset. The baseline VIP employs a two–layer BNN prior (width \(=10\), tanh activations) and a linear–regression approximation to the GP. VIP-GP (linear regression) replaces the BNN by a wide single–layer cosine network that approximates an RBF GP but keeps the linear–regression posterior. VIP-GP (inducing points) further switches to an inducing–point posterior with \(100\) pseudo–inputs. A sparse GP (SGP) with the same \(100\) inducing points is included for reference. While the inducing–point version of VIP nearly matches the SGP in predictive metrics (RMSE, NLL, CRPS), it incurs a 25\(\times\) larger training time owing to (i) the very wide cosine network (\(2\ 000\) hidden units) needed to mimic the GP prior and (ii) the \(2\ 000\) Monte Carlo samples required to approximate its covariance matrix. All VIP variants are trained with \(\alpha=0\).
Year VIP VIP-GP (linear regression) VIP-GP (inducing points) SGP
RMSE \(10.27 \pm 0.01\) \(10.23 \pm 0.01\) \(9.28 \pm 0.01\) \(9.15 \pm 0.01\)
NLL \(3.74 \pm 0.00\) \(3.77 \pm 0.00\) \(3.64 \pm 0.00\) \(3.62 \pm 0.00\)
CRPS \(5.45 \pm 0.01\) \(5.45 \pm 0.02\) \(4.85 \pm 0.01\) \(4.83 \pm 0.01\)
CPU Time (s) \(1217 \pm 257\) \(1687\pm 271\) \(30867 \pm 326\) \(1874 \pm 265\)

Extra experiments in the smaller datasets Protein and Kin8nm are conducted to assess if using a GP prior in the context of DVIP generates results that are closer to those of a DGP. These are the regression datasets from the UCI repository where DGPs performed better than DVIP. In this case, inducing points approximation of the GP is not considered, as in the previous paragraph, but the linear regression approximation of VIP. Results include (i) DVIP using the initially proposed BNN prior with 2 layers of 10 hidden units and tanh activation functions; (ii) DVIP using the wide BNN that approximates the prior GP; (iii) DGPs using sparse GPs in each layer based on inducing points. The results obtained are shown in Tables 3.8 and Table 3.9. In both cases using a GP prior in DVIP often improves results and performs similarly to a DGP, especially for a large number of layers \(L\), even when there are differences in the approximation of the posterior distribution, i.e., DVIP uses a linear regression approximation to the GP and DGP uses inducing points. Again, using a wide BNN to approximate the prior GP results in a significant increment in the training time, making DVIP slower than DGP.

Table 3.8: Protein (UCI) — effect of replacing the BNN prior by an approximate GP prior in DVIP. Top block: results with the original two–layer BNN prior. Bottom block: results after substituting the BNN by a wide single–layer cosine network that approximates an RBF GP (“GP Prior”). For each prior, the reported methods include a vanilla VIP, DVIP models with \(L=2\text{--}5\) layers, sparse-GP and deep-GP baselines from Salimbeni & Deisenroth (2017). Using the GP prior narrows the gap between DVIP and DGP—especially for deeper models—yet increases training time by roughly a factor of two, underscoring the computational cost of wide networks and large prior sample counts.
BNN Prior VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5
RMSE \(4.76 \pm 0.01\) \(4.24 \pm 0.01\) \(4.14 \pm 0.01\) \(4.14 \pm 0.01\) \(\bm{4.09 \pm 0.01}\)
NLL \(2.98 \pm 0.00\) \(2.86 \pm 0.00\) \(2.84 \pm 0.00\) \(2.84 \pm 0.00\) \(\bm{2.83 \pm 0.00}\)
CRPS \(2.68 \pm 0.00\) \(2.34 \pm 0.00\) \(2.26 \pm 0.00\) \(2.25 \pm 0.00\) \(\bm{2.21 \pm 0.00}\)
CPU time (s) \(3086 \pm 173\) \(3981 \pm 182\) \(8604 \pm 774\) \(9931 \pm 616\) \(12568 \pm 327\)
GP Prior VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5
RMSE \(4.89\pm 0.01\) \(4.26\pm 0.01\) \(4.07\pm 0.01\) \(4.02\pm 0.01\) \(\bm{4.01\pm 0.01}\)
NLL \(3.00 \pm 0.00\) \(2.87 \pm 0.00\) \(2.82 \pm 0.00\) \(2.81 \pm 0.00\) \(\bm{2.81 \pm 0.00}\)
CRPS \(2.77 \pm 0.00\) \(2.35 \pm 0.00\) \(2.21 \pm 0.00\) \(2.17 \pm 0.00\) \(\bm{2.16 \pm 0.00}\)
CPU time (s) \(5880\pm 249\) \(12293\pm 564\) \(20351\pm 1110\) \(18514\pm 575\) \(25835\pm 1259\)
(Salimbeni & Deisenroth, 2017) SGP DGP 2 DGP 3 DGP 4 DGP 5
RMSE \(4.56 \pm 0.01\) \(4.17 \pm 0.01\) \(\bm{4.00 \pm 0.01}\) \(4.01 \pm 0.01\) \(4.02 \pm 0.01\)
NLL \(2.93\pm 0.00\) \(2.84 \pm 0.00\) \(\bm{2.79 \pm 0.00}\) \(\bm{2.79 \pm 0.00}\) \(2.80 \pm 0.00\)
CRPS \(2.56 \pm 0.00\) \(2.31 \pm 0.00\) \(\bm{2.19 \pm 0.00}\) \(\bm{2.19 \pm 0.00}\) \(2.20 \pm 0.00\)
CPU time (s) \(2690 \pm 114\) \(10031 \pm 129\) \(17528\pm 689\) \(16128\pm 190\) \(20653\pm 969\)
Table 3.9: Kin8nm (UCI) — DVIP with BNN versus GP priors. Replacing the BNN prior by a wide cosine network that approximates an RBF GP (GP Prior) systematically improves the predictive metrics (RMSE, NLL, CRPS) of DVIP and often matches or surpasses the performance of deep Gaussian processes with inducing points. The gain, however, comes at the expense of significantly longer training times—up to \(\sim\!2.3\times\) slower for \(L=5\) layers—because thousands of hidden units and prior samples are required to emulate the GP covariance structure. All VIP models are trained with \(\alpha=0\).
BNN Prior VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5
RMSE \(0.15 \pm 0.00\) \(\bm{0.07 \pm 0.00}\) \(\bm{0.07 \pm 0.00}\) \(\bm{0.07 \pm 0.00}\) \(\bm{0.07 \pm 0.00}\)
NLL \(-0.47 \pm 0.00\) \(-1.13 \pm 0.00\) \(\bm{-1.18 \pm 0.00}\) \(-1.16 \pm 0.00\) \(-1.17 \pm 0.00\)
CRPS \(0.08 \pm 0.00\) \(\bm{0.04 \pm 0.00}\) \(\bm{0.04 \pm 0.00}\) \(\bm{0.04 \pm 0.00}\) \(\bm{0.04 \pm 0.00}\)
CPU time (s) \(2109 \pm 57\) \(5086 \pm 232\) \(6927 \pm 27\) \(9459 \pm 77\) \(11763 \pm 141\)
GP Prior VIP DVIP 2 DVIP 3 DVIP 4 DVIP 5
RMSE \(0.14 \pm 0.00\) \(0.07 \pm 0.00\) \(\bm{0.06 \pm 0.00}\) \(\bm{0.06 \pm 0.00}\) \(\bm{0.06 \pm 0.00}\)
NLL \(-0.43\pm 0.00\) \(-1.20 \pm 0.00\) \(-1.25 \pm 0.00\) \(\bm{-1.26 \pm 0.00}\) \(- 1.25 \pm 0.00\)
CRPS \(0.08 \pm 0.00\) \(0.04 \pm 0.00\) \(\bm{0.03 \pm 0.00}\) \(\bm{0.03 \pm 0.00}\) \(\bm{0.03 \pm 0.00}\)
CPU time (s) \(6672 \pm 442\) \(14300 \pm 847\) \(17573 \pm 1033\) \(22561 \pm 980\) \(22669 \pm 657\)
(Salimbeni & Deisenroth, 2017) SGP DGP 2 DGP 3 DGP 4 DGP 5
RMSE \(0.09 \pm 0.00\) \(\bm{0.06 \pm 0.00}\) \(\bm{0.06 \pm 0.00}\) \(\bm{0.06 \pm 0.00}\) \(\bm{0.06 \pm 0.00}\)
NLL \(-0.91\pm 0.00\) \(-1.29 \pm 0.00\) \(-1.32 \pm 0.00\) \(\bm{-1.33 \pm 0.00}\) \(- 1.30 \pm 0.00\)
CRPS \(0.05 \pm 0.00\) \(\bm{0.03 \pm 0.00}\) \(\bm{0.03 \pm 0.00}\) \(\bm{0.03 \pm 0.00}\) \(\bm{0.03 \pm 0.00}\)
CPU time (s) \(2053 \pm 81\) \(6375 \pm 331\) \(11147 \pm 472\) \(17502 \pm 1060\) \(21846 \pm 1246\)

As a conclusion, DVIPs’ general and flexible definition allows the usage of (approximate) GP priors and inducing points. This enables performing (nearly) equally to a sparse GP or a DGP. However, the computational overhead of doing these approximations is prohibitive. Thus, if a GP prior is to be considered, it is a much better approach to just use the DGP framework, and leave DVIP to cases where flexible but easy-to-sample-from priors can be used.

Conclusions

Deep Variational Implicit Processes are defined as a model based on the concatenation of implicit processes (IPs) as the prior for the latent function. The conducted experiments demonstrate that DVIPs can be used on a variety of regression and classification problems with no need of hand-tuning. Results show that DVIPs surpass or match the performance of single layer VIPs and GPs. It also gives similar and sometimes better results than those of deep GPs (DGPs). However, the main comparative advantage of DVIPs is that they have less computational cost than DGPs. Experiments have also demonstrated that DVIPs are both effective and scalable on a wide range of tasks. DVIPs do not seem to over-fit on small datasets by increasing the depth, and on large datasets, extra layers often improve performance. It is also shown that increasing the number of layers is far more effective than increasing the complexity of the prior of single-layer VIPs, which heavily affects the training time of the model. Besides the added computation time, which is rather minor, no drawbacks to the use of DVIPs instead of single-layer VIPs are seen, but rather significant benefits.

The use of domain specific priors, e.g. considering CNNs in the first layer, has demonstrated to give outstanding results in image-based datasets compared to other GP methods. This establishes a new use of IPs with not-so-general prior functions. The employment of these priors on other domain specific tasks, such as forecasting or data encoding, is foreseen as an emerging field of study. The prior flexibility also results in a generalization of DGPs. As a matter of fact, DVIPs should give similar results to those of DGPs if a GP is considered as the IP prior for each unit.

Despite the good results, DVIPs present some limitations: first of all, the implicit prior works as a black box from the interpretability point of view. The prior parameters do not represent a clear property of the model in comparison to the original GP’s kernel parameters, such as the length and amplitude. Furthermore, even though using \(20\) samples from the prior has shown to give remarkable results in most cases, there might be situations where this number must be increased, having a big impact on the model’s training time.