Mathematical Proofs
Bayesian Inference, Gaussian Processes and Generalization Bounds
Proposition 2.9 Return to statement
The Kullback-Leibler divergence is always non-negative.
Proof. As the logarithm is bounded by \(x - 1\), \begin{equation} \log{x} \leq x - 1 \implies \frac{P(x)}{Q(x)} - 1 \geq \log{\frac{P(x)}{Q(x)}}. \end{equation} Since probabilities are non-negative, multiply by \(Q(x)\) in the last inequality \begin{equation} P(x) - Q(x) \geq Q(x) \log \frac{P(x)}{Q(x)} = Q(x) \log{P(x)} - Q(x) \log{Q(x)}. \end{equation} Now, integrating both sides \begin{equation} 0 \geq \mathbb{E}_Q[\log{P(x)} - \log{Q(x)}] \implies \mathbb{E}_Q[\log{Q(x)} - \log{P(x)}] \geq 0. \end{equation} Moreover, the Kullback-Leibler divergence is \(0\) if and only if the two distributions are equal almost everywhere. □
Proposition 2.10 Return to statement
The \(\alpha\)–divergence interpolates between the two asymmetric KLs: \begin{equation} \lim_{\alpha\to 0} D_\alpha(P|Q)=\mathrm{KL}(Q|P), \quad \text{and,} \quad \lim_{\alpha\to 1} D_\alpha(P|Q)=\mathrm{KL}(P|Q). \end{equation}
Proof. Write \(I(\alpha):=\int P(\bm \theta)^{\alpha} Q(\bm \theta)^{1-\alpha}\, \mathrm d \bm \theta\). The \(\alpha\)–divergence in Equation \(\eqref{eq:alpha_div}\) can be expressed as \begin{equation} D_{\alpha}(P|Q)=\frac{1-I(\alpha)}{\alpha(1-\alpha)}\,. \end{equation} Because \(\int Q(\bm \theta)\, \mathrm d \bm \theta=1\), it verifies that \(I(0)=1\). L’Hôpital’s rule therefore applies: \begin{equation} \lim_{\alpha\to 0}D_{\alpha}(P|Q) =\lim_{\alpha\to 0} \frac{-I'(\alpha)}{1-2\alpha}\,, \end{equation} where \(I'(\alpha)\) denotes the derivative of \(I\) with respect to \(\alpha\). Differentiate under the integral sign (permitted by the positivity of \(P,Q\)): \begin{equation} I'(\alpha) =\int P(\bm \theta)^{\alpha} Q(\bm \theta)^{1-\alpha} \log\frac{P(\bm \theta)}{Q(\bm \theta)}\, \mathrm d\bm \theta. \end{equation} Evaluating at \(\alpha=0\) yields \(I'(0)=\int Q(\bm \theta)\log\frac{P(\bm \theta)}{Q(\bm \theta)}\, \mathrm{d}\bm \theta\). Consequently \begin{equation} \lim_{\alpha\to 0}D_{\alpha}(P|Q) =-\int Q(\bm \theta)\log\frac{P(\bm\theta)}{Q(\bm \theta)}\, \mathrm d\bm \theta =\mathrm{KL}(Q|P). \end{equation} Using the symmetry \(D_{\alpha}(P|Q)=D_{1-\alpha}(Q|P)\), substitute \(\tilde\alpha=1-\alpha\) and apply the first part: \begin{equation} \lim_{\alpha\to 1}D_{\alpha}(P|Q) =\lim_{\tilde\alpha\to 0}D_{\tilde\alpha}(Q|P) =\mathrm{KL}(P|Q)\,. \end{equation} □
Theorem 2.19 Return to statement
Let \(K:\mathcal X\times\mathcal X\to\mathbb R\) be a positive-definite kernel, \(\mathcal H\) its RKHS, and \(\phi_x:=K(\,\cdot\,,x)\in\mathcal H\). Define the centred cylindrical Gaussian measure \begin{equation} P_{\textnormal{cyl}} \;=\; \operatorname{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal{H}}) \quad\text{on the cylindrical } \sigma\text{-algebra }\mathscr C(\mathcal H). \end{equation} For every finite index set \(X=\{x_{1},\dots,x_{m}\}\subset\mathcal X\), \begin{equation} (\delta_{x_1},\dots,\delta_{x_m})_{\#}P_{\textnormal{cyl}} \;=\; \mathcal N\bigl(0,\,K(X,X)\bigr)\,, \end{equation} so \(P_{\textnormal{cyl}}\) reproduces exactly the finite-dimensional marginals of the process \(f\sim\mathcal{GP}(0,K)\).
Proof. For each \(x\in\mathcal X\), the evaluation map \(\delta_x:\mathcal H\to\mathbb R\), \(\delta_x(h)=h(x)\), is a continuous linear functional on \(\mathcal H\) because, by the RKHS property with \(\phi_x:=K(\cdot,x)\), \begin{equation} \delta_x(h)=\langle h,\phi_x\rangle_{\mathcal H} \quad\text{and}\quad |\delta_x(h)|\le \|h\|_{\mathcal H}\,\|\phi_x\|_{\mathcal H} = \|h\|_{\mathcal H}\,\sqrt{K(x,x)}. \end{equation} Hence \(\delta_x=L_{\phi_x}\). For a finite set \(X=\{x_1,\dots,x_m\}\), by the push-forward definition and the previous display, \begin{equation} (\delta_{x_1},\dots,\delta_{x_m})_{\sharp}P_{\mathrm{cyl}} = \mathcal N(0,\,[\langle \phi_{x_j},\,\phi_{x_i}\rangle_{\mathcal H}]_{i,j=1}^m). \end{equation} The reproducing property yields \(\langle \phi_{x_j},\,\phi_{x_i}\rangle_{\mathcal H}=K(x_i,x_j)\), so \begin{equation} (\delta_{x_1},\dots,\delta_{x_m})_{\sharp}P_{\mathrm{cyl}} \;=\; \mathcal N\bigl(0,\,K(X,X)\bigr). \end{equation} Therefore \(P_{\mathrm{cyl}}\) reproduces exactly the finite-dimensional marginals of the Gaussian process \(f\sim\mathcal{GP}(0,K)\), as claimed. □
Theorem 2.20 Return to statement
Let \(\bigl(\mathcal H,\langle\cdot,\cdot\rangle_{\mathcal H}\bigr)\) be a separable infinite-dimensional Hilbert space. Then there exist:
a separable Banach space \((\mathcal B,\|\cdot\|_{\cal{B}})\) that contains \(\mathcal H\) as a dense subspace under \(\|\cdot\|_{\cal{B}}\),
a continuous, injective embedding \(i:\mathcal H\hookrightarrow \mathcal B\) such that the push-forward of the canonical cylindrical Gaussian on \(\mathcal H\) with covariance \(I_{\mathcal H}\), \(\operatorname{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H})\), under \(i\) extends uniquely to a centred Gaussian Radon probability measure \(\gamma = \mathcal N_{\mathcal B}(0,J)\) on \(\mathcal B\) with covariance operator \begin{equation} J=i\circ i^{\ast}:\mathcal B^{\ast} \longrightarrow \mathcal B, \end{equation} where \(i^{\ast}:\mathcal B^{\ast}\to\mathcal H\) is the adjoint operator of \(i\).
such that the triple \((\mathcal B,\gamma,\mathcal H)\) is an abstract Wiener space; that is, \(\mathcal H\) is the Cameron–Martin space of \(\gamma\).
Proof. Step 1: Construct a larger (Hilbert) Banach space and the embedding. Let \((e_k)_{k\ge 1}\) be an orthonormal basis of \(\mathcal H\). Choose weights \((\beta_k)_{k\ge 1}\) with \(\beta_k\ge 1\) and \begin{equation} \label{eq:aws-beta-summable-new} \sum_{k=1}^{\infty}\beta_k^{-2}\;<\;\infty, \quad \text{for example}, \ \beta_k = k\,. \end{equation} Define, for \(h\in\mathcal H\), \begin{equation} \label{eq:aws-B-norm-new} \|h\|_{\mathcal B}^{2} \;:=\;\sum_{k=1}^{\infty}\beta_k^{-2}\,|\langle h,e_k\rangle_{\mathcal H}|^{2}. \end{equation} Since \(\beta_k^{-2}\le 1\), there holds \(\|h\|_{\mathcal B}\le \|h\|_{\mathcal H}\) for all \(h\in\mathcal H\); hence, the identity map \begin{equation} \label{eq:aws-id-cont-new} \mathrm{id}:(\mathcal H,\|\cdot\|_{\mathcal H})\longrightarrow (\mathcal H,\|\cdot\|_{\mathcal B}) \end{equation} is continuous. Let \(\mathcal B\) be the completion of \(\mathcal H\) with respect to \(\|\cdot\|_{\mathcal B}\). The sesquilinear form \begin{equation} \label{eq:aws-B-inner-new} \langle h,g\rangle_{\mathcal B} \;:=\;\sum_{k=1}^{\infty}\beta_k^{-2}\,\langle h,e_k\rangle_{\mathcal H}\,\langle g,e_k\rangle_{\mathcal H}, \qquad h,g\in\mathcal H, \end{equation} extends by continuity to an inner product on \(\mathcal B\); therefore \(\mathcal B\) is a separable Hilbert space (in particular, a Banach space), and \(\mathcal H\) is dense in \(\mathcal B\). Let \begin{equation} \label{eq:aws-embed-new} i:\mathcal H\hookrightarrow\mathcal B \end{equation} be the canonical inclusion. From Equation \(\eqref{eq:aws-B-norm-new}\), \begin{equation} \label{eq:aws-HS-norm-new} \|i(e_k)\|_{\mathcal B}^{2}\;=\;\beta_k^{-2}, \qquad \sum_{k=1}^{\infty}\|i(e_k)\|_{\mathcal B}^{2}\;=\;\sum_{k=1}^{\infty}\beta_k^{-2}\;<\;\infty, \end{equation} so \(i\) is Hilbert–Schmidt. This will guarantee the square-summability of the Gaussian series below.
Step 2: Build a \(\mathcal B\)-valued Gaussian by an \(L^{2}\)-convergent series.
Let \((\xi_k)_{k\ge 1}\) be independent standard normal variables on some \((\Omega,\mathscr F,\mathbb P)\) and define \begin{equation} \label{eq:aws-XN-new} X_N\;:=\;\sum_{k=1}^{N}\xi_k\,i(e_k)\quad\in\mathcal B, \qquad N\ge 1. \end{equation} For integers \(M<N\), by the bilinearity of \(\langle\cdot,\cdot\rangle_{\mathcal B}\) and their independence, \begin{align} \label{eq:aws-Cauchy-1-new} \mathbb E\|X_N-X_M\|_{\mathcal B}^{2} &= \mathbb E\Big\langle \sum_{k=M+1}^{N}\xi_k\,i(e_k),\ \sum_{j=M+1}^{N}\xi_j\,i(e_j)\Big\rangle_{\mathcal B} \\ \label{eq:aws-Cauchy-2-new} &= \sum_{k=M+1}^{N}\sum_{j=M+1}^{N}\mathbb E[\xi_k\xi_j]\ \langle i(e_k),i(e_j)\rangle_{\mathcal B} \\ \label{eq:aws-Cauchy-3-new} &= \sum_{k=M+1}^{N}\|i(e_k)\|_{\mathcal B}^{2} \;=\; \sum_{k=M+1}^{N}\beta_k^{-2}. \end{align} Equation \(\eqref{eq:aws-Cauchy-2-new}\) utilises the linearity of expectation; Equation \(\eqref{eq:aws-Cauchy-3-new}\) uses \(\mathbb E[\xi_k\xi_j]=\delta_{kj}\). By Equation \(\eqref{eq:aws-beta-summable-new}\), the right-hand side tends to \(0\) as \(M,N\to\infty\). Thus \((X_N)\) is Cauchy in \(L^{2}(\Omega;\mathcal B)\); hence, there exists a \(\mathcal B\)-valued random element \(X\) with \begin{equation} \label{eq:aws-X-limit-new} X_N\xrightarrow[N\to\infty]{L^{2}(\Omega;\mathcal B)} X. \end{equation} Define the probability measure \begin{equation} \label{eq:aws-gamma-def-new} \gamma \;:=\; \mathbb P\circ X^{-1} \end{equation} on the Borel \(\sigma\)-algebra of \(\mathcal B\). Since \(X\) is an \(L^{2}\)-limit of finite Gaussian sums, \(\gamma\) is a centred Gaussian probability measure on \(\mathcal B\). To justify this, observe that for every \(\ell\in \cal{B}^{\ast}\) we have \begin{equation} \ell(X_N)=\sum_{k=1}^{N}\xi_{k}\,\ell(i(e_{k})), \end{equation} a finite linear combination of independent \(\mathcal N(0,1)\) variables; hence, a centred real Gaussian. Since \(\ell\) is continuous, \begin{equation} \mathbb E\left|\ell(X_N)-\ell(X)\right|^2 \le \|\ell\|^2\,\mathbb E\|X_N-X\|^2 \xrightarrow[N\to\infty]{} 0, \end{equation} so \(\ell(X_N)\to\ell(X)\) in \(L^2(\Omega)\). Moreover, with \(a_k:=\ell(i(e_k))\) we have \begin{equation} \sum_{k\ge1} a_k^2 =\sum_{k\ge1} |\ell(i(e_k))|^2 \le \|\ell\|^2\sum_{k\ge1}\|i(e_k)\|_{\mathcal B}^2 =\|\ell\|^2\sum_{k\ge1}\beta_k^{-2}<\infty, \end{equation} so \(\ell(X)=\sum_{k\ge1} a_k\xi_k\) in \(L^2\), hence \(\ell(X)\sim\mathcal N(0,\sum a_k^2)\). By the standard characterization (a \(\cal B\)–valued law is Gaussian if and only if all its continuous linear functionals are real Gaussian), the push-forward \(\gamma=\mathbb P\circ X^{-1}\) is a centred Gaussian probability measure on \(\cal{B}\).
Step 3: Identify the covariance operator as \(J=i\,i^{\ast}:\mathcal B\to\mathcal B\).
Let \(i^{\ast}:\mathcal B\to\mathcal H\) denote the Hilbert adjoint of \(i\), that is, \begin{equation} \label{eq:aws-adjoint-def-new} \langle i (h),\, v\rangle_{\mathcal B} \;=\; \langle h,\, i^{\ast}(v)\rangle_{\mathcal H} \qquad\text{for all }h\in\mathcal H,\ v\in\mathcal B. \end{equation} Fix \(v,w\in\mathcal B\). Using Equation \(\eqref{eq:aws-X-limit-new}\) and the Cauchy–Schwarz inequality in \(L^{2}\), the mapping \(Y\mapsto \mathbb E[\langle Y,v\rangle_{\mathcal B}\langle Y,w\rangle_{\mathcal B}]\) is continuous for \(L^{2}\)-convergence; hence, \begin{equation} \label{eq:aws-cov-limit-pass-new} \mathbb E\bigl[\langle X,v\rangle_{\mathcal B}\,\langle X,w\rangle_{\mathcal B}\bigr] \;=\; \lim_{N\to\infty}\mathbb E\bigl[\langle X_N,v\rangle_{\mathcal B}\,\langle X_N,w\rangle_{\mathcal B}\bigr]. \end{equation} Expanding \(X_N\) from Equation \(\eqref{eq:aws-XN-new}\) and using bilinearity, \begin{align} \label{eq:aws-cov-expand-new} \mathbb E\bigl[\langle X_N,v\rangle_{\mathcal B}\,\langle X_N,w\rangle_{\mathcal B}\bigr] &= \mathbb E\Bigl[\Bigl\langle \sum_{k=1}^{N}\xi_k i(e_k),\, v\Bigr\rangle_{\mathcal B} \Bigl\langle \sum_{j=1}^{N}\xi_j i(e_j),\, w\Bigr\rangle_{\mathcal B}\Bigr] \\ \label{eq:aws-cov-double-new} &= \sum_{k=1}^{N}\sum_{j=1}^{N} \mathbb E[\xi_k\xi_j]\, \langle i(e_k),v\rangle_{\mathcal B}\,\langle i(e_j),w\rangle_{\mathcal B} \\ \label{eq:aws-cov-collapse-new} &= \sum_{k=1}^{N} \langle i(e_k),v\rangle_{\mathcal B}\,\langle i(e_k),w\rangle_{\mathcal B}. \end{align} Equation \(\eqref{eq:aws-cov-double-new}\) uses the linearity of expectation; Equation \(\eqref{eq:aws-cov-collapse-new}\) uses \(\mathbb E[\xi_k\xi_j]=\delta_{kj}\). Now use the definition of the adjoint, Equation \(\eqref{eq:aws-adjoint-def-new}\), to rewrite \begin{equation} \label{eq:aws-adjoint-step-new} \langle i(e_k),v\rangle_{\mathcal B}\;=\;\langle e_k,\,i^{\ast}(v)\rangle_{\mathcal H}. \end{equation} Substituting Equation \(\eqref{eq:aws-adjoint-step-new}\) into Equation \(\eqref{eq:aws-cov-collapse-new}\) and letting \(N\to\infty\), the Parseval identity in \(\mathcal H\) yields \begin{align} \label{eq:aws-parseval-new} \mathbb E\bigl[\langle X,v\rangle_{\mathcal B}\,\langle X,w\rangle_{\mathcal B}\bigr] &= \sum_{k=1}^{\infty} \langle e_k,\,i^{\ast}(v)\rangle_{\mathcal H}\,\langle e_k,\,i^{\ast}(w)\rangle_{\mathcal H} \\ \label{eq:aws-cov-J-new} &= \langle i^{\ast}(v),\,i^{\ast}(w)\rangle_{\mathcal H} \;=\; \langle i\,(i^{\ast}(v)),\, w\rangle_{\mathcal B}. \end{align} Thus, the covariance operator of \(\gamma\) is \begin{equation} \label{eq:aws-J-def-new} J\;=\;i\circ i^{\ast}:\mathcal B\longrightarrow\mathcal B. \end{equation}
Step 4: \(J\) is trace class; explicit diagonalisation.
Define \begin{equation} \label{eq:aws-gk-def-new} g_k\;:=\;\beta_k\,i(e_k)\in\mathcal B. \end{equation} From Equation \(\eqref{eq:aws-B-inner-new}\) and Equation \(\eqref{eq:aws-HS-norm-new}\), \begin{equation} \label{eq:aws-gk-onb-new} \langle g_k,g_j\rangle_{\mathcal B} \;=\;\beta_k\beta_j\,\langle i(e_k),i(e_j)\rangle_{\mathcal B} \;=\;\beta_k\beta_j\,\beta_k^{-2}\delta_{kj} \;=\;\delta_{kj}, \end{equation} so \((g_k)_{k\ge 1}\) is an orthonormal basis of \(\mathcal B\). Moreover, for \(e_k\in\mathcal H\), \begin{equation} \label{eq:aws-iast-i-onH-new} \langle (i^{\ast}\circ i)e_k,\,e_j\rangle_{\mathcal H} \;=\;\langle i(e_k),\,i(e_j)\rangle_{\mathcal B} \;=\;\beta_k^{-2}\delta_{kj}, \end{equation} hence \((i^{\ast}\circ i)e_k=\beta_k^{-2}e_k\). Using Equation \(\eqref{eq:aws-J-def-new}\), \begin{align} \label{eq:aws-J-on-gk-new} J (g_k) &= i\circ i^{\ast}\bigl(\beta_k i(e_k)\bigr) \;=\;\beta_k\, i\bigl((i^{\ast}\circ i)e_k\bigr) \;=\;\beta_k\, i(\beta_k^{-2}e_k) \;=\;\beta_k^{-2}\,g_k. \end{align} Therefore \(J\) is diagonal in the orthonormal basis \((g_k)\) with eigenvalues \((\beta_k^{-2})\), and \begin{equation} \label{eq:aws-traceJ-new} \operatorname{tr}J \;=\;\sum_{k=1}^{\infty}\langle J (g_k),\,g_k\rangle_{\mathcal B} \;=\;\sum_{k=1}^{\infty}\beta_k^{-2} \;<\;\infty, \end{equation} so \(J\) is trace class. In particular, \(\gamma=\mathcal N_{\mathcal B}(0,J)\) is a centred Gaussian probability measure on the separable Hilbert space \(\mathcal B\), hence a Radon measure (every Borel probability on a separable Hilbert space is Radon).
Step 5: The push-forward of the cylindrical Gaussian equals \(\gamma\).
Let \(\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H})\) denote the standard cylindrical Gaussian pre-measure on \(\mathcal H\). Push it forward to \(\mathcal B\) by \(i:\mathcal H\to\mathcal B\) and set \begin{equation} \mu\;:=\;i_{\sharp}\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H}). \end{equation} a cylindrical pre-measure on \(\mathcal B\) defined on the cylindrical \(\sigma\)-algebra generated by \(\mathcal B^{\ast}\). Because \(\mathcal B\) is a Hilbert space, the Riesz map \(R_{\mathcal B}:\mathcal B\to\mathcal B^{\ast}\), \(R_{\mathcal B}(v)=\langle \,\cdot\,,v\rangle_{\mathcal B}\), is an isometric isomorphism. For each \(\ell\in\mathcal B^{\ast}\) let \(v_\ell:=R_{\mathcal B}^{-1}(\ell)\in\mathcal B\). Then for \(x\in\mathcal H\), \begin{equation} (\ell\circ i)(x)=\ell(i(x))=\langle i(x),v_\ell\rangle_{\mathcal B} =\langle x,\,i^{\ast}(v_\ell)\rangle_{\mathcal H}. \end{equation} Thus \(L_\ell:=\ell\circ i\in\mathcal H^{\ast}\) is represented by \(h_\ell:=i^{\ast}(v_\ell)\in\mathcal H\). Fix \(\ell_1,\dots,\ell_m\in\mathcal B^{\ast}\). By definition of \(\mu\) and of the cylindrical Gaussian on \(\mathcal H\), \begin{equation} \label{eq:aws-step5-mu-marg} (\ell_1,\dots,\ell_m)_{\sharp}\mu \;=\;(L_{\ell_1},\dots,L_{\ell_m})_{\sharp}\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H}) \;=\;\mathcal N\bigl(0,\,[\,\langle h_{\ell_p},h_{\ell_q}\rangle_{\mathcal H}\,]_{p,q}\bigr). \end{equation} By Step 2, for every \(\ell\in\mathcal B^{\ast}\), \begin{equation} \ell(X)=\sum_{k\ge1}\xi_k\,\ell(i(e_k))\quad\text{in }L^2(\Omega), \qquad \sum_{k\ge1}|\ell(i(e_k))|^2 \le \|\ell\|^2\sum_{k\ge1}\beta_k^{-2}<\infty, \end{equation} so \(\ell(X)\) is centred Gaussian with variance \(\sum_{k\ge1}\ell(i(e_k))^2\). For \(\ell_p,\ell_q\in\mathcal B^{\ast}\), \begin{equation} \mathbb E[\ell_p(X)\,\ell_q(X)] \;=\;\sum_{k\ge1}\ell_p(i(e_k))\,\ell_q(i(e_k)) \;=\;\sum_{k\ge1}\langle e_k,\,h_{\ell_p}\rangle_{\mathcal H}\,\langle e_k,\,h_{\ell_q}\rangle_{\mathcal H} \;=\;\langle h_{\ell_p},h_{\ell_q}\rangle_{\mathcal H}, \end{equation} where we used \(\ell(i(e_k))=\langle i(e_k),v_\ell\rangle_{\mathcal B}=\langle e_k,i^{\ast}(v_\ell)\rangle_{\mathcal H}\) and Parseval in \(\mathcal H\). Hence \begin{equation} \label{eq:aws-step5-gamma-marg} (\ell_1,\dots,\ell_m)_{\sharp}\gamma=\mathcal N\bigl(0,\,[\,\langle h_{\ell_p},h_{\ell_q}\rangle_{\mathcal H}\,]_{p,q}\bigr). \end{equation} Comparing Equations \(\eqref{eq:aws-step5-mu-marg}\) and \(\eqref{eq:aws-step5-gamma-marg}\) shows that \(\mu\) and \(\gamma\) agree on all finite-dimensional distributions determined by \(\mathcal B^{\ast}\). Let \(\mathcal C(\mathcal B^{\ast})\) be the cylindrical \(\sigma\)-algebra generated by \(\{\ell:\mathcal B\to\mathbb R:\ell\in\mathcal B^{\ast}\}\). The previous paragraph implies \(\mu=\gamma\) on \(\mathcal C(\mathcal B^{\ast})\). Since \(\mathcal B\) is separable, \(\mathcal C(\mathcal B^{\ast})=\mathscr B(\mathcal B)\) (Bogachev, 1998; Kuo, 2006); hence, \(\mu=\gamma\) on \(\mathscr B(\mathcal B)\).
Step 6: Identify the Cameron–Martin space.
Let \(J^{1/2}\) be the unique positive square root of \(J=i\,i^{\ast}\). The Cameron–Martin space \(\mathcal H_{\gamma}\) is \(\mathrm{Ran}\,J^{1/2}\) with the inner product \begin{equation} \label{eq:aws-CM-inner-new} \langle u,v\rangle_{\mathcal H_{\gamma}} \;:=\;\langle J^{-1/2}(u),\,J^{-1/2}(v)\rangle_{\mathcal B}. \end{equation} Using Equation \(\eqref{eq:aws-J-on-gk-new}\), \(J^{1/2}(g_k)=\beta_k^{-1}g_k\). For \(h=\sum_{k}a_k e_k\in\mathcal H\), \begin{equation} \label{eq:aws-i-exp-new} i(h)\;=\;\sum_{k=1}^{\infty}a_k\,i(e_k) \;=\;\sum_{k=1}^{\infty}a_k\,\beta_k^{-1}g_k, \qquad J^{-1/2}i(h)\;=\;\sum_{k=1}^{\infty}a_k\,g_k. \end{equation} Hence, for \(h,g\in\mathcal H\), \begin{equation} \label{eq:aws-CM-iso-new} \langle i(h),\,i(g)\rangle_{\mathcal H_{\gamma}} \;=\;\Bigl\langle \sum_{k}a_k g_k,\ \sum_{k}b_k g_k\Bigr\rangle_{\mathcal B} \;=\;\sum_{k=1}^{\infty}a_k b_k \;=\;\langle h,g\rangle_{\mathcal H}. \end{equation} Thus \(i:\mathcal H\to\mathcal H_{\gamma}\) is an isometric isomorphism onto its range and \begin{equation} \label{eq:aws-CM-range-new} \mathrm{Ran}\,J^{1/2}\;=\;i(\mathcal H). \end{equation} Therefore \(\mathcal H\) is the Cameron–Martin space of \(\gamma\).
The construction provides a separable Banach space \(\mathcal B\) containing \(\mathcal H\) densely, a continuous injective map \(i:\mathcal H\hookrightarrow\mathcal B\) such that \(i_{\sharp}\mathrm{Gauss}_{\mathrm{cyl}}(0,I_{\mathcal H})\) extends to the centred Gaussian Radon measure \(\gamma=\mathcal N_{\mathcal B}(0,J)\) with covariance \(J=i\,i^{\ast}\), and the Cameron–Martin space of \(\gamma\) is \(\mathcal H\). This proves the theorem. □
Theorem 2.21 Return to statement
Let \(f \sim \mathcal{GP}\bigl(m,K\bigr)\), with \(m:\mathcal X\to\mathbb R\) and \(K:\mathcal X\times\mathcal X\to\mathbb R\). Let \(m\in\mathcal H\), where \(\mathcal H\) is the RKHS generated by \(K\) with sections \(\phi_x:=K(\,\cdot\,,x)\). Let \((\mathcal B,\gamma,\mathcal H)\) be the abstract Wiener space constructed in Theorem 2.20; write \(i:\mathcal H\hookrightarrow\mathcal B\) for the embedding and let \(J=i\circ i^{\ast}\) be its covariance operator so that \(\gamma=\mathcal N_{\mathcal B}(0,J)\). Let \(Z\sim\gamma\) and define \(F := i(m) + Z\). Then:
\(F\) is a \(\mathcal B\)-valued Gaussian random element with law \begin{equation} F \sim \mathcal N_{\mathcal B}\bigl(i(m),\;J\bigr). \end{equation}
Let \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) denote the Paley–Wiener map associated with \((\mathcal B,\gamma,\mathcal H)\), i.e. for \(h\in\mathcal H\), \(\mathcal W(h)\) is the \(\gamma\)-measurable linear functional on \(\mathcal B\) satisfying \(\mathcal W(h)\bigl(i (g)\bigr)=\langle h,g\rangle_{\mathcal H}\) for all \(g\in\mathcal H\). Then, for every finite set \(X=\{x_{1},\dots,x_{n}\}\subset\mathcal X\), \begin{equation} \bigl(\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n})\bigr)_{\sharp} \mathcal N_{\mathcal B}\bigl(i(m),\;J\bigr) \;=\; \mathcal N\bigl(m(X),\,K(X,X)\bigr). \end{equation} Equivalently, \(\bigl(\mathcal W(\phi_{x_1})(F),\dots,\mathcal W(\phi_{x_n})(F)\bigr) \sim \mathcal N\bigl(m(X),\,K(X,X)\bigr)\).
Remark. If, in addition, the point evaluations \(\delta_x\) are continuous on \(\mathcal B\) (i.e. \(\delta_x\in\mathcal B^{\ast}\) for all \(x\in\mathcal X\)), then \(\mathcal W(\phi_x)=\delta_x\) \(\gamma\)-a.s., and the display in (2) coincides with the push-forward by \((\delta_{x_1},\dots,\delta_{x_n})\).
Proof. Let \((\mathcal B,\gamma,\mathcal H)\) be the abstract Wiener space produced by Theorem 2.20. Thus \(\mathcal B\) is a separable Hilbert space, \(i:\mathcal H\hookrightarrow\mathcal B\) is continuous and injective, and \(\gamma=\mathcal N_{\mathcal B}(0,J)\) is a centred Gaussian Radon measure with covariance \(J\).
Step 1: \(F=i(m)+Z\) is Gaussian with mean \(i(m)\) and covariance \(J\).
Let \(Z\sim\gamma=\mathcal N_{\mathcal B}(0,J)\) and set \begin{equation} \label{eq:def-F} F \;:=\; i(m)+Z . \end{equation} For any \(v\in\mathcal B\), \(\langle F,v\rangle_{\mathcal B} = \langle i(m),v\rangle_{\mathcal B} + \langle Z,v\rangle_{\mathcal B}\) is the sum of a constant and a centred real Gaussian; hence \(\langle F,v\rangle_{\mathcal B}\) is real Gaussian with mean \(\langle i(m),v\rangle_{\mathcal B}\) and variance \(\langle Jv,v\rangle_{\mathcal B}\). Since this holds simultaneously for all finite families \(v_1,\dots,v_n\) by linearity, \(F\) is a \(\mathcal B\)–valued Gaussian random element with \begin{equation} \label{eq:law-F} F \sim \mathcal N_{\mathcal B}\bigl(i(m),\,J\bigr) \,, \end{equation} which proves item (1).
Step 2: Paley–Wiener map and its basic identities.
Write \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) for the Paley–Wiener map associated with \((\mathcal B,\gamma,\mathcal H)\). By definition, for every \(h,g\in\mathcal H\), \begin{equation} \label{eq:PW-reproducing} \mathcal W(h)\bigl(i(g)\bigr) \;=\; \langle h,\,g\rangle_{\mathcal H} . \end{equation} Moreover, \(\mathcal W\) is linear and isometric into \(L^{2}(\gamma)\): \begin{equation} \label{eq:PW-isometry} \mathbb E_{\gamma}\bigl[\mathcal W(h)\,\mathcal W(g)\bigr] \;=\; \langle h,\,g\rangle_{\mathcal H}, \qquad \mathbb E_{\gamma}\bigl[\mathcal W(h)\bigr]=0 . \end{equation} Since \(F\) has the translated law \(T_{i(m)\,\sharp}\gamma\), with \(T_{i(m)}(x) := x+i(m)\), and translations by \(i(m)\) with \(m\in\mathcal H\) being quasi-invariant (Cameron–Martin Theorem), \(\mathcal W(h)\) is defined \(F\)-a.s. as well. This means that any set where \(W(h)\) fails to be defined has \(\gamma\)-measure zero, and by quasi-invariance, it also has zero measure under the translated law of \(F\). Therefore, \(W(h)(F)\) exists with probability one.
Step 3: Mean and covariance of \(\mathcal W(h)(F)\).
By linearity of \(\mathcal W(h)\) and Equation \(\eqref{eq:def-F}\), \begin{equation} \label{eq:W-on-F-split} \mathcal W(h)(F) \;=\; \mathcal W(h)\bigl(i(m)\bigr) \;+\; \mathcal W(h)(Z). \end{equation} Using Equation \(\eqref{eq:PW-reproducing}\) with \(g=m\), \begin{equation} \label{eq:mean-WF} \mathbb E\bigl[\mathcal W(h)(F)\bigr] \;=\; \langle h,\,m\rangle_{\mathcal H}. \end{equation} For \(h,g\in\mathcal H\), Equation \(\eqref{eq:PW-isometry}\) and the fact that \(Z\) is centred yield \begin{equation} \label{eq:cov-WF} \operatorname{Cov}\!\big(\mathcal W(h)(F),\,\mathcal W(g)(F)\big) \;=\; \operatorname{Cov}\!\big(\mathcal W(h)(Z),\,\mathcal W(g)(Z)\big) \;=\; \langle h,\,g\rangle_{\mathcal H}. \end{equation}
Step 4: Finite-dimensional laws for the sections \(h=\phi_{x}\).
Let \(X=\{x_1,\dots,x_n\}\subset\mathcal X\) and set \(h_j:=\phi_{x_j}=K(\cdot,x_j)\in\mathcal H\). By the reproducing property of the RKHS, \begin{equation} \label{eq:RKHS-reprod} \langle \phi_{x_i},\,m\rangle_{\mathcal H} \;=\; m(x_i), \qquad \langle \phi_{x_i},\,\phi_{x_j}\rangle_{\mathcal H} \;=\; K(x_i,x_j). \end{equation} Combining Equations \(\eqref{eq:mean-WF}\), \(\eqref{eq:cov-WF}\), and \(\eqref{eq:RKHS-reprod}\) gives \begin{equation} \label{eq:mean-cov-vector} \mathbb E\!\begin{bmatrix} \mathcal W(\phi_{x_1})(F) \\[-2pt] \vdots \\[-2pt] \mathcal W(\phi_{x_n})(F) \end{bmatrix} \;=\; \begin{bmatrix} m(x_1) \\[-2pt] \vdots \\[-2pt] m(x_n) \end{bmatrix}, \qquad \operatorname{Cov}\!\begin{bmatrix} \mathcal W(\phi_{x_1})(F) \\[-2pt] \vdots \\[-2pt] \mathcal W(\phi_{x_n})(F) \end{bmatrix} \;=\; K(X,X). \end{equation} Finally, since \(F\) is Gaussian in \(\mathcal B\) and each \(\mathcal W(\phi_{x_j})\) is linear, the vector \[\bigl(\mathcal W(\phi_{x_1})(F),\dots,\mathcal W(\phi_{x_n})(F)\bigr)\] is multivariate normal with mean and covariance as in Equation \(\eqref{eq:mean-cov-vector}\). Hence \begin{equation} \label{eq:pushforward-PW} \bigl(\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n})\bigr)_{\sharp} \mathcal N_{\mathcal B}\bigl(i(m),J\bigr) \;=\; \mathcal N\bigl(m(X),\,K(X,X)\bigr), \end{equation} which proves item (2).
Remark on evaluations.
If, in addition, every point-evaluation \(\delta_x\) is continuous on \(\mathcal B\) (equivalently \(\delta_x\in\mathcal B^{\ast}\)), then the Paley–Wiener functional coincides \(\gamma\)-a.s. with evaluation: \begin{equation} \label{eq:PW-equals-eval} \mathcal W(\phi_x) \;=\; \delta_x \quad \gamma\text{-a.s.} \end{equation} In that case, Equation \(\eqref{eq:pushforward-PW}\) is the stated identity with \((\delta_{x_1},\dots,\delta_{x_n})\). □
Theorem 2.22 Return to statement
Let \(K\) be a positive-definite kernel with RKHS \(\mathcal H\) (sections \(\phi_x:=K(\cdot,x)\)). Let \((\mathcal B,\gamma,\mathcal H)\) be an abstract Wiener space with continuous embedding \(i:\mathcal H\hookrightarrow\mathcal B\). Fix \(m\in\mathcal H\) and \(\Sigma\in\mathcal L^{+}(\mathcal H)\).
\(P := \mathcal N_{\mathcal B}\bigl(i(m),\; i\Sigma i^{\ast}\bigr)\) is a well-defined (Radon) Gaussian measure on \(\mathcal B\) with mean \(i(m)\) and covariance operator \(C=i\Sigma i^{\ast}\).
Let \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) denote the Paley–Wiener map associated with \((\mathcal B,\gamma,\mathcal H)\), i.e. for \(h\in\mathcal H\), \(\mathcal W(h)\) is the \(\gamma\)-measurable linear functional on \(\mathcal B\) satisfying \(\mathcal W(h)\bigl(i (g)\bigr)=\langle h,g\rangle_{\mathcal H}\) for all \(g\in\mathcal H\). Then, for every finite set \(X=\{x_{1},\dots,x_{n}\}\subset\mathcal X\), \begin{equation} \bigl(\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n})\bigr)_{\sharp} P \;=\; \mathcal N\bigl(m(X),\,K_\Sigma(X,X)\bigr). \end{equation}
Remark. If, in addition, the point evaluations \(\delta_x\) are continuous on \(\mathcal B\) (i.e. \(\delta_x\in\mathcal B^{\ast}\) for all \(x\in\mathcal X\)), then \(\mathcal W(\phi_x)=\delta_x\) \(\gamma\)-a.s., and the display in (2) coincides with the push-forward by \((\delta_{x_1},\dots,\delta_{x_n})\).
Proof. Let \((\mathcal B,\gamma,\mathcal H)\) be an abstract Wiener space with continuous injection \(i:\mathcal H\hookrightarrow\mathcal B\). Fix \(m\in\mathcal H\) and a bounded, self-adjoint, positive operator \(\Sigma\in\mathcal L^{+}(\mathcal H)\). Set \begin{equation} C \;:=\; i\,\Sigma\,i^{\ast}\,:\ \mathcal B^{\ast}\longrightarrow \mathcal B . \end{equation}
(1) Existence and form of the Gaussian measure. Because \(i\) is the abstract Wiener space embedding, it is \(\gamma\)-radonifying (Bogachev, 1998; Hytönen et al., 2018); hence \(i\Sigma^{1/2}\) is also \(\gamma\)-radonifying (Hytönen et al., 2018). Let \(\xi\) be an isonormal Gaussian process over \(\mathcal H\) (i.e. a centred Gaussian family with \(\mathbb E\langle h,\xi\rangle=0\) and \(\mathrm{Cov}(\langle h,\xi\rangle,\langle g,\xi\rangle)=\langle h,g\rangle_{\mathcal H}\)). Define the \(\mathcal B\)-valued random elements \begin{equation} Y \;:=\; (i\Sigma^{1/2})\,\xi, \qquad F \;:=\; i(m) + Y . \end{equation} Standard properties of radonifying operators (Bogachev, 1998; Hytönen et al., 2018) imply that \(Y\) is a centred Radon Gaussian in \(\mathcal B\) with covariance operator \(C=i\Sigma i^{\ast}\), so \(F\) is Gaussian Radon with mean \(i(m)\) and covariance \(C\). Equivalently, \(F\sim \mathcal N_{\mathcal B}\bigl(i(m),\,i\Sigma i^{\ast}\bigr)=:P\).
(2) GP marginals under Paley–Wiener evaluations. Let \(\mathcal W:\mathcal H\to L^{2}(\gamma)\) be the Paley–Wiener map of the AWS; for \(h\in\mathcal H\), \(\mathcal W(h)\) is a (measurable) linear functional on \(\mathcal B\) satisfying \(\mathcal W(h)\bigl(i(g)\bigr)=\langle h,g\rangle_{\mathcal H}\) for all \(g\in\mathcal H\). (In particular, these linear functionals admit representatives that are defined \(P\)-a.s.) For the RKHS sections \(\phi_x:=K(\cdot,x)\), the reproducing property gives \begin{equation} \mathbb E\big[\mathcal W(\phi_x)(F)\big] = \mathcal W(\phi_x)\bigl(i(m)\bigr) = \langle \phi_x,m\rangle_{\mathcal H} = m(x). \end{equation} For the covariance, the shift \(i(m)\) drops out and we compute with \(Y\): \begin{equation} \mathcal W(\phi_x)(Y) = \mathcal W(\phi_x)\!\left(i\!\left(\Sigma^{1/2}\xi\right)\right) = \big\langle \phi_x,\;\Sigma^{1/2}\xi\big\rangle_{\mathcal H} = \big\langle \Sigma^{1/2}\phi_x,\;\xi\big\rangle, \end{equation} so \begin{equation} \mathrm{Cov}\!\left(\mathcal W(\phi_x)(F),\mathcal W(\phi_{x'})(F)\right) = \Big\langle \Sigma^{1/2}\phi_x,\;\Sigma^{1/2}\phi_{x'}\Big\rangle_{\mathcal H} = \big\langle \phi_x,\;\Sigma\phi_{x'}\big\rangle_{\mathcal H} =: K_{\Sigma}(x,x'). \end{equation} Therefore, for any finite \(X=\{x_1,\dots,x_n\}\subset\mathcal X\), \begin{equation} \bigl(\mathcal W(\phi_{x_1})(F),\dots,\mathcal W(\phi_{x_n})(F)\bigr) \sim \mathcal N\bigl(m(X),\,K_{\Sigma}(X,X)\bigr), \end{equation} which is exactly the GP law with kernel \(K_{\Sigma}\).
Remark (point evaluations). If, in addition, each \(\delta_x\in\mathcal B^{\ast}\), then for all \(g\in\mathcal H\), \begin{equation} \delta_x\bigl(i(g)\bigr)=g(x)=\langle \phi_x,g\rangle_{\mathcal H} =\mathcal W(\phi_x)\bigl(i(g)\bigr), \end{equation} so \(\delta_x\) and \(\mathcal W(\phi_x)\) agree on the dense subspace \(i(\mathcal H)\) and hence coincide \(\gamma\)-a.s. (and thus \(P\)-a.s.). The push-forward statement can then be written with \((\delta_{x_1},\dots,\delta_{x_n})\) in place of \((\mathcal W(\phi_{x_1}),\dots,\mathcal W(\phi_{x_n}))\). □
Theorem 2.23 Return to statement
Let \(\bm \Theta\) be a finite hypothesis class with cardinality \(M\) and let \(\ell\) be a loss bounded in \([0,C]\). Then for any \(\delta\in(0,1)\), with probability at least \(1-\delta\) over the random draw of the sample \(D\sim\nu^{n}\), the following simultaneous concentration inequality holds: \begin{equation} \forall\bm{\theta}\in\bm\Theta:\qquad L(\bm{\theta})\;\le\; \hat L(D,\bm{\theta}) +C\sqrt{\frac{\log(M/\delta)}{2n}}. \end{equation}
Proof. Since \(\ell\) takes values in \([0,C]\), the random variables \(X_{i}:=\ell\bigl(f_{\bm{\theta}}(\mathbf x_{i}),y_{i}\bigr)-L(\bm{\theta})\) are independent, centered, and bounded by \(C\). Hoeffding’s inequality (Hoeffding, 1963) therefore gives \begin{equation} \mathbb{P} \left(L(\bm{\theta})-\hat L(D,\bm{\theta})>\varepsilon \right) \leq \exp\left(-\frac{2n\varepsilon^{2}}{C^{2}}\right). \end{equation} Equivalently, with probability at least \(1-\delta_{\bm{\theta}}\), \begin{equation} L(\bm{\theta}) \;\le\; \hat L(D,\bm{\theta}) +C\sqrt{\frac{\log(1/\delta_{\bm{\theta}})}{2n}}. \label{eq:pointwise} \end{equation} Setting \(\delta_{\bm{\theta}} := \delta/M\) in Equation \(\eqref{eq:pointwise}\) and applying the union bound yields \begin{equation} \mathbb{P}\left(\exists\bm{\theta}\in\bm\Theta: L(\bm{\theta})-\hat L(D,\bm{\theta}) >C\sqrt{\frac{\log(M/\delta)}{2n}}\right) \leq \sum_{\bm{\theta}\in \bm \Theta}\frac{\delta}{M} =\delta\,. \end{equation} Taking the complement proves the first equation of the theorem. □
Theorem 2.24 Return to statement
Let the loss satisfy \(0\le\ell(\cdot, \cdot)\le C\) and fix \(\delta\in(0,1)\). For every \(\lambda>0\) the following event holds with probability at least \(1-\delta\) over the draw of \(D\sim\nu^{n}\): \begin{equation} \forall\rho\in\mathcal P(\Theta):\quad \mathbb{E}_{\bm\theta\sim\rho}[ L(\bm\theta)] \;\le\; \mathbb{E}_{\bm\theta\sim\rho}[ \hat L(D,\bm\theta)] +\frac{\lambda C^{2}}{8n} +\frac{\mathrm{KL}(\rho|\pi)+\log(1/\delta)}{\lambda}. \end{equation}
Proof. Fix an arbitrary \(\lambda>0\). For each \(\bm{\theta}\in\Theta\) define the centered random variable \begin{equation} Z_{\bm{\theta}}(D) := \exp\left\{ \lambda\left(L(\bm{\theta})-\hat L(D,\bm{\theta})\right) -\frac{\lambda^{2}C^{2}}{8n} \right\}. \end{equation} Because \(L(\bm{\theta})\) is deterministic with respect to \(D\), it verifies that \begin{equation} L(\bm{\theta})-\hat L(D,\bm{\theta}) = \frac1n\sum_{i=1}^n \left[ \mathbb{E}_{(\mathbf x,y)\sim\nu}\, \ell\left(f_{\bm{\theta}}(\mathbf x),y\right) - \ell\left(f_{\bm{\theta}}(\mathbf x_i),y_i\right) \right]\,. \end{equation} Each summand is independent, has mean 0, and lies in \([-C,C]\). Hoeffding’s lemma therefore yields \begin{equation} \mathbb{E}_{D} \left[ \exp\left( \lambda \left(L(\bm{\theta})-\hat L(D,\bm{\theta})\right) \right) \right] \le \exp\left(\tfrac{\lambda^{2}C^{2}}{8n}\right), \end{equation} so \(\mathbb{E}_{D} \bigl[Z_{\bm{\theta}}(D)\bigr]\le 1\) for every \(\bm{\theta}\).
Define the (random) moment-generating function \(M(D):=\mathbb{E}_{\bm{\theta}\sim\pi} \bigl[Z_{\bm{\theta}}(D)\bigr]\). By Fubini’s theorem and the previous bound, \(\mathbb{E}_{D}[M(D)]\le 1\). Applying Markov’s inequality gives \begin{equation} \mathbb{P}_{D}\left(M(D) > 1/\delta\right) \le \delta. \end{equation} With probability at least \(1-\delta\) the sample therefore falls in the “good event” \(\mathcal E := \bigl\{D : M(D)\le 1/\delta\bigr\}.\)
Fix \(D\in\mathcal E\) and any posterior \(\rho\in\mathcal P(\bm \Theta)\). Using Jensen’s inequality (the Donsker–Varadhan variational formula), \begin{align} \lambda\,\mathbb{E}_{\rho}\left[L(\bm \theta)-\hat L(D, \bm \theta)\right] -\frac{\lambda^{2}C^{2}}{8n} -\mathrm{KL}(\rho|\pi) &=\mathbb{E}_{\rho} \bigl[\log Z_{\bm{\theta}}(D)\bigr] \\ &\le \log\mathbb{E}_{\rho} \bigl[Z_{\bm{\theta}}(D)\bigr] \\ &\le \log M(D)\;\le\;\log(1/\delta). \end{align} Rearranging and dividing by \(\lambda\) yields \begin{equation} \mathbb{E}_{\rho}\left[L(\bm \theta)\right] \;\le\; \mathbb{E}_{\rho}\left[\hat L(D, \bm \theta)\right] +\frac{\lambda C^{2}}{8n} +\frac{\mathrm{KL}(\rho|\pi)+\log(1/\delta)}{\lambda}\,. \end{equation} Because the argument did not rely on the specific choice of \(\rho\), the bound holds simultaneously for all posteriors. □
Deep Variational Implicit Processes
The family \(\{P_{\mathcal J}\}_{\mathcal J\subset\mathcal X}\) satisfies the Kolmogorov Extension Theorem; that is, for every pair of finite index sets \(\mathcal I\subset\mathcal J\) it verifies that \(\pi_{\mathcal I\mathcal J\,\sharp}P_{\mathcal J}=P_{\mathcal I}\).
Proof. For a given finite set \(\mathcal J=\{\mathbf x_{1},\dots,\mathbf x_{k}\}\subset\mathcal X\), recall that \begin{equation} \mathbf f_{\mathcal J}:\Omega\longrightarrow\mathbb R^{\mathcal J}, \qquad \mathbf f_{\mathcal J}(\omega) =\bigl(g(\omega,\mathbf x_{1}),\dots,g(\omega,\mathbf x_{k})\bigr). \end{equation} By construction \(g\) is measurable in \(\omega\), so \(\mathbf f_{\mathcal J}\) is a measurable map between \((\Omega,\mathcal F)\) and \(\bigl(\mathbb R^{\mathcal J},\mathcal B(\mathbb R)^{\otimes\mathcal J}\bigr)\). Set \(P_{\mathcal J}:=P\, \circ\, \mathbf f_{\mathcal J}^{-1}\). For a fixed finite set \(\mathcal I\subset\mathcal J\), the canonical projection \(\pi_{\mathcal I\mathcal J}:\mathbb R^{\mathcal J}\to\mathbb R^{\mathcal I}\) verifies that for any \(y=(y_x)_{x\in\mathcal J}\in\mathbb R^{\mathcal J}\), \begin{equation} \pi_{\mathcal I\mathcal J}(y) :=(y_x)_{x\in\mathcal I}\;\in\mathbb R^{\mathcal I}. \end{equation} Observe that applying \(\pi_{\mathcal I\mathcal J}\) to \(\mathbf f_{\mathcal J}(\omega)\) simply discards the components with indices in \(\mathcal J\setminus\mathcal I\): \begin{equation} \pi_{\mathcal I\mathcal J}\bigl(\mathbf f_{\mathcal J}(\omega)\bigr) =\bigl(g(\omega,\mathbf x)\bigr)_{\,\mathbf x\in\mathcal I} =\mathbf f_{\mathcal I}(\omega), \qquad\forall\,\omega\in\Omega. \end{equation} Hence, as pointwise maps, \begin{equation} \pi_{\mathcal I\mathcal J}\circ\mathbf f_{\mathcal J} =\mathbf f_{\mathcal I}. \end{equation} Using the definition of push-forward measures (\(f_\sharp Q:=Q\circ f^{-1}\)), it verifies that \begin{equation} \pi_{\mathcal I\mathcal J\,\sharp}P_{\mathcal J} =\pi_{\mathcal I\mathcal J\,\sharp}\bigl(P\circ\mathbf f_{\mathcal J}^{-1}\bigr) =P\circ\bigl(\mathbf f_{\mathcal J}\bigr)^{-1}\circ \pi_{\mathcal I\mathcal J}^{-1} =P\circ\bigl(\pi_{\mathcal I\mathcal J}\circ\mathbf f_{\mathcal J}\bigr)^{-1} =P\circ\bigl(\mathbf f_{\mathcal I}\bigr)^{-1} =P_{\mathcal I},. \end{equation} which is exactly the desired consistency identity. Since the argument holds for every finite \(\mathcal I\subset\mathcal J\subset\mathcal X\), the family \(\{P_{\mathcal J}\}\) is consistent, completing the proof. □
Proposition 3.5 Return to statement
Let \(\bm \Omega = \{\bm{m}_h^l,\bm{S}_h^l: l=1,\ldots,L\,, h=1,\ldots,H_l\}\) denote the parameters of \(Q\) and \(\bm \Theta = \{\bm \theta_h^l: l=1,\ldots,L\,, h=1,\ldots,H_l\}\) the deep IP prior parameters. Then, a variational lower bound can be derived; at whose maximum the KL divergence between \(Q(\{\mathbf{f}^{\,l}, \bm{a}^l\}_{l=1}^L)\) and \(P\left(\{\mathbf{f}^{\,l}, \bm{a}^l\}_{l=1}^L|\cal{D} \right)\) is minimized \begin{equation} \mathcal{L}\left(\bm \Omega,\bm \Theta, \{\bm \sigma_l^2\}_{l=1}^{L-1}\right) = \sum_{n=1}^N \mathbb{E}_{Q(\mathbf{f}_{\cdot,n}^{\, L})} \big[ \log P\big(y_n| \mathbf{f}^{\, L}_{\cdot,n}\big)\big] -\sum_{l=1}^L \sum_{h=1}^{H_l} \text{KL}\big(Q(\bm{a}_h^l) | P(\bm{a}_h^l)\big)\,. \end{equation}
Proof. Consider the general definition of the ELBO in variational inference, i.e. \begin{equation} \mathcal{L}\left(\bm \Omega,\bm \Theta, \{\bm \sigma_l^2\}_{l=1}^{L-1}\right) = \mathbb{E}_{Q(\{\mathbf F^{\,l}, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{P\left(\mathbf y, \{\mathbf F^{\, l}, \mathbf{a}^l\}_{l=1}^L \right)}{Q(\{\mathbf F^{\, l}, \mathbf{a}^l\}_{l=1}^L)} \right]\,, \end{equation} using DVIPs model specification: \begin{align} P\big(\mathbf y, \{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L \big) &= \prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} P(f^l_{h,n} |\mathbf{a}^l_h)P(\mathbf a^l_h)\,,\\ Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L) &= \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} P(f^l_{h,n} |\mathbf{a}^l_h)Q(\mathbf a^l_h)\,, \end{align} the variational lower bound takes the following form: \begin{align} \mathcal{L} &= \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{P\left(\mathbf y, \{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L \right)}{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \right]\\ &= \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{\prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} \cancel{P(f^l_{h,n} |\mathbf{a}^l_h)}P(\mathbf a^l_h)}{\prod_{n=1}^N \prod_{l=1}^L \prod_{h=1}^{H_l} \cancel{P(f^l_{h,n} |\mathbf{a}^l_h)}Q(\mathbf a^l_h)} \right]\\ &= \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{\prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \prod_{l=1}^L \prod_{h=1}^{H_l}P(\mathbf a^l_h)}{\prod_{l=1}^L \prod_{h=1}^{H_l} Q(\mathbf a^l_h)} \right]\,. \end{align} From this expression, the expectation can be split into two terms \begin{equation} \mathcal{L} = \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \right] + \mathbb{E}_{Q(\{\mathbf F^l, \mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{ \prod_{l=1}^L \prod_{h=1}^{H_l}P(\mathbf a^l_h)}{\prod_{l=1}^L \prod_{h=1}^{H_l} Q(\mathbf a^l_h)} \right]\,. \end{equation} The logarithm in the first term does not depend on the regression coefficients \(\{\bm A^l \}_{l=1}^L\) and neither on \(\{\bm F^l \}_{l=1}^{L-1}\). On the other hand, the logarithm in the second term does not depend on \(\{\bm F^l \}_{l=1}^{L}\) . Thus, \begin{align} \mathcal{L} &= \mathbb{E}_{Q(\mathbf F^L)} \left[ \log \prod_{n=1}^N P(y_n |\mathbf f_{\cdot,n}^L) \right] + \mathbb{E}_{Q( \{\mathbf{a}^l\}_{l=1}^L)} \left[ \log \frac{ \prod_{l=1}^L \prod_{h=1}^{H_l}P(\mathbf a^l_h)}{\prod_{l=1}^L \prod_{h=1}^{H_l} Q(\mathbf a^l_h)} \right]\\ &=\sum_{n=1}^N \mathbb{E}_q \big[ \log P\big(y_n|\mathbf{f}^L_{\cdot,n}\big)\big] - \sum_{l=1}^L \sum_{h=1}^{H_l} \text{KL}\big(Q(\mathbf{a}_h^l) \big \rvert P(\mathbf{a}_h^l)\big)\,. \end{align} □
Variational Linearized Laplace Approximation
Theorem 4.1 Return to statement
For any kernel function \(K\) and its corresponding RKHS \(\cal{H}\) —with feature maps defined as \(\phi_{\mathbf{x}} := K(\cdot, \mathbf{x}) \in \cal{H}\)—, any \(\bm{a}\in \mathbb{R}^M\), \(\bm{A} \in \mathbb{R}^{M \times M}\) with \(\bm{A} \succeq 0\) and \(\mathbf{Z} = (\mathbf{z}_1, \dots, \mathbf{z}_M) \in \mathcal{X}^{M}\), let \(\tilde{\mu} \in \mathcal{H}\) and \(\tilde{\Sigma} : \cal{H} \to \cal H\) be defined as \begin{equation} \tilde{\mu} = \sum_{m=1}^{M} a_m \phi_{\mathbf{z}_m}, \qquad \text{ and }\qquad \tilde{\Sigma} = \left(I + \sum_{i=1}^{M}\sum_{j=1}^M\phi_{\mathbf{z}_i} A_{i,j} \phi_{\mathbf{z}_j}^T\right)^{-1}\,, \end{equation} then, \(\tilde \Sigma \in L^{+}(\mathcal H)\), and the Gaussian measure defined by \((\tilde{\mu}, \tilde{\Sigma})\) is equivalent to a SVGP with variational distribution \(Q(\mathbf{u})=\mathcal{N}(\mathbf{u}|\bm{m},\bm{S})\), defined as \begin{equation} \bm{m} = K(\mathbf{Z}, \mathbf{Z}) \bm a, \quad \text{and} \quad \bm{S} =K(\mathbf{Z}, \mathbf{Z}) - K(\mathbf{Z}, \mathbf{Z}) (\bm A^{-1} + K(\mathbf{Z}, \mathbf{Z}))^{-1} K(\mathbf{Z}, \mathbf{Z})\,. \end{equation}
Proof. Write \(\Psi = \bigl[\phi_{\mathbf z_1} \cdots \phi_{\mathbf z_M}\bigr] :\mathbb R^{M}\to\mathcal H\). Because \(\bm A=\bm A^{\top}\) it verifies that \(\Psi \bm A \Psi^{\top}= (\Psi \bm A \Psi^{\top})^{\ast}\), hence \(\big(\tilde\Sigma^{-1}\big)^{\ast}= \tilde\Sigma^{-1}\). For any \(f\in\mathcal H\), \begin{equation} \langle f,\tilde\Sigma^{-1}( f)\rangle_{\mathcal H} = \|f\|_{\mathcal H}^{2} + \bigl\langle \Psi^{\top}f,\bm A\Psi^{\top}f \bigr\rangle_{\mathbb R^{M}} \ge 0\,, \end{equation} because \(\bm A\succeq0\). Thus \(\tilde\Sigma^{-1}\succeq0\) and \(\tilde{\Sigma}^{-1} \in \mathcal{L}^+(\mathcal{H})\). Thus \(\tilde{\Sigma} \in \mathcal{L}^+(\mathcal{H})\). Recall that from the dual formulation of GPs, the mean function of the GP is defined as \(m^{\star}(\mathbf{x}) := \braket{\phi_{\mathbf{x}}, \mu}\). Thus, it verifies that \begin{equation} m^{\star}(\mathbf{x}) = \braket{\phi_{\mathbf{x}}, \tilde{\mu}} = \braket{\phi_{\mathbf{x}},\sum_{m=1}^{M} a_m \phi_{\mathbf{z}_m}} = \sum_{m=1}^{M} a_m \braket{\phi_{\mathbf{x}}, \phi_{\mathbf{z}_m}} = K(\mathbf{x}, \mathbf{Z}) \bm{a}\,. \end{equation} Using that \(\bm{a} = K(\mathbf{Z}, \mathbf{Z})^{-1} \bm{m}\), the mean function from SVGPs (Titsias, 2009) is recovered. The same procedure can be used for the covariance matrix, using that \(\tilde{\Sigma}^{-1} = I + \Psi \bm A \Psi^{\top}\), applying Woodbury’s matrix identity: \begin{equation} \tilde{\Sigma} = I + \Psi\tilde{\bm{A}}\Psi^\top\,, \quad \text{ where } \tilde{\bm{A}} = -(\bm A^{-1} + K(\mathbf Z, \mathbf Z))^{-1}\,. \end{equation} Thus, the kernel \(K^{\star}(\mathbf{x}, \mathbf{x}') := \braket{\phi_{\mathbf{x}},\tilde\Sigma (\phi_{\mathbf{x}'})}\): \begin{align} K^{\star}(\mathbf{x}, \mathbf{x}') &= \braket{\phi_{\mathbf{x}},\tilde{\Sigma} (\phi_{\mathbf{x}'})} = \braket{\phi_{\mathbf{x}}, \phi_{\mathbf{x}'} + \sum_{i=1}^{M}\sum_{j=1}^M\phi_{\mathbf{z}_i} \tilde A_{i,j} \braket{\phi_{\mathbf{z}_j}, \phi_{\mathbf{x}'}}}\\ &= \braket{\phi_{\mathbf{x}}, \phi_{\mathbf{x}'}} + \sum_{i=1}^{M}\sum_{j=1}^M\braket{\phi_{\mathbf{x}}, \phi_{\mathbf{z}_i}} \tilde A_{i,j} \braket{\phi_{\mathbf{z}_j}, \phi_{\mathbf{x}'}}\\ &= K(\mathbf x, \mathbf x') + \sum_{i=1}^{M}\sum_{j=1}^M K(\mathbf x, \mathbf z_i) \tilde A_{i,j} K(\mathbf z_j, \mathbf x')\\ &= K(\mathbf x, \mathbf x') + K(\mathbf x, \mathbf Z) \tilde{\bm{A}} K(\mathbf Z, \mathbf x')\,. \end{align} From the definition of \(\bm{S}\): \begin{equation} K(\mathbf{Z}, \mathbf{Z})^{-1} - K(\mathbf{Z}, \mathbf{Z})^{-1}\bm{S}K(\mathbf{Z}, \mathbf{Z})^{-1} = (\bm A^{-1} + K(\mathbf{Z}, \mathbf{Z}))^{-1} = -\tilde{\bm{A}} \,. \end{equation} Exchanging \(\tilde{\bm{A}}\) with this quantity recovers the covariance function of SVGPs and completes the proof. □
Proposition 4.2 Return to statement
For any kernel \(K:\cal X \times \cal X \to \mathbb{R}\) and its corresponding RKHS \(\cal H\), if \(g(\cdot, \hat{\bm{\theta}}) \in \mathcal{H}\), then \(\forall \epsilon > 0\), \(\exists M_\alpha \in \mathbb{N}^+\), \(\mathbf{Z}_{\alpha}\in \mathcal{X}^{M_\alpha}\), \(\bm{a} \in \mathbb{R}^{M_\alpha}\) such that, for any \(M_\beta \in \mathbb{N}\), \(\mathbf{Z}_{\beta} \in \cal{X}^{M_\beta}\), \(\bm{A} \in \mathbb{R}^{M_\beta \times M_\beta}\) with \(\bm{A} \succeq 0\), the Gaussian measure \(Q(f) = \mathcal{N}(f|\tilde{\mu}_{\alpha, \bm a}, \tilde{\Sigma}_{\beta, \bm{A}}) \in \cal{Q}^+\) has a corresponding GP with mean and covariance functions defined as \begin{equation} m^{\star}(\mathbf{x}) = \tilde{\mu}_{\alpha, \bm a}(\mathbf{x})\,,\quad K^{\star}(\mathbf{x}, \mathbf{x}') =K(\mathbf{x}, \mathbf{x}') - K_{\mathbf{x}, \mathbf{Z}_\beta}(\bm{A}^{-1} + \bm{K}_\beta)^{-1} K_{\mathbf{Z}_\beta, \mathbf{x}'} \,, \end{equation} where \(d_{\mathcal{H}}(g(\cdot, \hat{\bm{\theta}}), \tilde{\mu}_{\alpha, \bm a}) \leq \epsilon\), with \(d_{\mathcal{H}}(\cdot,\cdot)\).
Proof. If \(g(\cdot, \hat{\bm{\theta}}) \in \mathcal{H}\), the reproducing property of the RKHS verifies that \(\forall \epsilon > 0\) there exists \(\mathbf{Z}_{\alpha} \subset \mathcal{X}\) and \(\{\bm{a}_i\}_{i \in \mathbb{N}}\) such that \(\tilde{\mu}_{\alpha, \bm a}= \sum_{i=1}^{M_\alpha} a_i \phi_{\mathbf{z}_i}\) verifies \begin{equation} d_{\mathcal{H}}(g(\cdot, \hat{\bm{\theta}}), \tilde{\mu}_{\alpha, \bm a}) \leq \epsilon\,. \end{equation} As a result, the mean function of the SVGP is \begin{equation} m^{\star}(\mathbf{x}) = \braket{\phi_{\mathbf{x}}, \tilde{\mu}_{\alpha, \bm a}} = \tilde{\mu}_{\alpha, \bm a}(\mathbf{x}) \approx g(\mathbf{x}, \hat{\bm{\theta}})\,. \end{equation} On the other hand, using that the covariance function is \begin{align} K^{\star}(\mathbf{x}, \mathbf{x}' ) &= \braket{\phi_{\mathbf{x}},\tilde{\Sigma} (\phi_{\mathbf{x}'})} = \braket{\phi_{\mathbf{x}},\phi_{\mathbf{x}'}} - \braket{\phi_{\mathbf{x}},\Phi_{\mathbf{Z}_{\beta}}(\bm{A}^{-1} + \Phi_{\mathbf{Z}_{\beta}}^T\Phi_{\mathbf{Z}_{\beta}})^{-1}\Phi_{\mathbf{Z}_{\beta}}^T \phi_{\mathbf{x}'}}\\ &= K(\mathbf{x}, \mathbf{x}') - K_{\mathbf{x}, \mathbf{Z}_\beta}(\bm{A}^{-1} - K_{\mathbf{Z}_\beta})^{-1} K_{\mathbf{Z}_\beta, \mathbf{x}'}\,. \end{align} □
Proposition 4.3 Return to statement
The value of \(\bm{A}\) that minimizes Equation \(\eqref{eq:opt_dual}\) is \begin{equation} \bm{A} = \frac{1}{\sigma^2} \bm{K}_{\beta}^{-1} \bm{K}_{\bm{Z}_\beta, \bm{X}}\bm{K}_{\bm{X}, \bm{Z}_\beta} \bm{K}_{\beta}^{-1}\,, \end{equation} where \(\sigma^2\) is the noise variance and \(\bm{K}_{\bm{X}, \bm{Z}_\beta}\) is a matrix with the prior covariances between \(f(\mathbf{X})\) and \(f(\mathbf{Z}_\beta)\). If \(\bm{Z}_\beta=\mathbf{X}\), the covariance function of the predictive distribution in Equation \(\eqref{eq:pred_valla}\) is equal to that of the full GP.
Proof. The objective is to maximize the ELBO over the covariance parameter of the decoupled variational family under a Gaussian likelihood \(p(\mathbf y|\mathbf f) = \mathcal N\big(\mathbf y| \mathbf f,\, \sigma^2 \mathbf I\big)\). Let \(\bm{u}_\beta := f(\mathbf Z_\beta)\) with prior \(p(\bm{u}_\beta)=\mathcal N(0,\bm K_\beta)\), where \((\bm K_\beta)_{ij}=K(\mathbf z_{\beta,i},\mathbf z_{\beta,j})\). Consider a variational distribution \(q(\bm{u}_\beta)=\mathcal N(0, \bm{S})\) with some PSD matrix \(S\). Only the covariance part matters for optimizing \(\mathbf A\). Using Gaussian conditioning identities, \begin{equation} \mathrm{Cov}_q(\mathbf f) = \bm K_{\mathbf X, \mathbf X} - \bm K_{\mathbf X, \mathbf Z_\beta}\,\bm K_\beta^{-1}\bm K_{\mathbf Z_\beta, \mathbf X} + \bm K_{\mathbf X, \mathbf Z_\beta}\,\bm K_\beta^{-1} \bm S \bm K_\beta^{-1}\,\bm K_{\mathbf Z_\beta, \mathbf X}. \end{equation} The expected log-likelihood contributes (up to a constant independent of \(\bm S\)): \begin{equation} -\frac{1}{2\sigma^2}\,\mathrm{tr}\left(\bm K_{\mathbf X, \mathbf Z_\beta}\,\bm K_\beta^{-1} \bm S \bm K_\beta^{-1}\,\bm K_{\mathbf Z_\beta, \mathbf X}\right). \end{equation} Define \begin{equation} \bm C := \bm K_\beta^{-1}\bm K_{\mathbf Z_\beta, \mathbf X}\bm K_{\mathbf X, \mathbf Z_\beta}\bm K_\beta^{-1}. \end{equation} Then the term simplifies to \begin{equation} -\frac{1}{2\sigma^2}\,\mathrm{tr}(\bm S\bm C). \end{equation} The KL divergence between \(q(u_\beta)=\mathcal N(0,\bm S)\) and \(p(u_\beta)=\mathcal N(0,\bm K_\beta)\) is \begin{equation} \mathrm{KL}(q|p) = \tfrac12\Big(\mathrm{tr}(\bm K_\beta^{-1}\bm S) - \log\det(\bm K_\beta^{-1}\bm S) - m_\beta\Big)\,, \end{equation} where \(m_\beta = |\mathbf{Z}_\beta|\). Thus the \(\bm S\)–dependent part of the ELBO is \begin{equation} \mathcal L(S) \;=\; -\frac{1}{2\sigma^2}\,\mathrm{tr}(\bm S\bm C) \;-\; \tfrac12\Big(\mathrm{tr}(\bm K_\beta^{-1}\bm S) - \log\det(\bm K_\beta^{-1}\bm S) - m_\beta\Big). \end{equation} Taking derivatives w.r.t. \(\bm S\) and setting to zero: \begin{equation} -\frac{1}{2\sigma^2}\bm C - \tfrac12 \bm K_\beta^{-1} + \tfrac12 \bm S^{-1} = 0. \end{equation} Therefore \begin{equation} \bm S^{-1} = \bm K_\beta^{-1} + \tfrac{1}{\sigma^2}\bm C\,. \end{equation} Applying Woodbury matrix identity: \begin{equation} \bm S = \bm K_\beta - \bm K_\beta(\sigma^2\bm C^{-1} + \bm K_\beta)^{-1} \bm K_\beta\,. \end{equation} In the dual parameterization, \(\bm S\) and \(\bm A\) are related via: \begin{equation} \bm S \;=\; \bm K_\beta - \bm K_\beta ( \bm A^{-1} + \bm K_\beta)^{-1} \bm K_\beta. \end{equation} As a result, it verifies that \begin{equation} \bm A^{-1} = \sigma^2 \bm{C}^{-1} \implies \bm A = \tfrac{1}{\sigma^2}\bm K_\beta^{-1}\bm K_{\mathbf Z_\beta, \mathbf X}\bm K_{\mathbf X, \mathbf Z_\beta}\bm K_\beta^{-1}\,. \end{equation} If \(\mathbf Z_\beta=\mathbf X\), then \(\bm K_{\mathbf Z_\beta,\mathbf X}=\bm K_{\mathbf X,\mathbf X}\) and \(\bm A^{\ast}=\tfrac{1}{\sigma^2}\bm I\), which recovers the full GP posterior covariance. □
Proposition 4.5 Return to statement
Let \(\mathcal{Z}\subset\mathcal{X}\) be compact and let \(K\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}\) be a universal kernel in the sense of Definition 4.4. Then, for every function \(f\in C(\mathcal{Z})\) and for every \(\varepsilon>0\) there exist \begin{equation} M_\alpha\in\mathbb{N},\qquad \bigl\{\mathbf{z}_1,\dots,\mathbf{z}_{M_\alpha}\bigr\}\subset\mathcal{Z}, \quad\text{and}\quad \bigl\{a_1,\dots,a_{M_\alpha}\bigr\}\subset\mathbb{R}, \end{equation} such that the finite kernel expansion \begin{equation} m(\mathbf{x})\;:=\;\bigl\langle\phi_{\mathbf{x}},\tilde{\mu}_{\alpha,\bm{a}}\bigr\rangle_{\mathcal{H}_K} \;=\; \sum_{m=1}^{M_\alpha}a_m\,K\!\bigl(\mathbf{x},\mathbf{z}_m\bigr), \qquad\mathbf{x}\in\mathcal{Z}, \end{equation} satisfies the uniform approximation bound \begin{equation} \bigl\lVert f - m\bigr\rVert_{\infty,\mathcal{Z}}\;\le\;\varepsilon. \end{equation}
Proof. Fix an arbitrary function \(f\in C(\mathcal{Z})\) together with \(\varepsilon>0\). Because \(K\) is universal, Definition 4.4 guarantees the existence of a function \(g_{\varepsilon/2}\in K(\mathcal{Z})\) such that \begin{equation} \bigl\lVert f - g_{\varepsilon/2}\bigr\rVert_{\infty,\mathcal{Z}} \;\le\;\frac{\varepsilon}{2}. \end{equation} By construction, the kernel section admits the representation \begin{equation} K(\mathcal{Z}) \;=\; \overline{\Bigl\{ \sum_{i=1}^{n} a_i\,K(\,\cdot\,,\mathbf{z}_i) : n\in\mathbb{N},\;a_i\in\mathbb{R},\; \mathbf{z}_i\in\mathcal{Z} \Bigr\}}^{\lVert\cdot\rVert_{\infty,\mathcal{Z}}}. \end{equation} Consequently, there exists a finite sum \begin{equation} g(\mathbf{x}) \;=\; \sum_{m=1}^{M_\alpha}a_m\,K \bigl(\mathbf{x},\mathbf{z}_m\bigr), \qquad \mathbf{x}\in\mathcal{Z}, \end{equation} such that \begin{equation} \bigl\lVert g_{\varepsilon/2}-g\bigr\rVert_{\infty,\mathcal{Z}} \;\le\;\frac{\varepsilon}{2}. \end{equation} Combining the two preceding inequalities yields \begin{equation} \bigl\lVert f-g\bigr\rVert_{\infty,\mathcal{Z}} \;\le\; \bigl\lVert f-g_{\varepsilon/2}\bigr\rVert_{\infty,\mathcal{Z}} +\bigl\lVert g_{\varepsilon/2}-g\bigr\rVert_{\infty,\mathcal{Z}} \;\le\;\frac{\varepsilon}{2}+\frac{\varepsilon}{2} \;=\;\varepsilon. \end{equation} □
Diversity and Generalization in Deep Neural Network Ensembles
Theorem 5.1 Return to statement
For any distribution \(\rho\) over \(\bm{\Theta}\), and any of the three considered loss functions for ensembles, there exists a function \(\mathbb{D}(\rho)\) such that \begin{equation} L(\rho) \leq \alpha(\mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho)), \end{equation} where \(\alpha\) equals \(1\) if the \(sq\)-loss or the \(ce\)-loss are considered, and \(4\) for the \(0/1\)-loss. Furthermore, for the \(sq\)-loss, this inequality becomes an equality. The expression of the diversity measure for each of these loss functions is: \[\begin{eqnarray} \mathbb{D}_{sq}(\rho) &:=& \mathbb{E}_\nu\Big[\mathbb{V}_\rho(h_R(\mathbf{x};\bm{\theta}))\Big],\\ \mathbb{D}_{ce}(\rho) &:=& \mathbb{E}_\nu\left[\mathbb{V}_\rho\left(\frac{P(y |\mathbf{x},\bm{\theta})}{\sqrt{2} \max_{\bm{\theta}} P(y |\mathbf{x},\bm{\theta})}\right)\right],\\ \mathbb{D}_{0/1}(\rho) &:=& \mathbb{E}_\nu\Big[\mathbb{V}_\rho\Big(\mathbb{I}(h_W(\mathbf{x};\bm{\theta})\neq y)\Big)\Big], \end{eqnarray}\] where \(\mathbb{V}_\rho(\cdot)\) denotes the variance of a function w.r.t. \(\rho\), and for \(\mathbb{D}_{ce}(\rho)\) to be well-defined it must verify that \(0<\max_{\bm{\theta}\in \bm{\Theta}} P(y | \mathbf{x},\bm{\theta})\leq 1\) for every \((\mathbf{x}, y) \in supp(\nu)\). Finally, note that all diversity terms described above can be written as \begin{equation} \mathbb{D}(\rho) = \mathbb{E}_\nu \Big[ \mathbb{V}_{\rho} \left( f(y,\mathbf{x};\bm{\theta}) \right)\Big], \end{equation} with a specific function \(f\) for each of the loss functions.
Proof. Using the fact that the variance of the classifiers can be decomposed as: \begin{align} \mathbb{V}_{\rho}(f(\bm{\theta})) &= \mathbb{E}_\rho[f(\bm{\theta})^2] - \mathbb{E}_\rho[f(\bm{\theta})]^2 = \mathbb{E}_\rho[f(\bm{\theta})^2] - \mathbb{E}_{\rho^2}[f(\bm{\theta})f(\bm{\theta}')] \\ &= \mathbb{E}_{\rho^2}\Big[f(\bm{\theta})^2 - f(\bm{\theta})f(\bm{\theta}')\Big], \end{align} where \(f\) is determined by the considered loss function: for the \(sq\)-loss, \(f(\bm{\theta})= h_R(\bm{x};\bm{\theta})\), the \(ce\)-loss, \(f(\bm{\theta})= p(y|\bm{x},\bm{\theta})\), and the \(0/1\)-loss \(f(\bm{\theta}) = \mathbb{I}(h(\bm{x};\bm{\theta})\neq y)\). The diversity terms \(\mathbb{D}(\rho)\) defined in Theorem 5.1 can be written as, \begin{align} \mathbb{D}_{sq}(\rho) & = \mathbb{E}_{\rho^2}\Big[\mathbb{E}_\nu\Big[h_R(\bm{x};\bm{\theta})^2 - h_R(\bm{x};\bm{\theta})h_R(\bm{x};\bm{\theta}')\Big]\Big]\\ \mathbb{D}_{0/1}(\rho) & = \mathbb{E}_{\rho^2}\Big[\mathbb{E}_\nu\Big[\mathbb{I}(h(\bm{x};\bm{\theta})= y)\mathbb{I}(h(\bm{x};\bm{\theta}')\neq y)\Big]\Big]\\ \mathbb{D}_{ce}(\rho) & = \mathbb{E}_{\rho^2}\left[\mathbb{E}_\nu\left[\frac{p( y|\bm{x},\bm{\theta})^2 - p(y\mid \bm{x},\bm{\theta})p(y|\bm{x},\bm{\theta}')}{\displaystyle 2\max_{\bm{\theta}\in \bm{\Theta}} p(y|\bm{x},\bm{\theta})^2}\right]\right] \end{align} where \(\rho^2\) is a shorthand for the product distribution \(\rho \times \rho\) over \(\bm{\Theta} \times \bm{\Theta}\) and the shorthand \(\mathbb{E}_{\rho^2}[f(\bm{\theta},\bm{\theta}')] = \mathbb{E}_{\bm{\theta}\sim\rho, \bm{\theta}'\sim\rho}[f(\bm{\theta},\bm{\theta}')]\). Recall the definition of the expected mean squared error of a regression ensemble reparameterized by the distribution \(\rho\), that is, \begin{equation} L_{sq}(\rho) = \mathbb{E}_\nu \left[(y-h_R(\bm{x};\rho))^2\right], \end{equation} where \(h_R(\bm{x}; \rho) = \mathbb{E}_\rho \left[h_R(\bm{x} ; \bm{\theta})\right]\) with \(h_R(\bm{x} ; \bm{\theta})\) an individual regression model. The desired result can be obtained by expanding the square in each of the elements on the right-hand side of the equation, that is: \begin{align} \mathbb{E}\left[L_{sq}(\bm{\theta})\right] &= \mathbb{E}_{\rho, \nu} \left[(y - h_R(\bm{x}; \bm{\theta}))^2\right] = \mathbb{E}_{\rho, \nu} \left[y^2 - 2yh_R(\bm{x}; \bm{\theta}) + h_R(\bm{x} ; \bm{\theta})^2\right]\\ &= \mathbb{E}_\nu \left[y^2 -2y h_R(\bm{x};\rho) + \mathbb{E}_\rho [h_R(\bm{x}; \bm{\theta})^2]\right], \end{align} where the fact that \(y\) is constant under \(\mathbb{E}_\rho\) is used. On the other hand, \begin{align} \mathbb{D}_{sq}(\rho) &= \mathbb{E}_{\nu,\rho} \left[ (h_R(\bm{x}, \bm{\theta}) - \mathbb{E}_\rho[h_R(\bm{x};\bm{\theta})])^2\right] = \mathbb{E}_{\nu,\rho} \left[ (h_R(\bm{x}, \bm{\theta}) - h_R(\bm{x};\rho))^2\right]\\ &= \mathbb{E}_{\nu,\rho} \left[ h_R(\bm{x}, \bm{\theta})^2 -2h_R(\bm{x}, \bm{\theta})h_R(\bm{x}, \rho) + h_R(\bm{x};\rho)^2\right]\\ &= \mathbb{E}_\nu \left[ \mathbb{E}_\rho[h_R(\bm{x}, \bm{\theta})^2] -2h_R(\bm{x}, \rho)^2 + h_R(\bm{x};\rho)^2 \right]. \end{align} Finally, subtracting both expressions: \begin{align} \mathbb{E}_\rho[L_{sq}(\bm{\theta})] - \mathbb{D}_{sq}(\rho) &= \mathbb{E}_\nu \left[ y^2 - 2yh_R(\bm{x}; \rho) + h_R(\bm{x}; \rho)^2\right]\\ &= \mathbb{E}_\nu \left[(y - h_R(\bm{x}; \rho))^2\right]\\ &= L_{sq}(\rho). \end{align} Continuing with the cross-entropy error, the use of Taylor’s theorem with a remainder of second order over the logarithm function is needed. That is, given \(\log x\) and a fixed value \(a > 0\), \begin{equation} \log x = \log a + \frac{1}{a}(x - a) - \frac{1}{2 \xi^2}(x - a)^2, \quad \xi \in (x, a). \end{equation} Applying this to \(p(y|\bm{x}, \bm{\theta})\) centered at \(\mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] > 0\), \begin{align} \log p(y|\bm{x}, \bm{\theta}) &= \log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] + \frac{1}{\mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]}\left( p(y|\bm{x}, \bm{\theta})- \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] \right) \\ &\quad- \frac{1}{2\xi^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\,. \end{align} Taking expectation over \(\rho\) at both sides, \begin{equation} \mathbb{E}_\rho [\log p(y|\bm{x}, \bm{\theta})] = \log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] - \mathbb{E}_\rho \left[\frac{1}{2\xi^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\right]. \end{equation} Rearranging terms, \begin{equation} -\log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]= -\mathbb{E}_\rho [\log p(y|\bm{x}, \bm{\theta})]- \mathbb{E}_\rho \left[\frac{1}{2\xi^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\right]. \end{equation} The desired inequality arises from the fact that \(\xi\) is between \(p(y|\bm{x}, \bm{\theta})\) and \(\mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\), and hence, is upper bounded by \(max_{\bm{\theta} \in \bm{\Theta}} p(y|\bm{x}, \bm{\theta})\). Additionally, the square in the last term is always positive, implying the whole term is positive. Using these two properties, \begin{align} -\log \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})] &\leq -\mathbb{E}_\rho [\log p(y|\bm{x}, \bm{\theta})]\\ &\quad- \mathbb{E}_\rho \left[\frac{1}{2 \max_{\bm{\theta}} p(y|\bm{x}, \bm{\theta})^2}\left(p(y|\bm{x}, \bm{\theta}) - \mathbb{E}_\rho [p(y|\bm{x}, \bm{\theta})]\right)^2\right]. \end{align} Finally, taking expectations w.r.t. \(\nu\) on both sides raises the desired result. Lastly, consider the \(0/1\)-error. In order to prove this result, use Markov’s inequality for monotonically increasing functions, in this case, for \(\psi(a) = a^2\). That is, for a given random variable \(X\), \begin{equation} \mathbb{P}(|x| \geq a) \leq \frac{\mathbb{E}[\psi(|x|)]}{\psi(a)} = \frac{\mathbb{E}[|x|^2]}{a^2}, \quad \text{for any } a > 0. \end{equation} Applying this theorem to \(\mathbb{E}_\rho\left[ \mathbb{I}(h_W(\bm{x} ; \bm{\theta}) \neq y)\right]\), it verifies that \begin{equation} \mathbb{P}\big( \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right] \geq 0.5 \big) \leq 4 \mathbb{E}_\nu\left[ \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right]^2\right]. \end{equation} Notice that when majority vote makes an error, at least half (\(\rho\)-weighted) of the classifiers are wrong, that is, \begin{equation} \mathbb{I}(h_W(\bm{x} ; \rho) \neq y) \leq \mathbb{I}[\mathbb{E}_\rho[\mathbb{I}h_W(\bm{x};\bm{\theta}) \neq y] \geq 0.5]\,, \end{equation} which implies \begin{align} \mathbb{E}_\nu \left[ \mathbb{I} \left(h_W(\bm{x} ; \rho) \neq y\right) \right] &\leq \mathbb{E}_\nu \left[ \mathbb{I} \left( \mathbb{E}_\rho\left[\mathbb{I} \left(h_W(\bm{x};\bm{\theta}) \neq y \right)\right] \geq 0.5\right) \right] = \\ & = \mathbb{P}\left( \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right] \geq 0.5 \right)\,. \end{align} Using the derived inequality of the last term, \begin{equation} L_{0/1}(\rho) = \mathbb{E}_\nu \left[ \mathbb{I} \left(h_W(\bm{x} ; \rho) \neq y\right) \right] \leq 4 \mathbb{E}_\nu\left[ \mathbb{E}_\rho \left[\mathbb{I} \left(h_W(\bm{x}; \bm{\theta}) \neq y\right)\right]^2\right]. \end{equation} To conclude the proof, it remains to show that the right-hand side of the inequality is the desired upper bound on the \(0/1\) loss. Using that \(\operatorname{Var}(X) = \mathbb{E} \left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2\) and applying this identity to the indicator \(X=\mathbb{I}\big(h_W(\bm{x};\bm{\theta}) \neq y\big)\), it follows that \begin{align} \mathbb{D} _{0/1}(\rho)) &= \mathbb{E}_\nu\left[\mathbb{E}_\rho\left[(\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y) - \mathbb{E}_\rho\left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right])^2\right]\right])\\ &= \mathbb{E}_\nu\left[\mathbb{E}_\rho\left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right] -\mathbb{E}_\rho \left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]^2 \right]\\ &= \mathbb{E}_\rho\Big[\mathbb{E}_\nu\left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]\Big] -\mathbb{E}_\nu \left[\mathbb{E}_\rho \left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]^2 \right]. \end{align} Which implies that \begin{align} 4\left(\mathbb{E}_\rho[L_{0/1}(\bm{\theta})] - \mathbb{D} _{0/1}(\rho)\right) &= 4\Big(\mathbb{E}_\rho\left[ \mathbb{E}_\nu \left[\mathbb{I} (h_W(\bm{x}; \bm{\theta}) \neq y)\right]\right] - \mathbb{D} _{0/1}(\rho)\Big)\\ &= 4\mathbb{E}_\nu \left[\mathbb{E}_\rho \left[\mathbb{I}(h_W(\bm{x};\bm{\theta})\neq y)\right]^2 \right]. \end{align} □
Theorem 5.2 Return to statement
Any distribution \(\rho\) over \(\bm{\Theta}\) satisfies the following inequality, \begin{equation} L_{ce}(\rho)\leq \mathbb{E}_{\rho}[L_{ce}(\bm{\theta})] - \mathbb{V}^T_{ce}(\rho), \end{equation} where \(\mathbb{V}^T_{ce}(\rho)\) is the normalized variance of \(P(y | \mathbf{x}, \bm{\theta})\) w.r.t. \(\rho(\bm{\theta})\), \begin{equation} \mathbb{V}^T_{ce}(\rho) := \mathbb{E}_{\nu}\Big[h(m,\mu)\mathbb{E}_{\rho}\Big[(P(y | \mathbf{x}, \bm{\theta}) - P(y))^2\Big]\Big]. \end{equation} Where \(\mu:=\mathbb{E}_{\rho}[P(y | \mathbf{x}, \bm{\theta})]\), \(m:= \max_{\bm{\theta}} P(y |\mathbf{x}, \bm{\theta})\) and \(h(m,\mu) := \frac{\ln \mu - \ln m }{(m - \mu)^2} + \frac{1}{\mu(m - \mu)}\).
Proof. Apply (Liao & Berg, 2019)’s result to the random variable \(p(\bm{x} | \bm{\theta})\), following the same strategy used in the proof of Theorem 5.1. □
The diversity terms \(\mathbb{D}(\rho)\) defined in Theorem 5.1 satisfy the following properties:
If all the ensemble members provide the same predictions, or if \(\rho\) outs all its probability mass on a single predictor, then \(\mathbb{D}(\rho)\) is null.
\(0\leq \mathbb{D}(\rho) \leq \mathbb{E}_\rho[L(\bm{\theta})]\).
\(\mathbb{D}(\rho)\) is invariant to reparametrizations.
Proof.
Notice that if all models make the same prediction, \(h(\bm{x}; \bm{\theta}) = \mathbb{E}_\rho[h(\bm{x} ; \bm{\theta})]\) and \(p(y|\bm{x}, \bm{\theta}) = \mathbb{E}_\rho[p(y|\bm{x}, \bm{\theta})]\), which nullifies all diversity definitions from Theorem 5.1.
Using that every considered loss function and variance are positive, the given inequality is trivial.
Let \(\phi: \bm{\Omega} \to \bm{\Theta}\) be an injective differentiable function with continuous partial derivatives, with non-zero Jacobian at any point. This result follows from the fact that all considered a probability distributions (ensembles) are a finite mixture of delta distributions, as a result, the distribution is compactly supported and the variable change theorem can be applied to a continuous function \(f: \bm{\Theta} \to \mathbb{R}\): \begin{align} \mathbb{E}_\rho[f(\bm{\theta})] &= \int_{\bm{\Theta}} \rho (\bm{\theta}) f(\bm{\theta}) = \int_{\bm{\Omega}} \rho \circ \phi(\bm\omega) \ f \circ \phi(\bm\omega)\ |det(D\phi)(\bm\omega)| \\ &= \mathbb{E}_{\rho '}[f \circ \phi (\bm\omega)] \end{align} where \begin{equation} \rho '(\bm\omega) = \rho \circ \phi(\bm\omega)\ |det(D\phi)(\bm\omega)|. \end{equation} The result follows from taking \(f(\bm{\theta}) = (\bm{\theta} - \mathbb{E}_\rho (\bm{\theta}))^2\).
□
Theorem 5.4 Return to statement
The diversity terms \(\mathbb{D}(\rho)\) defined in Theorem 5.1 can be written as \begin{equation} \mathbb{D} (\rho) = \mathbb{V}_{\nu\times\rho}\big(f(y,\mathbf{x};\bm{\theta})\big)- \mathbb{E}_{\rho\times\rho}\big[Cov_{\nu}(f(y,\mathbf{x};\bm{\theta}),f(y,\mathbf{x};\bm{\theta}'))\big], \end{equation} where \(\rho\times\rho\) denotes the joint distribution over \(\bm{\theta} \times\bm{\theta}\), \(\rho\times\nu\) denotes the joint distribution over \(\bm{\theta} \times({\cal X},{\cal Y})\), and \(Cov_{\nu}(\cdot,\cdot)\) is the covariance between two models with respect to the data generating distribution \(\nu\).
Proof. This result can be easily shown as follows. Begin with the definition of variance: \begin{equation} \mathbb{D} (\rho) = \mathbb{E}_{\nu}\Big[ \mathbb{V}_{\rho}\big[f(y, \bm{x}; \bm{\theta})\big] \Big] = \mathbb{E}_\nu \Big[ \mathbb{E}_{\rho^2} \big[f(y, \bm{x}; \bm{\theta})^2 - f(y, \bm{x} ;\bm{\theta})f(y, \bm{x} ;\bm{\theta}')\big] \Big]\,. \end{equation} Split the expectation in two \begin{equation} \mathbb{D} (\rho) = \mathbb{E}_{\nu}\Big[ \mathbb{V}_{\rho}\big[f(y, \bm{x}; \bm{\theta})\big] \Big] = \mathbb{E}_\nu \Big[ \mathbb{E}_{\rho} \big[f(y, \bm{x}; \bm{\theta})^2\big] - \mathbb{E}_{\rho^2} \big[f(y, \bm{x} ;\bm{\theta})f(y, \bm{x} ;\bm{\theta}')\big] \Big]\,. \end{equation} On one hand, use the definition of variance again, getting that: \begin{equation} \mathbb{E}_\nu \Big[ \mathbb{E}_{\rho} \big[f(y, \bm{x}; \bm{\theta})^2\big] = \mathbb{V}_{\nu \times \rho}\big[ f(y, \bm{x}; \bm{\theta}) \big] + \mathbb{E}_{\nu \times \rho}\big[ f(y, \bm{x}; \bm{\theta}) \big]^2\,. \end{equation} On the other hand, the definition of covariance gives: \begin{equation} \mathbb{E}_{\rho\times\rho}\Big[Cov_{\nu}\big(f(y,\bm{x};\bm{\theta}),f(y,\bm{x};\bm{\theta}')\big)\Big] = \mathbb{E}_{\nu \times \rho^2} \big[f(y, \bm{x} ;\bm{\theta})f(y, \bm{x} ;\bm{\theta}')\big] \Big] - \mathbb{E}_{\nu \times \rho}\big[ f(y, \bm{x}; \bm{\theta}) \big]^2\,. \end{equation} Using both terms, the proof is completed: \begin{equation} \mathbb{D} (\rho) = \mathbb{E}_{\nu}\Big[ \mathbb{V}_{\rho}\big[f(y, \bm{x}; \bm{\theta})\big] \Big] = \mathbb{V}_{\nu\times\rho}\big[f(y,\bm{x};\bm{\theta})\big] - \mathbb{E}_{\rho\times\rho}\Big[Cov_{\nu}\big(f(y,\bm{x};\bm{\theta}),f(y,\bm{x};\bm{\theta}')\big)\Big]\,. \end{equation} □
Theorem 5.7 Return to statement
For any prior distribution \(\pi\) over \(\bm{\theta}\) independent of \(D\), any \(\xi\in (0,1)\), and any \(\lambda>0\), with probability at least \(1-\xi\) over draws of training data \(D\sim \nu^n\), for all distributions \(\rho\) over \(\bm{\theta}\), simultaneously, \begin{equation} L(\rho) \leq \alpha\left(\mathbb{E}_\rho[\hat L(\bm{\theta},D)] - \hat{\mathbb{D}}(\rho,D) + \frac{2\mathrm{KL}(\rho|\pi) + \epsilon(\nu,\pi,\lambda,n,\xi)}{\lambda\, n}\right) \end{equation}
where \(\alpha\) is equal to \(1\) for the \(sq\)-loss or the \(ce\)-loss, and \(4\) for the \(0/1\)-loss. Furthermore, \(\epsilon\) is a positive function that is independent of \(\rho\) but also depends on the specific loss (the functional forms of these \(\epsilon\) terms can be found in the proof).
Proof. In order to prove this theorem, the r.h.s. term is shown to be an upper bound for \(\alpha(\mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho))\), which, using Theorem 5.1 concludes the proof. First of all, consider the following tandem losses: \begin{align} L_{sq}(\bm{\theta}, \bm{\theta}') & = L_{sq}(\bm{\theta})-\mathbb{E}_\nu\Big[h_R(\bm{x};\bm{\theta})^2 - h_R(\bm{x};\bm{\theta})h_R(\bm{x};\bm{\theta}')\Big]\,,\\ L_{0/1}(\bm{\theta}, \bm{\theta}') & = L_{0/1}(\bm{\theta}) - \mathbb{E}_\nu\Big[\mathbb{I}(h(\bm{x};\bm{\theta})= y)\mathbb{I}(h(\bm{x};\bm{\theta}')\neq y)\Big]\,,\\ L_{ce}(\bm{\theta}, \bm{\theta}') & = L_{ce}(\bm{\theta})- \mathbb{E}_\nu\left[\frac{p(y|\bm{x},\bm{\theta})^2 - p(y|\bm{x},\bm{\theta})p(y|\bm{x},\bm{\theta}')}{\displaystyle 2\max_{\bm{\theta}\in \bm{\Theta}} p(y|\bm{x},\bm{\theta})^2}\right]\,, \end{align} which verify \begin{equation} \mathbb{E}_{\rho^2}[L(\bm{\theta}, \bm{\theta}')] = \mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho). \end{equation} This can be shown using the fact that the variance of the classifiers can be decomposed as \begin{align} \operatorname{Var}_{\rho}\bigl(f(\bm \theta)\bigr) &= \mathbb{E}_{\rho}\left[f(\bm \theta)^2\right] - \mathbb{E}_{\rho}\left[f(\bm \theta)\right]^2 = \mathbb{E}_{\rho}\left[f(\bm \theta)^2\right] - \mathbb{E}_{\rho^2}\left[f(\bm \theta) f(\bm \theta')\right] \\ &= \mathbb{E}_{\rho^2}\left[ f(\bm \theta)^2 - f(\bm \theta) f(\bm \theta') \right]\,, \end{align} where the function \(f\) is determined by the loss under consideration: for the squared loss, \(f(\bm \theta)=h_R(\mathbf x;\bm \theta)\); for the cross-entropy loss, \(f(\bm \theta)=p(y|\mathbf x,\bm \theta)\); and for the \(0/1\)-loss, \(f(\bm \theta)=\mathbb{I}\bigl(h(\mathbf x;\theta)\neq y\bigr)\).
Applying Germain et al. (2016, Theorem 3) to the tandem loss functionals described above with a prior distribution \(\pi(\bm{\theta}, \bm{\theta}') = \pi(\bm{\theta})\pi(\bm{\theta}')\) raises that for any \(\lambda n > 0\) and \(\delta \in (0, 1]\), with probability at least \(1 - \delta\): \begin{equation} \mathbb{E}_\rho[L(\bm{\theta})] - \mathbb{D}(\rho) \leq \mathbb{E}_{\rho(\bm{\theta}, \bm{\theta}')}[ \hat{L}(\bm{\theta}, \bm{\theta}')] + \frac{1}{\lambda n} \left[ \mathrm{KL}(\rho(\bm{\theta}, \bm{\theta}')|\pi(\bm{\theta}, \bm{\theta}')) + \epsilon(\nu,\pi,\lambda,n,\xi) \right]. \end{equation} Where \begin{equation} \epsilon(\nu,\pi,\lambda,n,\xi) := \log \mathbb{E}_{\pi(\bm{\theta}, \bm{\theta}')} \left[\mathbb{E}_\nu\left[\exp \left(\lambda\left(L(\bm{\theta}, \bm{\theta}') - \hat{L}(\bm{\theta}, \bm{\theta}', D)\right)\right)\right]\right] + \log \frac{1}{\delta}, \end{equation} and \(\mathrm{KL}(\rho(\bm{\theta}, \bm{\theta}')|\pi(\bm{\theta}, \bm{\theta}')) = 2\mathrm{KL}(\rho|\pi)\). In short: \begin{equation} L(\rho) \leq \alpha\left( \mathbb{E}_\rho[\hat{L}(\bm{\theta}, D)] - \hat{\mathbb{D}}(\rho, D) + \frac{1}{\lambda n} \Big[ 2\operatorname{KL}(\rho|\pi) + \epsilon(\nu,\pi,\lambda,n,\xi) \Big]\right). \end{equation} □
Generalization Error and Chernoff Bounds
Proposition 5.12 Return to statement
Under Assumption 5.9, \(\forall\bm{\theta}\in\bm{\Theta}\), \(\mathcal{I}_{\bm{\theta}}(\cdot)\) and \(\mathcal{I}^{-1}_{\bm{\theta}}(\cdot)\), are well defined. That is, \(\forall a\in[0,L(\bm{\theta})-m_{\bm{\theta}})\), \(\mathcal{I}_{\bm{\theta}}(a)<\infty\) and \(\forall s\in\mathbb{R}^{+}_{0}\), \(\mathcal{I}^{-1}_{\bm{\theta}}(s)<\infty\).
Proof. First, from Assumption 5.9, it verifies that \(m_{\bm{\theta}}\geq 0\), then \(\ell(\mathbf{y},\mathbf{x},\bm{\theta}) \geq 0 \ \forall (\mathbf{x}, \mathbf{y}) \in \mathcal{X}\times\mathcal{Y}\). Then, \(\lambda(L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta})))\leq \lambda L(\bm{\theta})\). Taking exponential and expectations, \begin{equation} \mathbb{E}_{\nu}\left[e^{\lambda (L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))}\right] \leq \mathbb{E}_{\nu}\left[e^{\lambda L(\bm{\theta})}\right]\,. \end{equation} Lastly, the expectation on the r.h.s. is constant, leading to \begin{equation} J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu}\left[e^{\lambda (L(\bm{\theta})-\ell(\mathbf{y},\mathbf{x},\bm{\theta}))}\right] \leq \ln \mathbb{E}_{\nu}\left[e^{\lambda L(\bm{\theta})}\right] = \lambda L(\bm{\theta})\,. \end{equation} As, by Assumption 5.9, \(\forall\bm{\theta}\in\bm{\Theta}\), \(L(\bm{\theta})<\infty\) and the function \(J_{\bm{\theta}}(\lambda)\) is well-defined for \(\lambda>0\). It is left to show that the supremum over \(\lambda\) is reached in the definition of the rate function. For that, it will be shown that it is actually a maximum. Firstly, \begin{equation} \frac{\partial}{\partial \lambda} (\lambda a - J_{\bm{\theta}}(\lambda)) = a - \frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda)\,, \end{equation} where the second derivative is negative as \(\frac{\partial^2}{\partial \lambda ^2} J_{\bm{\theta}}(\lambda) \geq 0\) (the cumulant is convex). As a result, when the previous derivative is zero, the maximum is reached. In fact the optimum of \(\lambda\) is a \(\lambda^\star\) such that \(a=\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda^\star)\). Then, \(\forall a \in (0, L(\bm{\theta}) - m_{\bm{\theta}})\), it is necessary to show \(\exists \lambda^\star \in \mathbb{R}^+\). It verifies that \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda)\) is a continuous function, as it is combination of continuous functions: \begin{equation} \frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda) = \frac{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \ln p(\mathbf{y} | \mathbf{x}, \bm{\theta})]}{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda]} - \mathbb{E}_{\nu}[\ln p(\mathbf{y}|\mathbf{x}, \bm{\theta})]\,. \end{equation} By standard properties of the cumulant generating function, \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(0) = 0\). On the other had, also by standard properties of the cummulant (Herdegen, 2008, Lemma 1), \begin{equation} \label{eq:limitGradientJ} \lim_{\lambda\rightarrow \infty } \frac{\partial}{\partial \lambda}J_{\bm{\theta}}(\lambda) = ess\sup_{(\mathbf{x}, \mathbf{y})}\ L(\bm{\theta}) - \ell(\mathbf{y},\mathbf{x},\bm{\theta}) = L(\bm{\theta}) - m_{\bm{\theta}}\,. \end{equation} Thus, if \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda)\) is continuous, \(\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(0) = 0\) and \(\lim_{\lambda\rightarrow \infty } \frac{\partial}{\partial \lambda}J_{\bm{\theta}}(\lambda)=L(\bm{\theta}) - m_{\bm{\theta}}\), then \(\forall a \in [0, L(\bm{\theta}) - m_{\bm{\theta}})\) there always exist a \(\lambda^\star \in \mathbb{R}^+\) such that \(a=\frac{\partial}{\partial \lambda} J_{\bm{\theta}}(\lambda^\star)\). Thus, for such values of \(a\) the rate function is finite and well defined.
For \(a\geq L(\bm{\theta})- m_{\bm{\theta}}\), it verifies that the supremum is reached when \(\lambda\to \infty\), because \(J_{\bm{\theta}}(\lambda)\) is monotonically increasing. Every \(a\geq L(\bm{\theta})- m_{\bm{\theta}}\), can be written as \(a=L(\bm{\theta})-b\), where \(b\leq m_{\bm{\theta}}\). Then the limit when \(\lambda\to \infty\) for any \(a\geq L(\bm{\theta})- m_{\bm{\theta}}\) can be written as follows, \begin{align} \lim_{\lambda\rightarrow\infty} \lambda (L(\bm{\theta})- b) - J_{\bm{\theta}}(\lambda) &= \lim_{\lambda\rightarrow\infty} -\lambda b -\ln \mathbb{E}_\nu\left[p(\mathbf{y}|\mathbf{x},\bm{\theta})^\lambda\right] = \lim_{\lambda\rightarrow\infty} -\ln \mathbb{E}_\nu\left[\left(\frac{p(\mathbf{y}|\mathbf{x},\bm{\theta})}{e^{-b}}\right)^\lambda\right]\\ &= -\ln \mathbb{E}_\nu\left[\lim_{\lambda\rightarrow\infty}\left(\frac{p(\mathbf{y}|\mathbf{x},\bm{\theta})}{e^{-b}}\right)^\lambda\right] = -\ln \mathbb{E}_\nu\left[\mathbb{I}(p(\mathbf{y}|\mathbf{x},\bm{\theta})=e^{-b}\right]\\ &= -\ln \mathbb{E}_\nu[\mathbb{I}(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=b)] = -\ln \mathbb{P}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=b)\,. \end{align} If \(b< m_{\bm{\theta}}\) or, equivalently, \(a>L(\bm{\theta})-m_{\bm{\theta}}\), then \(\mathcal{I}_{\bm{\theta}}(a)=\infty\). Moreover, if \(a=L-m_{\bm{\theta}}\), then \(\mathcal{I}_{\bm{\theta}}(a) = -\ln \mathbb{P}_\nu(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})\), which may be well-defined or equal to \(\infty\). By definition of the inverse rate function, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s)=\inf_{\lambda\geq 0}\frac{s+J_{\bm{\theta}}(\lambda)}{\lambda}\leq \lim_{\lambda\rightarrow\infty}\frac{s+J_{\bm{\theta}}(\lambda)}{\lambda} =\lim_{\lambda\rightarrow\infty}\frac{J_{\bm{\theta}}(\lambda)}{\lambda} \end{equation} Then, by L’Hôpital’s rule and Equation \(\eqref{eq:limitGradientJ}\), it verifies that \begin{equation} \lim_{\lambda\rightarrow\infty}\frac{J_{\bm{\theta}}(\lambda)}{\lambda} =\lim_{\lambda\rightarrow\infty}\nabla_\lambda J_{\bm{\theta}}(\lambda) =L(\bm{\theta}) - m_{\bm{\theta}} \end{equation} From Assumption 5.9, \(L(\bm{\theta})<\infty\) and \(m_{\bm{\theta}}\geq 0\), then, \(\forall s\geq 0\), \(\mathcal{I}^{-1}_{\bm{\theta}}(s)<\infty\). □
Theorem 5.14 Return to statement
For any fixed \(\bm{\theta}\in\bm{\Theta}\) and \(a>0\), it satisfies \begin{equation} \mathbb{P}_{D\sim \nu^n}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq a\Big)\leq e^{-n \mathcal{I}_{\bm{\theta}}(a)}\,. \end{equation}
Proof. Cramér-Chernoff’s bound states that for any random variable \(X\), it verifies that \(P(X \geq s) \leq \inf_{t > 0} \mathbb{E}[e^{t X}]e^{-t s}\). Applying this result to the random variable over possible datasets \(\hat{L}(D,\bm{\theta}) - L(\bm{\theta})\), for a fixed \(\bm{\theta}\in\bm{\Theta}\), leads to \begin{equation} P( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq s) \leq \inf_{t > 0} \mathbb{E}\left[e^{t ( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) ) }\right]e^{- t s}\,. \end{equation} The expectation in the r.h.s. can be transformed as \begin{align} \mathbb{E}\left[e^{t ( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) )}\right] &= \mathbb{E}\left[e^{t ( \tfrac{1}{n}\ln p(D|\bm{\theta})- \mathbb{E}_\nu[\ln p(\mathbf{y}| \mathbf{x}, \bm{\theta})]}\right] = \mathbb{E}\left[e^{\tfrac{t}{n}\ln p(D|\bm{\theta})}\right] e^{-t\mathbb{E}_\nu[\ln p(\mathbf{y}| \mathbf{x}, \bm{\theta})]}\,. \end{align} Moreover, the first expectation in this last term is \begin{align} \mathbb{E}\left[e^{\tfrac{t}{n}\ln p(D|\bm{\theta})}\right] &= \mathbb{E}_{\nu^n}\left[p(D|\bm{\theta})^{\tfrac{t}{n}}\right] = \mathbb{E}_{\nu}\left[P(\mathbf{y}|\mathbf{x},\bm{\theta})^{\tfrac{t}{n}}\right]^n =\mathbb{E}\left[e^{\tfrac{t}{n}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta})}\right]^n\,. \end{align} Using this, and parameterizing \(t\) as \(\lambda n\), with \(\lambda > 0\), \begin{equation} P(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq s) \leq \inf_{\lambda > 0} \mathbb{E}\left[e^{\lambda ( L(\bm{\theta}) - \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) }\right]^n e^{- \lambda n s}\,. \end{equation} Taking exponential and logarithm on the r.h.s, and using the definition of the smoothness function \(J_{\bm{\theta}}(\lambda)\) and the rate function \(\mathcal{I}_{\bm{\theta}}(a)\): \begin{align} P(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq s) &\leq \inf_{\lambda > 0} e^{ n \ln \mathbb{E}\left[e^{\lambda (L(\bm{\theta}) - \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))}\right] - \lambda ns} \leq \inf_{\lambda > 0} e^{nJ_{\bm{\theta}}(\lambda) - \lambda n s} = e^{-n \mathcal{I}_{\bm{\theta}}(s)}\,. \end{align} □
Proposition 5.15 Return to statement
For any fixed \(\bm{\theta}\in\bm{\Theta}\) and \(n>0\), it satisfies \begin{equation} \lim_{a\rightarrow L(\bm{\theta}) - m_{\bm{\theta}}}\mathbb{P}_{D\sim \nu^n}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq a\Big) = \lim_{a\rightarrow L(\bm{\theta}) - m_{\bm{\theta}}} e^{-n \mathcal{I}_{\bm{\theta}}(a)}\,. \end{equation}
Proof. It is clear that the limit on both sides is \begin{equation} \mathbb{P}_{D \sim \nu_ n}(\hat{L}(D, \bm{\theta})=m_{\bm{\theta}}) = n\mathbb{P}_{(\mathbf y, \mathbf x) \sim \nu}(\ell(\mathbf{y},\mathbf{x},\bm{\theta})=m_{\bm{\theta}})\,. \end{equation} □
Theorem 5.16 Return to statement
For any fixed \(\bm{\theta}\in \bm{\Theta}\) and any \(a>0\), it satisfies \begin{equation} \lim_{n\rightarrow \infty} -\frac{1}{n}\log \mathbb{P}_{D \sim \nu^n}\Big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq a\Big) = \mathcal{I}_{\bm{\theta}}(a)\,. \end{equation}
Proof. Direct application of Cramér’s Theorem (Cramér, 1938; Ellis, 2012), over the random variable \(X=L(\bm{\theta}) - L(D,\bm{\theta})\), for a fixed \(\bm{\theta}\), where the randomness comes from \(D\sim\nu^n\). Similar to the application of Chernoff’s bound in Theorem 5.14. □
Theorem 5.17 Return to statement
[PAC-Chernoff Bound] With h.p. \(1 - \delta\) over \(D \sim \nu^n\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} L(\bm{\theta}) \leq \hat{L}(D, \bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta})\,. \end{equation}
Proof. By Chernoff’s Theorem 5.14, for a given \(\bm{\theta}\), it verifies that \(\mathbb{P}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq a\Big)\leq e^{-n \mathcal{I}_{\bm{\theta}}(a)}\). Naming \(\delta' = e^{-n \mathcal{I}_{\bm{\theta}}(a)}\) and re-arranging terms, \(a = \mathcal{I}^{-1}_{\bm{\theta}}(-\frac{1}{n}\ln \delta')\). As a result: \begin{equation} \mathbb{P}\Big(L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\leq \delta'\,. \end{equation} Taking the complementary probability and using the union bound over the set of models, \begin{equation} \mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\leq \sum_{\bm{\theta}\in\bm{\Theta}} \mathbb{P}\Big( L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\,. \end{equation} Given that the model space considers \(k^p\) different models, the r.h.s. can be rewritten as \begin{equation} \mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln \tfrac{1}{\delta'}) \Big)\leq k^p \delta' \,. \end{equation} By reparameterizing the above inequality with \(\delta'=\delta k^{-p}\): \begin{equation} \mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \Big)\leq \delta \,. \end{equation} Which verifies, \begin{equation} 1-\mathbb{P}\Big( \bigcup_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \Big)\geq 1-\delta\,. \end{equation} Which is equivalent to, \begin{equation} \mathbb{P}\Big( \bigcap_{\bm{\theta}\in\bm{\Theta}} L(\bm{\theta}) - \hat{L}(D,\bm{\theta}) \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \Big)\geq 1-\delta \,. \end{equation} □
Corollary 5.18 Return to statement
With h.p. \(1 - \delta\) over \(D \sim \nu^n\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} L(\bm{\theta}) \leq \hat{L}(D, \bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{|\bar{\bm{\Theta}}|}{\delta})\,. \end{equation}
Proof. The proof follows the same approach as Theorem 5.17, with the union bound applied to the models in \(\bar{\bm{\Theta}}\) rather than directly to the models in \(\bm{\Theta}\). This adjustment is valid because, when applying the union bound, only the total number of distinct random variables needs to be considered; in this case, \(L(D,\bm{\theta})\) with \(D\sim\nu^n\). Models in \(\bm{\Theta}\) that are not in \(\bar{\bm{\Theta}}\) define random variables that are duplicated by definition and therefore do not need to be accounted for in the application of the union bound. □
Theorem 5.19 Return to statement
For any \(\delta \in (0, 1)\) and any \(\bm{\theta}\in \bm{\Theta}\), it verifies that \begin{equation} \lim_{n\rightarrow\infty} \ \sqrt{n}\, \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) = \sqrt{2\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\log \tfrac{k^p}{\delta}}\,. \end{equation}
Proof. Let \(\bar{Z}_n := \sqrt{n} (L(\bm{\theta}) - L(D,\bm{\theta}))\) and \(J_{\bar{Z}_n}, {\cal I}_{\bar{Z}_n}\) and \({\cal I}^{-1}_{\bar{Z}_n}\) denote its cumulant, rate and inverse rate function. Then, by properties of the cumulant, it verifies that \(J_{\bar{Z}_n}(\lambda) = nJ_{\bm{\theta}}(\lambda/\sqrt{n})\). Using the definition of the rate function: \begin{equation} {\cal I}_{\bar{Z}_n}(a) =\sup_{\lambda} \ a\lambda+n J_{\bm{\theta}}(\lambda/\sqrt{n})=n\sup_{\lambda} \ \frac{\lambda}{\sqrt{n}}\frac{a}{\sqrt{n}}+ J_{\bm{\theta}}(\lambda/\sqrt{n})=n\mathcal{I}_{\bm{\theta}}(a/\sqrt{n})\,. \end{equation} Using a second order Taylor expansion over \(\mathcal{I}_{\bm{\theta}}(a)\) around \(a=0\), where \(\mathcal{I}_{\bm{\theta}}(0) = 0\) and \(\frac{\partial}{\partial a}\mathcal{I}_{\bm{\theta}}(0) = 0\): \begin{equation} \mathcal{I}_{\bm{\theta}}(a/\sqrt{n}) = \frac{1}{2} \frac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) \left(\frac{a}{\sqrt{n}}\right)^2 + \left(\frac{a}{ \sqrt{n}}\right)^2 h(a/ \sqrt{n})\,, \end{equation} where \(\lim_{t\to 0} h(t) = 0\). Furthermore, it verifies that (Ellis, 2012) \begin{equation} \frac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) = \left( \frac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(0)\right)^{-1}\,. \end{equation} From there: \begin{align} \frac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(\lambda)&= - \frac{\partial}{\partial \lambda} \frac{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})} \ell(\mathbf{y},\mathbf{x},\bm{\theta})]}{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})}]}\\ &= \frac{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})} (\ell(\mathbf{y},\mathbf{x},\bm{\theta}))^2]}{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})}]} - \frac{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})} \ell(\mathbf{y},\mathbf{x},\bm{\theta})]^2}{\mathbb{E}_\nu[ e^{-\lambda \ell(\mathbf{y},\mathbf{x},\bm{\theta})}]^2}. \end{align} Which evaluated at \(\lambda=0\), gives \begin{equation} \frac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(0)= \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\,. \end{equation} Let this last quantity be written as \(\sigma^2 := \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\). Then, \begin{align} \lim_{n\rightarrow\infty} {\cal I}_{\bar{Z}_n}(a) &= \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}}(a/\sqrt{n}) = \lim_{n\rightarrow\infty} n \frac{1}{2} \sigma^{-2} \left(\frac{a}{ \sqrt{n}}\right)^2 + n \left(\frac{a}{ \sqrt{n}}\right)^2 h(a/ \sqrt{n})\\ &=\frac{1}{2} \sigma^{-2} a^2\,. \end{align} Then, as both the rate and its inverse are continuous functions, it verifies that \begin{equation} \lim_{n\rightarrow\infty} {\cal I}^{-1}_{\bar{Z}_n}(s) = \left(\lim_{n\rightarrow\infty} {\cal I}_{\bar{Z}_n}\right)^{-1}(s) = \sqrt{2\sigma^2s} \end{equation} By definition of the inverse rate: \begin{align} {\cal I}^{-1}_{\bar{Z}_n}(s) &= \inf_{\lambda} \frac{s+J_{\bar{Z}_n}(\lambda)}{\lambda}=\inf_{\lambda} \frac{s+n J_{\bm{\theta}}(\lambda/\sqrt{n})}{\lambda}\\ &=\sqrt{n}\inf_{\lambda} \frac{\frac{s}{n}+n J_{\bm{\theta}}(\lambda/\sqrt{n})}{\frac{\lambda}{\sqrt{n}}}=\sqrt{n}\mathcal{I}^{-1}_{\bm{\theta}}(s/n)\,. \end{align} As a result, \begin{equation} \lim_{n\rightarrow\infty} \sqrt{n}\mathcal{I}^{-1}_{\bm{\theta}}(s/n) = \lim_{n\rightarrow\infty} {\cal I}^{-1}_{\bar{Z}_n}(s) = \sqrt{2\sigma^2s}\,. \end{equation} □
Proposition 5.20 Return to statement
With h.p. \(1 - \delta\) over \(D \sim \nu^n\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} \text{if } \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad\text{then}\quad 0 \leq L(\bm{\theta}) - \mathcal{I}^{-1}_{\bm{\theta}}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \epsilon\,. \end{equation}
Proof. By Theorem 5.17, \(L(\bm{\theta}) \leq L(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\), and, by Proposition 5.12, \(\mathcal{I}_{\bm{\theta}}(a)\) is well defined \(\forall a\in[0,L(\bm{\theta})-m_{\bm{\theta}})\); and, \(\forall b > 0\), it verifies that \(\mathcal{I}^{-1}_{\bm{\theta}}(b)\in [0,L(\bm{\theta})-m_{\bm{\theta}})\). In consequence: \begin{equation} L(\bm{\theta}) \leq L(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq L(D,\bm{\theta}) + L(\bm{\theta}) -m_{\bm{\theta}}\,. \end{equation} The result follows from \(m_{\bm{\theta}}\geq0\), \(L(D,\bm{\theta})\leq \epsilon\), and rearranging terms. □
Proposition 5.22 Return to statement
For any \(\bm{\theta} \in \bm{\Theta}\) and \(\bm{\theta}'\in\bm{\Theta}'\), \begin{equation} \exists \beta > 0 \text{ s.t. } \bm{\theta} \text{ is } \beta\text{-smoother than }\bm{\theta}' \iff \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}' ))\,. \end{equation}
Proof. Let \(\bar{Z}_n := \sqrt{n} (L(\bm{\theta}) - L(D,\bm{\theta}))\) and \(J_{\bar{Z}_n}, {\cal I}_{\bar{Z}_n}\) and \({\cal I}^{-1}_{\bar{Z}_n}\) denote its cumulant, rate, and inverse rate function. Then, by properties of the cumulant, it verifies that \(J_{\bar{Z}_n}(\lambda) = nJ_{\bm{\theta}}\big(\tfrac{\lambda}{\sqrt{n}}\big)\). Using the definition of the rate function: \begin{equation} {\cal I}_{\bar{Z}_n}(a) =\sup_{\lambda} \ a\lambda+n J_{\bm{\theta}}\big(\tfrac{\lambda}{\sqrt{n}}\big)=n\sup_{\lambda} \ \tfrac{\lambda}{\sqrt{n}}\tfrac{a}{\sqrt{n}}+ J_{\bm{\theta}}\big(\tfrac{\lambda}{\sqrt{n}}\big)=n\mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}})\,. \end{equation} Using a second order Taylor expansion over \(\mathcal{I}_{\bm{\theta}}(a)\) around \(a=0\), where \(\mathcal{I}_{\bm{\theta}}(0) = 0\) and \(\frac{\partial}{\partial a}\mathcal{I}_{\bm{\theta}}(0) = 0\), it verifies that \begin{equation} \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) = \tfrac{1}{2} \tfrac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) \big(\tfrac{a}{ \sqrt{n}}\big)^2 + \big(\tfrac{a}{ \sqrt{n}}\big)^2 h\big(\tfrac{a}{ \sqrt{n}}\big)\,, \end{equation} where \(\lim_{t\to 0} h(t) = 0\). Furthermore, it verifies that \begin{equation} \tfrac{\partial^2}{\partial a^2} \mathcal{I}_{\bm{\theta}}(0) = \left( \tfrac{\partial^2}{\partial \lambda^2}J_{\bm{\theta}}(0)\right)^{-1}= \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-1} =: \sigma^{-2}\,. \end{equation} Then, \begin{equation} \label{eq:limitrate} \lim_{n\rightarrow\infty} {\cal I}_{\bar{Z}_n}(a) = \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) = \lim_{n\rightarrow\infty} n \frac{1}{2} \sigma^{-2} \big(\tfrac{a}{ \sqrt{n}}\big)^2 + n \big(\tfrac{a}{ \sqrt{n}}\big)^2 h\big(\tfrac{a}{ \sqrt{n}}\big)=\frac{1}{2} \sigma^{-2} a^2\,. \end{equation}
Let us assume that \(\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}' ))\). Then, the aim is to show that there exists \(\beta > 0\) such that \(\mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a) \quad \forall a \leq \beta\). For a fixed value of \(a\), by the limit in Equation \(\eqref{eq:limitrate}\), it verifies that for any \(\epsilon > 0\), there exists \(n_0(a), n_0'(a) > 0\) such that \begin{equation} \big| n \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) - \tfrac{1}{2}\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-2}a^2\big| < \epsilon \quad \forall n > n_0(a)\,, \end{equation} and \begin{equation} \big| n \mathcal{I}_{\bm{\theta}'}(\tfrac{a}{\sqrt{n}}) - \tfrac{1}{2}\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))^{-2}a^2\big| < \epsilon \quad \forall n > n_0'(a)\,. \end{equation} Let \(\epsilon = \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-2}a^2 - \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))^{-2}a^2 \geq 0\) and \(n_0^\star(a) = \max\{n_0(a), n_0'(a)\}\). Then, \begin{equation} \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) \geq \mathcal{I}_{\bm{\theta}'}(\tfrac{a}{\sqrt{n}}) \quad \forall n > n_0^\star(a) \,. \end{equation} This implies that \(\mathcal{I}_{\bm{\theta}}(c) \geq \mathcal{I}_{\bm{\theta}'}(c)\), for any \(c \leq a/\sqrt{n_0^\star(a)}\), as there exists \(n = (a/c)^2\) verifying \(c = a/\sqrt{n}\). Consider then \(\beta := \sup_{a> 0 } \{a/\sqrt{n_0^\star(a)}\} > 0\), which might be infinite. It verifies that \begin{equation} \mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a) \quad \forall a \leq \beta\,. \end{equation} Let us now assume that there exists \(\beta > 0\) such that \(\bm{\theta}\) is \(\beta\)-smoother than \(\bm{\theta}'\), or, equivalently \(\mathcal{I}_{\bm{\theta}}(a) \geq \mathcal{I}_{\bm{\theta}'}(a) \quad \forall a \leq \beta\). Then, \begin{equation} \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}}(\tfrac{a}{\sqrt{n}}) \geq \lim_{n\rightarrow\infty} n \mathcal{I}_{\bm{\theta}'}(\tfrac{a}{\sqrt{n}}) \,. \end{equation} Using the limit in Equation \(\eqref{eq:limitrate}\), \begin{equation} \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))^{-2}a^2 \geq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))^{-2}a^2\,, \end{equation} leading to \begin{equation} \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) \leq \mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}'))\,. \end{equation} □
Theorem 5.23 Return to statement
For any \(\epsilon\geq 0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta} \in \bm{\Theta} \subset \mathbb{R}^p\) and \(\bm{\theta}'\in\bm{\Theta}'\), simultaneously, \begin{equation} \text{if $\hat{L}(D, \bm{\theta})\leq \epsilon$ and $\bm{\theta}$ is \ \(\mathcal{I}^{-1}_{\bm{\theta}} \big(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}\big)\)-smoother than $\bm{\theta}'$, then, $L(\bm{\theta})\leq L(\bm{\theta}')+\epsilon$}\,. \end{equation}
Proof. If \(\bm{\theta}\) is \(\beta\)-smoother than \(\bm{\theta}'\), by Definition 5.21, \(\forall a\in(0,\beta] \quad \mathcal{I}_{\bm{\theta}}(a)\geq \mathcal{I}_{\bm{\theta}'}(a)\), where \(\beta = \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Then, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} As the rate function \(\mathcal{I}_{\bm{\theta}'}(a)\) is invertible and its image lies in \([0,L(\bm{\theta}')-m_{\bm{\theta}'})\), where \(m_{\bm{\theta}'}\geq 0\), due to Assumption 5.9, it verifies that \(\mathcal{I}^{-1}_{\bm{\theta}'}(s)\in[0,L(\bm{\theta}')-m_{\bm{\theta}'})\). In consequence, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq L(\bm{\theta}')\,. \end{equation} As \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\), it verifies \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq L(\bm{\theta}')\). By the PAC-Chernoff bound of Theorem 5.17 and because \(L(D,\bm{\theta})\leq \epsilon\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), \begin{equation} L(\bm{\theta})\leq L(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} Combining the last two inequalities, \begin{equation} L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \leq \epsilon + L(\bm{\theta}')\,. \end{equation} The statement of the theorem directly derives from the above inequality. □
Theorem 5.26 Return to statement
For any \(\epsilon>0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), \(|L(\bm{\theta}^\star_\epsilon) - L(\bm{\theta}^{\times}_\epsilon)|\leq \epsilon\).
Proof. From the definitions of \(\bm{\theta}^{\times}_\epsilon\) and \(\bm{\theta}^\star_\epsilon\), given by \begin{equation} \bm{\theta}^\star_\epsilon = \argmin_{\bm{\theta}\,:\,L(D,\bm{\theta})\, \leq \, \epsilon}\, L(\bm{\theta})\,, \quad\quad \bm{\theta}^{\times}_\epsilon = \argmin_{\bm{\theta}\,:\,L(D,\bm{\theta})\, \leq \, \epsilon} \, L(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,, \end{equation} it is clear that \(L(\bm{\theta}^\star_\epsilon) \leq L(\bm{\theta}^{\times}_\epsilon)\). On the other hand, because for any \(s\geq 0\), \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\in[0,L(\bm{\theta})-m_{\bm{\theta}})\) (Proposition 5.13), it verifies that \(\forall\bm{\theta}\in\bm{\Theta}\): \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + L(D,\bm{\theta}) \leq L(\bm{\theta}) +L(D,\bm{\theta})\,. \end{equation} By definition of \(\bm{\theta}^{\times}_\epsilon\) and \(\bm{\theta}^{\star}_\epsilon\), \begin{equation} \mathcal{I}_{\bm{\theta}^{\times}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\times}_\epsilon) \leq \mathcal{I}_{\bm{\theta}^{\star}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\star}_\epsilon)\,, \end{equation} which gives \begin{equation} \mathcal{I}_{\bm{\theta}^{\times}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\times}_\epsilon) \leq L(\bm{\theta}^{\star}_\epsilon) + \hat{L}(D, \bm{\theta}^{\star}_\epsilon) \leq L(\bm{\theta}^{\star}_\epsilon) + \epsilon\,. \end{equation} This, in combination with the PAC-Chernoff bound of Theorem 5.17 gives \begin{equation} L(\bm{\theta}^{\times}_\epsilon) \leq \mathcal{I}_{\bm{\theta}^{\times}_\epsilon}^{-1}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) + \hat{L}(D, \bm{\theta}^{\times}_\epsilon) \leq L(\bm{\theta}^{\star}_\epsilon) + \epsilon\,. \end{equation} From this, \(L(\bm{\theta}^{\times}_\epsilon)\leq L(\bm{\theta}^\star_\epsilon) +\epsilon\). Thus, \(L(\bm{\theta}^\star_\epsilon) \leq L(\bm{\theta}^{\times}_\epsilon)\) and \(L(\bm{\theta}^{\times}_\epsilon)\leq L(\bm{\theta}^\star_\epsilon) +\epsilon\), finishing the proof. □
Proposition 5.27 Return to statement
If the loss function \(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\) is Lipschitz w.r.t. \(\bm{\theta}\) with constant \(M > 0\), then, for any \(\bm{\theta}_0 \in \bm{\Theta}_0 = \{\bm{\theta} \in \bm{\Theta} \ | \ \mathbb{V}_{\nu}(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) = 0\}\), it verifies that \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \sqrt{2Ma}\ \|\bm{\theta} - \bm{\theta}_0\|_2\,, \end{equation} where \(a = min\big(1 , \tfrac{1}{n}\log\tfrac{k^p}{\delta} \big)\).
Proof. If the loss is Lipschitz continuous with constant \(M\), \(\forall y,x,\bm{\theta}\quad \|\nabla_{\bm{\theta}} \ell(\mathbf{y},\mathbf{x},\bm{\theta})\|^2_2\leq M\). Then, \(J_{\bm{\theta}}(\lambda)\) verifies \begin{align} \|\nabla_{\bm{\theta}}J_{\bm{\theta}}(\lambda)\|^2_2&=||-\lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] + \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}) \right]||_2^2\\ &\leq \lambda^2 \mathbb{E}_{\nu p^\lambda}\left[\|\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\|_2^2\right] + \lambda^2\mathbb{E}_\nu\left[\|\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}) \|_2^2\right]\\ &\leq 2M \lambda^2\,, \end{align} where \(\mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] = \frac{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})]}{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda]}\). With this, \begin{equation} |J_{\bm{\theta}}(\lambda) - J_{\bm{\theta}_0}(\lambda)|\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2 \implies J_{\bm{\theta}}(\lambda)\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2\,. \end{equation} Then, for any \(a \geq 0\), it verifies \(\frac{a+J_{\bm{\theta}}(\lambda)}{\lambda}\leq \frac{a+2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2}{\lambda}\); where by definition of the inverse rate, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(a) \leq \frac{a+J_{\bm{\theta}}(\lambda)}{\lambda}\leq \frac{a+2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2}{\lambda}\,. \end{equation} As the inequality holds for any \(\lambda \geq 0\), take the one that minimizes the r.h.s, leading to \(\mathcal{I}^{-1}_{\bm{\theta}}(a)\leq \sqrt{2M a}\|\bm{\theta}-\bm{\theta}_0\|_2\). On the other hand, \(\mathcal{I}^{-1}_{\bm{\theta}}(a)\) is Lipschitz with constant \(M\) as \begin{equation} \| \nabla_{\bm{\theta}} \mathcal{I}^{-1}_{\bm{\theta}}(a)\|^2_2 = \Big\|\frac{\nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda^\star_a)}{\lambda^\star_a}\Big\|^2_2\leq \frac{\|\nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda^\star_a)\|^2_2}{\lambda^{\star,2}_a}\leq 2 M\,. \end{equation} In consequence, \(\textstyle (\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})-{\cal I}^{-1}_{\bm{\theta}_0}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}))^2 \leq 2M ||\bm{\theta} - \bm{\theta}_0||^2\). Which implies that \(\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \leq \sqrt{2M} ||\bm{\theta} - \bm{\theta}_0||_2\). Thus, it simultaneously holds that \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) \leq \sqrt{2M} ||\bm{\theta} - \bm{\theta}_0||_2 \quad \text{and} \quad \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\leq \sqrt{2M \tfrac{1}{n}\ln\tfrac{k^p}{\delta}}\|\bm{\theta}-\bm{\theta}_0\|_2\,. \end{equation} □
Corollary 5.28 Return to statement
Considering the log-loss, if it belongs to the exponential family with a constant base measure, that is, there exist \(s:\mathcal{X}\times \mathcal{Y} \to \mathbb{R}^p\), \(a:\bm{\Theta} \to \mathbb{R}\) and \(k \in \mathbb{R}\) such that \begin{equation} \ell(\mathbf{y},\mathbf{x},\bm{\theta}) := -\log p(\mathbf{y}|\mathbf{x}, \bm{\theta}) = \bm{\theta}^T s(\mathbf{x}, \mathbf{y}) - a(\bm{\theta}) + k\quad \forall (\mathbf{x}, \mathbf{y}) \sim \nu\,. \end{equation} Then, \(\forall \epsilon > 0\) exists \(n_0 > 0\) such that \(\forall n > n_0\): \begin{equation} \left|\mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) - \sqrt{2\tfrac{1}{n}\log\tfrac{k^p}{\delta}} \sqrt{\bm{\theta}^T \text{Cov}_\nu(s(\mathbf{y}, \mathbf{x})) \bm{\theta}}\right|\leq \epsilon\,, \end{equation} where \(\text{Cov}_{\nu}(\cdot)\) is the covariance w.r.t. \(\nu\) of the sufficient statistics of each \((\mathbf{x},\mathbf{y})\) sample.
Proof. Using Theorem 5.19, we got that \(\lim_{n\rightarrow\infty} \ \sqrt{n}\, \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta}) = \sqrt{2\mathbb{V}_\nu( \ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))\ln \tfrac{k^p}{\delta}}\). The proof concludes from the fact that using the exponential family, \begin{equation} \mathbb{V}_\nu\big(\ell(\mathbf{y},\mathbf{x},\bm{\theta}) \big) = \mathbb{V}_\nu\big(\bm{\theta}^T s(\mathbf{y},\mathbf{x}) - a(\bm{\theta}) + k\big) = \bm{\theta}^T\text{Cov}_{\nu} \big(s(\mathbf{y},\mathbf{x})\big) \bm{\theta}\,. \end{equation} □
Proposition 5.29 Return to statement
For any \(\bm{\theta} \in \bm{\Theta}\), if the equality of Equation \(\eqref{eq:secondorder}\) holds, then \begin{equation} \textstyle \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) = \sqrt{2\tfrac{1}{n}\log\tfrac{k^p}{\delta}}\sqrt{\big(\bm{\theta} - \bm{\theta}_0\big)^T \text{Cov}_{\nu} \big(\nabla_{\bm{\theta}} \log p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)\big)\big(\bm{\theta} - \bm{\theta}_0\big)}\,. \end{equation}
Proof. A second-order Taylor expansion of \(J_{\bm{\theta}}(\lambda)\) w.r.t. to \(\bm{\theta}\) centered around \(\bm{\theta}_0\) is \begin{equation} J_{\bm{\theta}_0}(\lambda) + \nabla_{\bm{\theta}} J_{\bm{\theta}_0}(\lambda)(\bm{\theta}-\bm{\theta}_0) + \frac{1}{2} (\bm{\theta}-\bm{\theta}_0)^T\nabla_{\bm{\theta} \bm{\theta}}J_{\bm{\theta}_0}(\lambda)(\bm{\theta}-\bm{\theta}_0)\,. \end{equation} By standard properties of the cumulant generating function over centered random variables, it verifies that \(J_{\bm{\theta}_0}(\lambda)=0\). While the \(\nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda)\) can be expressed as, \begin{equation} \nabla_{\bm{\theta}} J_{\bm{\theta}}(\lambda) = \lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta})\right] - \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta}) \right]\,, \end{equation} where \(\nu p^{\lambda}\) denotes \(\nu p^{\lambda}(\mathbf{y},\mathbf{x}) = \frac{\nu(\mathbf{y},\mathbf{x})p(\mathbf{y}|\mathbf{x},\bm{\theta})^{\lambda}}{\mathbb{E}_\nu[p(\mathbf{y}|\mathbf{x},\bm{\theta})^{\lambda}]}\). At \(\bm{\theta}_0\), it verifies that \(\nu p^{\lambda} = \nu\), because, by definition, \(p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)\) is constant for any \((\mathbf{y},\mathbf{x})\). In consequence, the gradient at \(\bm{\theta}_0\) simplifies as, \begin{equation} \nabla_{\bm{\theta}} J_{\bm{\theta}_0}(\lambda) = \lambda \mathbb{E}_{\nu}\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta})\right] - \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta}) \right] = 0\,. \end{equation}
The Hessian of the cumulant \(J_{\bm{\theta}}(\lambda)\) w.r.t. to \(\bm{\theta}\) can be written as follows: \begin{align} \nabla_{\bm{\theta} \bm{\theta}}J_{\bm{\theta}}(\lambda) &= \lambda^2 Cov_{\nu p^\lambda} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}) ) + \lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta} \bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta})\right] \\ &\quad- \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta} \bm{\theta}}\ln p(\mathbf{y}|\mathbf{x},\bm{\theta}) \right]\,. \end{align} Again, at \(\bm{\theta}_0\), it verifies that \(\nu p^{\lambda} = \nu\), so the Hessian of the cumulant \(J_{\bm{\theta}}(\lambda)\) at \(\bm{\theta}_0\) simplifies as, \(\nabla_{\bm{\theta} \bm{\theta}}J_{\bm{\theta}_0}(\lambda) = \lambda^2 Cov_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0) )\). With this, the second order Taylor expansion of \(J_{\bm{\theta}}(\lambda)\) evaluated on \(\bm{\theta}_0\) is \(\frac{\lambda^2 }{2}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)\).
On the other hand, if, at the definition of \(\mathcal{I}^{-1}_{\bm{\theta}}(s)\) given in Equation \(\eqref{eq:inverseratefunction}\), replace \(J_{\bm{\theta}}(\lambda)\) by the above Taylor expansion, leading to \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) = \inf_{\lambda>0} \ \frac{s}{\lambda} + \frac{\lambda}{2}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)\,. \end{equation} The optimal value of \(\lambda\) is acquired taking derivatives on the above expression w.r.t. \(\lambda\), giving, \begin{equation} \frac{-s}{\lambda^2} + \frac{1}{2}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0) = 0\,. \end{equation} One may show that it is a minimum by taking the second derivative. From this, the optimal value of \(\lambda\) is \(\lambda^\star = \left(\frac{2s}{(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)}\right)^{\frac{1}{2}}\). Using this expression, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) = 2s \sqrt{\frac{1}{2s}(\bm{\theta}-\bm{\theta}_0)^T \text{Cov}_{\nu} (\nabla_{\bm{\theta}} \ln p(\mathbf{y}|\mathbf{x},\bm{\theta}_0)) (\bm{\theta}-\bm{\theta}_0)}\,. \end{equation} Evaluating the expression on \(s = \tfrac{1}{n}\ln\tfrac{k^p}{\delta}\) and \(\bm{\theta} = \mathbf{0}\) concludes the proof. □
Proposition 5.30 Return to statement
If \(\nu(\mathbf{x}, \mathbf{y})\) is a uniformly strictly log-concave density and \(\nu(\mathbf{y}|\mathbf{x})\) is deterministic, then \(\exists M > 0\), such that, for any \(\bm{\theta} \in \bm{\Theta}\), \begin{equation} \textstyle \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq \sqrt{\tfrac{1}{n}\log\tfrac{k^p}{\delta}}\sqrt{M\mathbb{E}_\nu\Big[\big\Vert\nabla_{\mathbf{x}}\ell(\mathbf{y},\mathbf{x},\bm{\theta})\big\Vert_{2}^{2}\Big]}\,. \end{equation}
Proof. Using (Chafaı̈, 2004)’s Corollary 2.1 on \(\phi(f)=- \ln f\) and \(f_{\bm{\theta}}(y,\mathbf{x})=e^{-\lambda \ell(\mathbf{y}, \mathbf{x},\bm{\theta})}\); it verifies that \begin{equation} J_{\bm{\theta}}(\lambda) \leq M\lambda^2\ \mathbb{E}_\nu\Big[\big|\nabla_{x}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})\big|_{2}^{2}\Big]\,. \end{equation} Then, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) = \inf_{\lambda > 0} \frac{s + J_{\bm{\theta}}(\lambda)}{\lambda} \leq \inf_{\lambda > 0} \frac{s + M\lambda^2\mathbb{E}_\nu[|\nabla_{x}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})|^2]}{\lambda}\,. \end{equation} Deriving w.r.t. \(\lambda\) to compute the optimal value, gives \(\lambda_{inf} = \sqrt{\frac{s}{M \mathbb{E}_\nu[|\nabla_{x}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})|^2]}}\). Using this, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}}(s) \leq \sqrt{s}\sqrt{M \mathbb{E}_\nu[|\nabla_{x}\ell(\mathbf{y}, \mathbf{x},\bm{\theta})|^2]}\,. \end{equation} □
Proposition 5.32 Return to statement
Under Assumption 5.31, let \(\mathbf{x}_t\) denote the random variable denoting a transformation of an input \(\mathbf{x}_0\) using a composition of \(t\) transformations, then \begin{equation} MI(\mathbf{y}\, ; \mathbf{x}_{t})\leq MI(\mathbf{y}\, ; \mathbf{x}_{t-1}) \quad \forall t \in \mathbb{N}^+\,. \end{equation} As a consequence, \(MI(\mathbf{y}; \mathbf{x}) = MI(\mathbf{y}; \mathbf{x}_T) \leq MI(\mathbf{y}; \mathbf{x}_0)\).
Proof. According to Assumption 5.31, the targets \(\mathbf{y}\) are independent of the input \(\mathbf{x}_t\) given \(\mathbf{x}_{t-1}\), that is, \(\mathbf{y} \perp \mathbf{x}_t|\mathbf{x}_{t-1}\). This is a result of the fact that \(\mathbf{x}_{t-1}\) d-connects \(\mathbf{x}_t\) and \(\mathbf{y}\) in the graph representation of Assumption 5.31. The result follows then by the data processing inequality. □
Proposition 5.33 Return to statement
Under Assumption 5.31, if \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is convex under \(\mathbf{x}\) and \(\mathbb{E}_{g_t}[g_t(\mathbf{x})]=\mathbf{x}\), then, \(\forall\bm{\theta}\in \bm{\Theta},\ L^{\nu_{t+1}}(\bm{\theta})\geq L^{\nu_t}(\bm{\theta})\).
Proof. By definition, \(L^{\nu_{t+1}}(\bm{\theta})= \mathbb{E}_{\nu_{t+1}}[\ell(\mathbf{y},\mathbf{x}_{t+1}),\bm{\theta})] = \mathbb{E}_{\nu_t}\mathbb{E}_{g_{t+1}}[\ell(\mathbf{y},g_{t+1}(\mathbf{x}_t),\bm{\theta})]\). As \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is convex in \(\mathbf{x}\), the expectation under \(g_{t+1}\) can be moved inside \(\ell\): \begin{equation} L^{\nu_{t+1}}(\bm{\theta}) \geq \mathbb{E}_{\nu_t} [\ell(\mathbf{y},\mathbb{E}_{g_{t+1}}[g_{t+1}(\mathbf{x}_t)],\bm{\theta})]\,. \end{equation} Given that \(\mathbb{E}_{g_{t+1}}[g_{t+1}(\mathbf{x})]=\mathbf{x}\), it verifies that \(L^{\nu_{t+1}}(\bm{\theta}) \geq \mathbb{E}_{\nu_t}[\ell(\mathbf{y}, \mathbf{x}_t,\bm{\theta})] = L^{\nu_{t}}(\bm{\theta})\). □
Proposition 5.34 Return to statement
Under Assumption 5.31, if the change observed in the loss after a transformation \(\Delta(\mathbf{y},\mathbf{x},g,\bm{\theta}) = \ell(\mathbf{y},g(\mathbf{x}),\bm{\theta}) - \ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is statistically independent of the loss itself, \(\Delta(\mathbf{y},\mathbf{x},g,\bm{\theta})\perp \ell(\mathbf{y},\mathbf{x},\bm{\theta})\), then \begin{equation} {\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a)\leq {\cal I}^{\nu_{t}}_{\bm{\theta}}(a)\quad \forall a>0\,. \end{equation}
Proof. By definition, it verifies that \(J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t+1}}[e^{-\ell(\mathbf{y},\mathbf{x}_{t+1},\bm{\theta})}]\), where \(\nu_{t+1}\) can be expanded leading to \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}}\left[e^{-\lambda \ell(\mathbf{y},g_{t+1}(\mathbf{x}_{t}),\bm{\theta})}\right]\,. \end{equation} Using the decomposition of the loss, \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}}\left[e^{-\lambda (\ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})+\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta}))}\right]\,, \end{equation} which can be written as \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) = \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\left[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}\right]+\ln \mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}}\left[e^{-\lambda\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta})}\right]. \end{equation} Using Jensen’s inequality in the last term, \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) \geq \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\left[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}\right]-\lambda\mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}} [\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta})]\,, \end{equation} and using the definition of \(\Delta(\mathbf{y},\mathbf{x}_t,g_{t+1},\bm{\theta})\), \begin{equation} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) \geq \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}\left[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}\right]-\lambda\mathbb{E}_{\nu_{t}}\mathbb{E}_{g_{t+1}} [\ell(\mathbf{y},g_{t+1}(\mathbf{x}_t),\bm{\theta})] + \lambda\mathbb{E}_{\nu_{t}}[\ell(\mathbf{y},\mathbf{x}_t,\bm{\theta})]]\,. \end{equation} By definition of the expected loss, \begin{align} J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) &\geq \lambda L^{\nu_{t+1}}(\bm{\theta}) + \ln \mathbb{E}_{\nu_{t}}[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}]-\lambda L^{\nu_{t+1}}(\bm{\theta}) + \lambda L^{\nu_{t}}(\bm{\theta})\\ &= \ln \mathbb{E}_{\nu_{t}}[e^{-\lambda \ell(\mathbf{y},\mathbf{x}_{t},\bm{\theta})}]+\lambda L^{\nu_{t}}(\bm{\theta}) = J^{\nu_{t}}_{\bm{\theta}}(\lambda)\,. \end{align} From this inequality and by considering standard properties of the Legendre transform, it verifies that \(J^{\nu_{t+1}}_{\bm{\theta}}(\lambda) \geq J^{\nu_{t}}_{\bm{\theta}}(\lambda)\), which implies that \({\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a)\leq {\cal I}^{\nu_{t}}_{\bm{\theta}}(a)\). □
Proposition 5.36 Return to statement
Under Assumption 5.31, if a model \(\bm{\theta} \in \bm{\Theta}\) is \(G_t\)-invariant then, \begin{equation} L^{\nu_{t+1}}(\bm{\theta}) = L^{\nu_t}(\bm{\theta}) \quad \text{and} \quad {\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a) = {\cal I}^{\nu_{t}}_{\bm{\theta}}(a)\quad \forall a>0\,. \end{equation}
Proof. It immediately follows from the definition of model invariant (Definition 5.35) and the definitions of \(L^{\nu_{t+1}}(\bm{\theta})\), \(L^{\nu_t}(\bm{\theta})\), \({\cal I}^{\nu_{t+1}}_{\bm{\theta}}(a)\) and \({\cal I}^{\nu_{t}}_{\bm{\theta}}\). □
Theorem 5.37 Return to statement
Under Assumption 5.31, it verifies that \begin{equation} \forall \bm{\theta}\in\bm{\Theta},\quad L^{\nu_T,\ell}(\bm{\theta}) =L^{\nu_0,\ell_G}(\bm{\theta})\quad and \quad \mathcal{I}^{\nu_T,\ell}_{\bm{\theta}}(a)\leq \mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}(a) \,\,\, \forall a>0\,. \end{equation}
Proof. Under Assumption 5.31, it is clear that \begin{equation} L^{v_0, \ell_G}(\bm{\theta}) = \mathbb{E}_{\nu_0}[\ell_G(\mathbf{y}, \mathbf{x}, \bm{\theta}))] = \mathbb{E}_{\nu_0}\mathbb{E}_{g\sim h}[\ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta}))] = \mathbb{E}_{\nu}[\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))] = L^{\nu_T, \ell}(\bm{\theta})\,. \end{equation}
On the other hand, from the definition of the cummulant function, \(J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu_T}[e^{-\lambda\ell(\mathbf{y},\mathbf{x}, \bm{\theta})}] + \lambda\mathbb{E}_{\nu_T}[\ell (\mathbf{y},\mathbf{x}, \bm{\theta})]\). Where by definition \begin{equation} J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu_0} \mathbb{E}_{g\sim h}[e^{-\lambda\ell(\mathbf{y},g(\mathbf{x}), \bm{\theta})}] + \lambda\mathbb{E}_{\nu_0} \mathbb{E}_{g\sim h}[\ell (\mathbf{y},g(\mathbf{x}), \bm{\theta})]\,, \end{equation} where expectations can be exchanged as \begin{equation} J_{\bm{\theta}}(\lambda) = \ln \mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\mathbb{E}_{g}[e^{-\lambda\ell(\mathbf{y},g(\mathbf{x}), \bm{\theta})}] + \lambda\mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\mathbb{E}_{g}[\ell (\mathbf{y},g(\mathbf{x}), \bm{\theta})]\,. \end{equation} Applying Jensen’s inequality to the exponential, \begin{equation} J_{\bm{\theta}}(\lambda) \geq \ln \mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\left[e^{-\lambda\mathbb{E}_{g}\left[\ell(\mathbf{y},g(\mathbf{x}), \bm{\theta})\right]}\right] + \lambda\mathbb{E}_{\nu_0(\mathbf{x})}\mathbb{E}_{\nu(\mathbf{y}|\mathbf{x})}\mathbb{E}_{g}[\ell (\mathbf{y},g(\mathbf{x}), \bm{\theta})]\,. \end{equation} Where by definition of \(\ell_G\), \begin{equation} J_{\bm{\theta}}(\lambda) \geq \ln \mathbb{E}_{\nu_0}[e^{-\lambda[\ell_{G}(\mathbf{y},\mathbf{x}, \bm{\theta})]}] + \lambda\mathbb{E}_{\nu_0}[\ell_{G} (\mathbf{y},\mathbf{x}, \bm{\theta})] = J^{\nu_0, \ell_G}_{\bm{\theta}}(\lambda)\,. \end{equation} From this point, the inequality regarding the inverse rate is clear from the definition. □
Corollary 5.38 Return to statement
With h.p. \(1 - \delta\) over \(D_0 \sim \nu^n_0\), for all \(\bm{\theta}\in \bm{\Theta}\), simultaneously, \begin{equation} \text{if } \hat{L}^{\ell_G}(D_0, \bm{\theta}) \leq \epsilon \quad\text{then}\quad \big(\mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}\big)^{-1}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}) \leq L^{\nu_T,\ell}(\bm{\theta}) \leq \big(\mathcal{I}^{\nu_0,\ell_G}_{\bm{\theta}}\big)^{-1}(\textstyle \tfrac{1}{n}\log\tfrac{k^p}{\delta}) + \epsilon\,. \end{equation}
Proof. Under Assumption 5.31, it is clear that \begin{equation} L^{v_0, \ell_G}(\bm{\theta}) = \mathbb{E}_{\nu_0}[\ell_G(\mathbf{y}, \mathbf{x}, \bm{\theta}))] = \mathbb{E}_{\nu_0}\mathbb{E}_{g\sim h}[\ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta}))] = \mathbb{E}_{\nu}[\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}))] = L^{\nu_T, \ell}(\bm{\theta})\,. \end{equation} Where it verifies that \(L(\bm{\theta}) := L^{\nu_T, \ell}(\bm{\theta})\). Thus, the result comes from the application of Theorem 5.17. □
Proposition 5.39 Return to statement
Under Assumption 5.31, if the transformations \(G\) define a group and its probability distribution \(h\) is uniform, then \begin{equation} \hat{L}^{\ell_{G}}(D_T, \bm{\theta})=\hat{L}^{\ell_{G}}(D_0, \bm{\theta})\,, \end{equation} where \(D_0\) is any un-transformed dataset and \(D_T\) is the corresponding transformed dataset obtained by applying a transformation \(g\sim h\) to each of the samples in \(D_0\).
Proof. On one hand, by definition of \(\ell_G\) it verifies that \begin{align} \hat{L}^{\ell_G}(D_T, \bm{\theta}) &= \frac{1}{n}\sum_i \mathbb{E}_{g' \sim h}[\ell(\mathbf{y}_i, g'(\mathbf{x}_i),\bm{\theta})] = \frac{1}{n}\sum_i \int h(g') \ell(\mathbf{y}_i, g'(g_i(x_{0,i})),\bm{\theta})\ dg'\,. \end{align} As the set of transformations define a group, for each transformation \(g_i\) that generated each input, there exists \(g_i^{-1}\). Furthermore, there exists \(l := g' \circ g_i \in G_1\) such that \(g' = l\circ g_i^{-1}\). Thus, \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta}) = \frac{1}{n}\sum_i \int_{g' = l \circ g_i^{-1}} h(l \circ g_i^{-1}) \ell(\mathbf{y}_i,l \circ g_i^{-1}\circ g_i(x_{0,i}),\bm{\theta})\ dg'\,. \end{equation} Then, using a change of variables \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta}) = \frac{1}{n}\sum_i \int_{l} h(l \circ g_i^{-1}) \ell(\mathbf{y}_i,l(x_{0,i}),\bm{\theta})\ dl\,. \end{equation} As \(h\) is uniform \(h(l \circ g_i^{-1}) = h(l)\), \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta}) = \frac{1}{n}\sum_i \int_{l} h(l)\ell(\mathbf{y}_i,l(x_{0,i}),\bm{\theta})dl =\hat{L}^{\ell_G}(D_0, \bm{\theta})\,. \end{equation} □
Proposition 5.40 Return to statement
Given an untransformed dataset \(D_0\), for any transformed dataset \(D_T\) derived from \(D_0\) under Assumption 5.31, if \begin{equation} m_{\bm{\theta}}:=\essinf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, g(\mathbf{x}),\bm{\theta}) \quad \forall g \in G\,, \end{equation} and \(\forall g \in G\), \(\exists g^{-1} \in G\). Then, it verifies that \begin{equation} \forall \bm{\theta}\in\bm{\Theta} \quad \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \hat{L}^\ell(D_0, \bm{\theta})= m_{\bm{\theta}}\,. \end{equation}
Proof. First of all, by definition we got that \(m_{\bm{\theta}} = ess\inf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta})\). Then, it verifies that \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell_G(\mathbf{y}, \mathbf{x}, \bm{\theta}) = ess\inf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall (\mathbf{x}, \mathbf{y}) \in D_T\,. \end{equation} Thus, by definition of \(\ell_G\): \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \mathbb{E}_{g}[\ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta})] = ess\inf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall (\mathbf{x}, \mathbf{y}) \in D_T\,. \end{equation} Now, using that \(ess\inf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, g(\mathbf{x}),\bm{\theta}) = m_{\bm{\theta}} \ \forall g \in G\), reaching the essential infimum in expectation means all the losses inside the expectation reach such infimum: \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell(\mathbf{y}, g(\mathbf{x}), \bm{\theta}) = ess\inf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall g\in G, \forall (\mathbf{x}, \mathbf{y}) \in D_T\,. \end{equation} Using that for every \((\mathbf{x}_0, \mathbf{y}) \in D_0\) exists \(g' \in G\) associated to that input such that \(\mathbf{x} = g'(\mathbf{x}_0)\) and that \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell(\mathbf{y}, g\circ g'(\mathbf{x}_0), \bm{\theta}) = ess\inf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall g\in G, \forall (\mathbf{x}_0, \mathbf{y}) \in D_0\,. \end{equation} Then, using that there exists \(g'^{-1} \in G\), we got that \begin{equation} \hat{L}^{\ell_G}(D_T, \bm{\theta})= m_{\bm{\theta}} \iff \ell(\mathbf{y}, \mathbf{x}_0, \bm{\theta}) = ess\inf_{(\mathbf{x}, \mathbf{y}) \sim \nu_T} \ell(\mathbf{y}, \mathbf{x},\bm{\theta}) \quad \forall (\mathbf{x}_0, \mathbf{y}) \in D_0\,. \end{equation} As a result, the empirical loss \(\hat{L}^{\ell}(D_0, \bm{\theta})\) is equal to \(m_{\bm{\theta}}\) too. □
Theorem 5.41 Return to statement
For any \(\epsilon\in(0, L^\star)\) and any \(\delta \in (0,1)\), with high probability \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \text{if} \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad \text{then}\quad p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star-\epsilon) + \log\delta}{\log k}\,. \end{equation}
Proof. Due to fact that \(L(D,\bm{\theta})\leq \epsilon\) and to Theorem 5.17, we have that with high probability \(1-\delta\) over \(D\sim\nu^n(y,x)\), \begin{equation} L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} Then, just rearranging terms as follows, we get \(L(\bm{\theta})-\epsilon \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Applying the rate function at both sides, \begin{equation} \mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) \leq \tfrac{1}{n}\ln\tfrac{k^p}{\delta} = \frac{1}{n}\left(\ln k^p - \ln \delta \right)\,. \end{equation} Then, we got \(\ln k^p \geq n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta\) and \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\,. \end{equation} As \(L^\star\leq L(\bm{\theta})\) and the rate function is monotonically increasing, then \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star - \epsilon) + \ln\delta}{\ln k}\,. \end{equation} □
Corollary 5.42 Return to statement
If \(\ell(\mathbf{y},\mathbf{x},\bm{\theta})\) is Lipschitz w.r.t. \(\mathbf{x}\) with constant \(Lip(\bm{\theta})\) and satisfies a \(c\)-isoperimetry assumption, then for any \(\epsilon\in(0, L^\star)\) and any \(\delta \in (0,1)\), with high probability \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \text{if} \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad \text{then}\quad Lip(\bm{\theta})\geq \sqrt{\tfrac{nd} {2c(p\log k - \log \delta)}}(L^\star - \epsilon) \,. \end{equation}
Proof. Under the isoperimetry assumption, we have \(\mathcal{I}_{\bm{\theta}}(a) \leq \frac{d a^2}{2 c Lip(\bm{\theta})^2 }\). Due to fact that \(L(D,\bm{\theta})\leq \epsilon\) and to Theorem 5.17, we have that with high probability \(1-\delta\) over \(D\sim\nu^n(y,x)\), \begin{equation} L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\,. \end{equation} Then, just rearranging terms as follows, we get \(L(\bm{\theta})-\epsilon \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Applying the rate function at both sides, \begin{equation} \mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) \leq \tfrac{1}{n}\ln\tfrac{k^p}{\delta} = \frac{1}{n}\left(\ln k^p - \ln \delta \right)\,. \end{equation} Then, we got \(\ln k^p \geq n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta\) and \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\,. \end{equation} As \(L^\star\leq L(\bm{\theta})\) and the rate function is monotonically increasing, then \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star - \epsilon) + \ln\delta}{\ln k}\,. \end{equation} Using the upper bound over the rate function we obtained before, we got that \begin{equation} p \geq \frac{n\frac{d(L^\star - \epsilon)^2}{2cLip(\bm{\theta})^2} + \ln\delta}{\ln k}\,. \end{equation} Re-arranging terms, \begin{equation} \sqrt{\frac{nd(L^\star - \epsilon)^2} {2c(p\ln k - \ln \delta)}} \leq Lip(\bm{\theta})\,. \end{equation} □
Corollary 5.43 Return to statement
If the loss function \(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\) is Lipschitz w.r.t. \(\bm{\theta}\) with constant \(M > 0\), then, for any \(\bm{\theta}_0 \in \bm{\Theta}_0 = \{\bm{\theta} \in \bm{\Theta} \ | \ \mathbb{V}_{\nu}(\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})) = 0\} \subset \bm{\Theta}\), any \(\epsilon\in(0, L^\star)\) and any \(\delta \in (0,1)\), with high probability \(1-\delta\) over \(D\sim\nu^n\), for all \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \text{if} \quad \hat{L}(D, \bm{\theta})\leq \epsilon \quad \text{then}\quad \|\bm{\theta}-\bm{\theta}_0\|_2\geq \sqrt{\tfrac{n}{8M(p\log k - \log \delta)}} (L^\star - \epsilon)\,. \end{equation}
Proof. If the loss is Lipschitz continuous with constant \(M\), \(\forall y,x,\bm{\theta}\quad \|\nabla_{\bm{\theta}} \ell(\mathbf{y},\mathbf{x},\bm{\theta})\|^2_2\leq M\). Then, \(J_{\bm{\theta}}(\lambda)\) verifies \begin{equation} \|\nabla_{\bm{\theta}}J_{\bm{\theta}}(\lambda)\|^2_2=||-\lambda \mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] + \lambda \mathbb{E}_\nu\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta}) \right]||_2^2 \leq 2M \lambda^2\,. \end{equation} where \(\mathbb{E}_{\nu p^\lambda}\left[\nabla_{\bm{\theta}}\ell(\mathbf{y}, \mathbf{x}, \bm{\theta})\right] = \frac{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda \ell(\mathbf{y}, \mathbf{x}, \bm{\theta})]}{\mathbb{E}_{\nu}[p(\mathbf{y}|\mathbf{x}, \bm{\theta})^\lambda]}\). With this, we have \begin{equation} |J_{\bm{\theta}}(\lambda) - J_{\bm{\theta}_0}(\lambda)|\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2 \implies J_{\bm{\theta}}(\lambda)\leq 2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2\,. \end{equation} Then, for any \(a \geq 0\), we have by definition of the rate function that \begin{equation} \mathcal{I}_{\bm{\theta}}(a) \geq a\lambda - J_{\bm{\theta}}(\lambda) \geq a\lambda-2M \lambda^2\|\bm{\theta}-\bm{\theta}_0\|^2_2\,. \end{equation} As the inequality holds for any \(\lambda > 0\), maximizing it raises \(\lambda^\star = \frac{a}{4M\|\bm{\theta}-\bm{\theta}_0\|^2_2}\). Thus, \begin{equation} \mathcal{I}_{\bm{\theta}}(a) \geq a\lambda - J_{\bm{\theta}}(\lambda) \geq a\frac{a}{4M\|\bm{\theta}-\bm{\theta}_0\|^2_2}-2M \frac{a^2}{(4M\|\bm{\theta}-\bm{\theta}_0\|^2_2)^2}\|\bm{\theta}-\bm{\theta}_0\|^2_2 = \frac{a^2}{8M\|\bm{\theta}-\bm{\theta}_0\|^2_2}\,. \end{equation} Due to fact that \(L(D,\bm{\theta})\leq \epsilon\) and to Theorem 5.17, we have that with high probability \(1-\delta\) over \(D\sim\nu^n(\mathbf{x}, \mathbf{y})\), \(L(\bm{\theta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Then, just rearranging terms as follows, we get \(L(\bm{\theta})-\epsilon \leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^p}{\delta})\). Applying the rate function at both sides, \begin{equation} \mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) \leq \tfrac{1}{n}\ln\tfrac{k^p}{\delta} = \frac{1}{n}\left(\ln k^p - \ln \delta \right)\,. \end{equation} Then, we got \(\ln k^p \geq n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta\) and \(p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\). As \(L^\star\leq L(\bm{\theta})\) and the rate function is monotonically increasing, then \begin{equation} p \geq \frac{n\mathcal{I}_{\bm{\theta}}(L(\bm{\theta})-\epsilon) + \ln\delta}{\ln k}\geq \frac{n\mathcal{I}_{\bm{\theta}}(L^\star - \epsilon) + \ln\delta}{\ln k}\,. \end{equation} Using the upper bound over the rate function we obtained before, we got that \begin{equation} p \geq \frac{n\frac{(L^\star - \epsilon)^2}{8M\|\bm{\theta}-\bm{\theta}_0\|^2_2} + \ln\delta}{\ln k} \implies (L^\star - \epsilon)\sqrt{\frac{n}{8M(p\ln k - \ln \delta)}} \leq \|\bm{\theta}-\bm{\theta}_0\|_2\,. \end{equation} □
Corollary 5.25 Return to statement
Let \(\bm{\Theta}\subset \bm{\Theta}'\) be two nested model classes with \(p < p'\) parameters respectively. For any \(\epsilon>0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), for any \(\bm{\theta}'\in\bm{\Theta}'\), \(\bm{\theta}\in\bm{\Theta}\), simultaneously, \begin{equation} \begin{aligned} \hat{L}(D, \bm{\theta}')\leq\epsilon, \ \hat{L}(D, \bm{\theta})\leq&\,\epsilon \text{ and } L(\bm{\theta}') + \epsilon < L(\bm{\theta})\\ &\Downarrow\\ \bm{\theta} \text{ is not } \mathcal{I}_{\bm{\theta}}^{-1}\big(\tfrac{1}{n}\log&\tfrac{k^{p'}}{\delta}\big)\text{-smoother than } \bm{\theta}'\,. \end{aligned} \end{equation}
Proof. We can apply Theorem 5.17 on \(\bm{\theta}\) assuming that \(\bm{\theta}\in\bm{\Theta}'\), because we have that \(\bm{\Theta}\subset \bm{\Theta}'\). In consequence, with h.p., we have \begin{equation} L(\bm{\theta}) \leq L(D,\bm{\theta}) + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta}) \end{equation} Using the fact that \(L(D,\bm{\theta})\leq\epsilon\), and that the inverse rate is bounded by the expected loss, we arrive to the following inequality \begin{equation} L(\bm{\theta}) \leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \epsilon + L(\bm{\theta}) \end{equation} The same reasoning applies for \(\bm{\theta}'\): \begin{equation} L(\bm{\theta}') \leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \epsilon + L(\bm{\theta}') \end{equation} Using the theorem’s premise, \(L(\bm{\theta}') + \epsilon< L(\bm{\theta})\), we can chain the last two h.p. upper bounds. And note that we can do that with no extra cost, as both apply on the same bigger model class \(\bm{\Theta}'\). That is, it is the same upper bound, which holds simultaneously for all models within \(\bm{\Theta}'\), used on two different models \(\bm{\theta}\) and \(\bm{\theta}'\). Then, we have, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})< \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\,. \end{equation} Naming \(a=\mathcal{I}^{-1}_{\bm{\theta}}(\frac{1}{n}\ln\frac{k^{p'}}{\delta})\), such that \(\frac{1}{n}\ln\frac{k^{p'}}{\delta}=\mathcal{I}_{\bm{\theta}}(a)\), we got that \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\mathcal{I}_{\bm{\theta}}(a))< \mathcal{I}^{-1}_{\bm{\theta}}(\mathcal{I}_{\bm{\theta}}(a))= a\,. \end{equation} Applying \(\mathcal{I}_{\bm{\theta}'}(\cdot)\) at both sides: \begin{equation} \mathcal{I}_{\bm{\theta}'}(\mathcal{I}^{-1}_{\bm{\theta}'}(\mathcal{I}_{\bm{\theta}}(a))) = \mathcal{I}_{\bm{\theta}}(a) < \mathcal{I}_{\bm{\theta}'}(a)\,. \end{equation} Then, we can negate the inverse statement using \(\neg\) notation as \(\neg\Big[ \mathcal{I}_{\bm{\theta}}(a)\geq \mathcal{I}_{\bm{\theta}'}(a)\Big]\). Then we have that \(\bm{\theta}\) is not \(\beta\)-smoother than \(\bm{\theta}'\) for any \(\beta\geq a\). Which is equivalent to say that, \(\bm{\theta}\) is not \(\beta\)-smoother than \(\bm{\theta}'\) for any \(\beta\) such that \(\mathcal{I}_{\bm{\theta}}(\beta)\geq \frac{1}{n}\ln\frac{k^{p'}}{\delta}\). □
Theorem 5.24 Return to statement
Let \(\bm{\Theta}\subset \bm{\Theta}'\) be two nested model classes with \(p < p'\) parameters respectively. For any \(\epsilon>0\), with h.p. \(1-\delta\) over \(D\sim\nu^n\), for any \(\bm{\theta}'\in\bm{\Theta}'\), \(\bm{\theta}\in\bm{\Theta}\), simultaneously \begin{equation} \begin{aligned} \text{ if } \hat{L}(D, \bm{\theta}')\leq\epsilon \,,\, \hat{L}(D, \bm{\theta}) \leq \epsilon \, \text{ and }\, \bm{\theta}' \, &\text{ is } \,\mathcal{I}^{-1}_{\bm{\theta}'}\big(\tfrac{1}{n}\log \tfrac{k^{p'}}{\delta}\big)\text{-smoother than } \, \bm{\theta}\\ &\Downarrow\\ L(\bm{\theta}')\leq &\, L(\bm{\theta})+\epsilon\,. \end{aligned} \end{equation}
Proof. If \(\bm{\theta}'\) is \(\beta\)-smoother than \(\bm{\theta}\), by Definition 5.21, \(\forall a\in(0,\beta] \quad \mathcal{I}_{\bm{\theta}'}(a)\geq \mathcal{I}_{\bm{\theta}}(a)\) where \(\beta = \mathcal{I}^{-1}_{\bm{\theta}'}(\frac{1}{n}\ln\frac{k^{p'}}{\delta})\). Then, we have that \(\mathcal{I}^{-1}_{\bm{\theta}'}(s)\leq \mathcal{I}^{-1}_{\bm{\theta}}(s)\) for \(s=\mathcal{I}_{\bm{\theta}'}(\beta) = \mathcal{I}_{\bm{\theta}'}\Big(\mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\Big) = \tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta}\). Thus, \begin{equation} \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq L(\bm{\theta})\,. \end{equation} Where we used that the inverse rate is upper bounded by \(L(\bm{\theta})\). By Theorem 5.17, because \(L(D,\bm{\theta})\leq \epsilon\), we have with h.p. \(1-\delta\) over \(D\sim\nu^n\), \begin{equation} L(\bm{\theta}')\leq L(D,\bm{\theta}') + \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta})\,. \end{equation} By combining the last two inequalities, we have \begin{equation} L(\bm{\theta}')\leq \epsilon + \mathcal{I}^{-1}_{\bm{\theta}'}(\tfrac{1}{n}\ln\tfrac{k^{p'}}{\delta}) \leq \epsilon + L(\bm{\theta})\,. \end{equation} □
The Implicit Bias of Stochastic Gradient Descent
Proposition 5.45 Return to statement
For any \(D\sim\nu^n\) and any \(\bm{\theta} \in \bm{\Theta}\), it verifies that \begin{equation} \hat{L}(D, \bm{\theta}) = L(\bm{\theta}) - \mathcal{I}^{-1}_{\bm{\theta}}(\alpha(D, \bm{\theta}))\,, \end{equation} where \(\alpha:{\cal D}\times\bm{\Theta} \to \mathbb{R}\) is defined as \(\alpha(D, \bm{\theta}) := \mathcal{I}_{\bm{\theta}}(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}))\).
Proof. Direct consequence of the signed rate function being invertible in \(\mathbb{R}\). □
Theorem 5.46 Return to statement
For any \(\bm{\theta} \in \bm{\Theta}\), \(n>0\), and \(D \sim \nu^n\), the cumulative distribution of \(\alpha(D, \bm{\theta})\) satisfies \begin{align} &\forall s > 0 \quad \mathbb{P}_{D\sim \nu^n}\big(\alpha(D, \bm{\theta}) \geq s\big)\leq e^{-n |s|}\,,\\ &\forall s < 0 \quad \mathbb{P}_{D\sim \nu^n}\big(\alpha(D, \bm{\theta}) \leq s\big)\leq e^{-n |s|}\,, \end{align} and both inequalities are asymptotically tight, \begin{align} &\forall s > 0 \quad \mathbb{P}_{D\sim \nu^n}(\alpha(D, \bm{\theta})\geq s) \asymp e^{- n |s|}\,,\\ &\forall s < 0 \quad \mathbb{P}_{D\sim \nu^n}(\alpha(D, \bm{\theta})\leq s) \asymp e^{- n|s|}\,. \end{align}
Proof. From Chernoff’s bound, it verifies that \begin{align} &\forall a \geq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq a\big)\leq e^{-n |\mathcal{I}_{\bm{\theta}}(a)|},\\ &\forall a \leq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \leq a\big)\leq e^{-n |\mathcal{I}_{\bm{\theta}}(a)|}\,. \end{align} As a result, for any value of \(s \in \mathbb{R}\), taking \(a = \mathcal{I}^{-1}_{\bm{\theta}}(s)\), we got \begin{align} &\forall s \geq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \geq \mathcal{I}^{-1}_{\bm{\theta}}(s)\big)\leq e^{-n |s|},\\ &\forall s \leq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(L(\bm{\theta}) - \hat{L}(D, \bm{\theta}) \leq \mathcal{I}^{-1}_{\bm{\theta}}(s)\big)\leq e^{-n |s|}\,. \end{align} As \(\mathcal{I}_{\bm{\theta}}(\cdot)\) is a strictly monotonic and increasing function, we can apply it at both sides of the inequality inside the probability, giving us: \begin{align} &\forall s \geq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(\mathcal{I}_{\bm{\theta}}(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})) \geq s\big)\leq e^{-n |s|},\\ &\forall s \leq 0, \quad \mathbb{P}_{D\sim\nu^n}\big(\mathcal{I}_{\bm{\theta}}(L(\bm{\theta}) - \hat{L}(D, \bm{\theta})) \leq s\big)\leq e^{-n |s|}\,. \end{align} The asymptotic inequalities can be obtained by applying the same reasoning to Equation \(\eqref{eq:asympoticEquality}\), which is a direct consequence of Cramér’s Theorem. □