Uncertainty Estimation and Generalization Bounds for Modern Deep Learning

Abbasi-Yadkori, Y., Bartlett, P. L., Kanade, V., Seldin, Y., & Szepesvári, C. (2013). Online learning in Markov decision processes with adversarially chosen transition probability distributions. Advances in Neural Information Processing Systems (NeurIPS).

Abbasi-Yadkori, Y., Bartlett, P., Gabillon, V., Malek, A., & Valko, M. (2018). Best of both worlds: Stochastic & adversarial best-arm identification. Proceedings of the Conference on Learning Theory (COLT).

Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems (NeurIPS).

Abbasi-Yadkori, Y., & Szepesvári, C. (2011). Regret bounds for the adaptive control of linear quadratic systems. Proceedings of the Conference on Learning Theory (COLT).

Abbeel, P., Koller, D., & Ng, A. Y. (2006). Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research.

Abernethy, J., Hazan, E., & Rakhlin, A. (2008). Competing in the dark: An efficient algorithm for bandit linear optimization. Proceedings of the Conference on Learning Theory (COLT).

Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012). Learning from data. AMLbook.

Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2015). Learning from data. Dynamic e-chapters. AMLbook.

Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. arXiv Preprint arXiv:0710.3742.

Adi, Y., Schwing, A., & Hazan, T. (2020). PAC-Bayesian neural network bounds. https://openreview.net/forum?id=HkgR8erKwB

Agarwal, A., Dudík, M., Kale, S., Langford, J., & Schapire, R. E. (2012). Contextual bandit learning with predictable rewards. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., & Schapire, R. E. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. Proceedings of the International Conference on Machine Learning (ICML).

Agarwal, A., Krishnamurthy, A., Langford, J., Luo, H., & Schapire, R. E. (2017a). Open problem: First-order regret bounds for contextual bandits. Proceedings of the Conference on Learning Theory (COLT).

Agarwal, A., Luo, H., Neyshabur, B., & Schapire, R. E. (2017b). Corralling a band of bandit algorithms. Proceedings of the Conference on Learning Theory (COLT).

Aggarwal, C. C. (2007). Data streams: Models and algorithms (Vol. 31). Springer Science & Business Media.

Aggarwal, C. C. (2013). Managing and mining sensor data. Springer Science & Business Media.

Ahissar, M., & Hochstein, S. (2004). The reverse hierarchy theory of visual perceptual learning. Trends in Cognitive Sciences, 8(10), 457–464.

Ahmed, A., Ho, Q., Teo, C. H., Eisenstein, J., Smola, A. J., & Xing, E. P. (2011). Online inference for the infinite topic-cluster model: Storylines from streaming text. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 101–109.

Ailon, N., Karnin, Z., & Joachims, T. (2014). Reducing dueling bandits to cardinal bandits. Proceedings of the International Conference on Machine Learning (ICML).

Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., & Watson, J. D. (1994). Molecular biology of the cell (3rd ed.). Garland Publishing.

Allenby, G. M., & Rossi, P. E. (1998). Marketing models of consumer heterogeneity. Journal of Econometrics, 89(1-2), 57–78.

Alon, N., Cesa-Bianchi, N., Gentile, C., & Mansour, Y. (2013). From bandits to experts: A tale of domination and independence. Advances in Neural Information Processing Systems (NeurIPS).

Alquier, P. (2024). User-friendly introduction to PAC-Bayes bounds. Foundations and Trends in Machine Learning, 17(2), 174–303. https://doi.org/10.1561/2200000100

Alquier, P., & Guedj, B. (2018). Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5), 887–902.

Alquier, P., Ridgway, J., & Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1), 8374–8414.

Alter, O., Brown, P. O., & Botstein, D. (2003). Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences.

Álvarez, M. A., & Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12, 1459–1500.

Ambroladze, A., Parrado-Hernández, E., & Shawe-Taylor, J. (2007). Tighter PAC-Bayes bounds. Advances in Neural Information Processing Systems (NeurIPS).

Aminikhanghahi, S., & Cook, D. J. (2017). A survey of methods for time series change point detection. Knowledge and Information Systems, 51(2), 339–367.

Angluin, D. (2004). Queries revisited. Theoretical Computer Science, 313.

Anjos, O., Iglesias, C., Peres, F., Martínez, J., Garcia, A., & Taboada, J. (2015). Neural networks applied to discriminate botanical origin of honeys. Food Chemistry, 175, 128–136.

Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge University Press.

Antorán, J., Padhy, S., Barbano, R., Nalisnick, E. T., Janz, D., & Hernández-Lobato, J. M. (2023). Sampling-based inference for large linear models, with application to linearised Laplace. International Conference on Learning Representations.

Antos, A., & Kontoyiannis, I. (2001). Convergence properties of functional estimates for discrete distributions. Random Structures and Algorithms, 19(3-4).

Apostolico, A., & Bejerano, G. (2000). Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. Journal of Computational Biology, 7(3), 381–393.

Arimoto, S. (1972). An algorithm for computing the capacity of discrete memoryless channel. IEEE Transactions on Information Theory, 18.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.

Arora, S., Cohen, N., & Hazan, E. (2018a). On the optimization of deep networks: Implicit acceleration by overparameterization. International Conference on Machine Learning, 244–253.

Arora, S., Ge, R., Neyshabur, B., & Zhang, Y. (2018b). Stronger generalization bounds for deep nets via a compression approach. International Conference on Machine Learning, 254–263.

Arora, S., Liang, Y., & Ma, T. (2016). Why are deep nets reversible: A simple theory, with implications for training. https://arxiv.org/abs/1511.05653

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., & Sherlock, G. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29.

Asmuth, J., Li, L., Littman, M. L., Nouri, A., & Wingate, D. (2009). A Bayesian sampling approach to exploration in reinforcement learning. Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Asuncion, A., & Newman, D. J. (2007a). UCI machine learning repository.

Asuncion, A., & Newman, D. J. (2007b). UCI machine learning repository. University of California, Irvine, School of Information & Computer Sciences. www.ics.uci.edu/~mlearn/MLRepository.html

Athreya, K. B., & Lahiri, S. N. (2006). Measure theory and probability theory. Springer.

Attwood, T. K., Croning, M. D., Flower, D. R., Lewis, A. P., Mabey, J. E., Scordis, P., Selley, J. N., & Wright, W. (2000). PRINTS-S: The database formerly known as PRINTS. Nucleic Acids Research, 28(1), 225–227.

Audibert, J.-Y., Munos, R., & Szepesvári, C. (2009). Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science.

Audibert, J.-Y., & Bousquet, O. (2007). Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research.

Audibert, J.-Y., & Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. Proceedings of the Conference on Learning Theory (COLT).

Audibert, J.-Y., & Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11.

Audibert, J.-Y., Bubeck, S., & Munos, R. (2010). Best arm identification in multi-armed bandits. Proceedings of the Conference on Learning Theory (COLT).

Auer, P. (2002). Using confidence bounds for exploration-exploitation trade-offs. Journal of Machine Learning Research, 3.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47.

Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. Annual IEEE Symposium on Foundations of Computer Science.

Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1).

Auer, P., Cesa-Bianchi, N., & Gentile, C. (2002c). Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64.

Auer, P., & Chiang, C.-K. (2016). An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. Proceedings of the Conference on Learning Theory (COLT).

Auer, P., & Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2), 55–65.

Avner, O., Mannor, S., & Shamir, O. (2012). Decoupling exploration and exploitation in multi-armed bandits. Proceedings of the International Conference on Machine Learning (ICML).

Azuma, K. (1967). Weighted sums of certain dependent random variables. Tôhoku Mathematical Journal, 19(3).

Bach, F. (2017). On the equivalence between quadrature rules and random features. Advances in Neural Information Processing Systems 30, 456–467.

Badanidiyuru, A., Kleinberg, R., & Slivkins, A. (2013). Bandits with knapsacks. Annual IEEE Symposium on Foundations of Computer Science.

Bairoch, A., & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28(1), 45–48.

Baldi, P., Sadowski, P., & Whiteson, D. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5(1), 1–9.

Balog, M., Salakhutdinov, R., & Ghahramani, Z. (2016). Mondrian forests for large-scale regression when uncertainty matters. International Conference on Artificial Intelligence and Statistics (AISTATS), 1119–1127.

Banerjee, A. (2006a). On Bayesian bounds. Proceedings of the International Conference on Machine Learning (ICML).

Banerjee, A. (2006b). On Bayesian bounds. International Conference on Machine Learning, 81–88.

Banerjee, A., Dhillon, I. S., Ghosh, J., Merugu, S., & Modha, D. S. (2007). A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. Journal of Machine Learning Research, 8.

Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge University Press.

Barnard, K., Duygulu, P., & Forsyth, D. (2002). Modeling the statistics of image features and associated text. SPIE Electronic Imaging 2002, Document Recognition and Retrieval IX.

Barndorff-Nielsen, O. (2014). Information and exponential families: In statistical theory. John Wiley & Sons.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory.

Barron, A., Rissanen, J., & Yu, B. (1998). The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44, 2743–2760.

Bartlett, P. L., Boucheron, S., & Lugosi, G. (2001). Model selection and error estimation. Machine Learning.

Bartlett, P. L., Collins, M., Taskar, B., & McAllester, D. (2005). Exponentiated gradient algorithms for large-margin structured classification. Advances in Neural Information Processing Systems (NeurIPS).

Bartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30.

Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1), 2285–2301.

Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48), 30063–30070.

Bartlett, P. L., & Mendelson, S. (2001). Rademacher and Gaussian complexities: Risk bounds and structural results. Proceedings of the Conference on Learning Theory (COLT).

Bartlett, P. L., Montanari, A., & Rakhlin, A. (2021). Deep learning: A statistical viewpoint. Acta Numerica, 30, 87–201.

Bartlett, P. L., & Tewari, A. (2009). REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Bartlett, P., Maiorov, V., & Meir, R. (1998). Almost linear VC dimension bounds for piecewise polynomial networks. Advances in Neural Information Processing Systems, 11.

Bartók, G., Foster, D., Pál, D., Rakhlin, A., & Szepesvári, C. (2014). Partial monitoring – classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4).

Bartók, G., Pál, D., & Szepesvári, C. (2011). Minimax regret of finite partial-monitoring games in stochastic environments. Proceedings of the Conference on Learning Theory (COLT).

Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L., & Sonnhammer, E. L. (2000). The Pfam protein families database. Nucleic Acids Research, 28(1), 263–266.

Bauer, M., Wilk, M. van der, & Rasmussen, C. E. (2016). Understanding probabilistic sparse Gaussian process approximations. Advances in Neural Information Processing Systems, 29, 1533–1541.

Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference [PhD thesis]. Gatsby Computational Neuroscience Unit, University College London.

Becker, R. A. (2012). The variance drain and Jensen’s inequality. CAEPR Working Paper No. 2012-004. http://dx.doi.org/10.2139/ssrn.2027471

Behboodi, A., Cesa, G., & Cohen, T. S. (2022). A PAC-Bayesian generalization bound for equivariant networks. Advances in Neural Information Processing Systems, 35, 5654–5668.

Bejerano, G., Seldin, Y., Margalit, H., & Tishby, N. (2001). Markovian domain fingerprinting: Statistical segmentation of protein sequences. Bioinformatics, 17(10), 927–934.

Bejerano, G., & Yona, G. (1999). Modeling protein families using probabilistic suffix trees. In S. Istrail, P. Pevzner, & M. Waterman (Eds.), RECOMB99: 3rd international conference on computational molecular biology (pp. 15–24).

Bejerano, G., & Yona, G. (2001). Variations on probabilistic suffix trees: Statistical modeling and prediction of protein families. Bioinformatics, 17(1), 23–43.

Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–15854.

Ben-David, S., & Luxburg, U. von. (2008). Relating clustering stability to properties of cluster boundaries. Proceedings of the Conference on Learning Theory (COLT).

Ben-David, S., Luxburg, U. von, & Pál, D. (2006). A sober look on clustering stability. Advances in Neural Information Processing Systems (NeurIPS).

Ben-David, S., Pál, D., & Simon, H.-U. (2007). Stability of k-means clustering. Proceedings of the Conference on Learning Theory (COLT).

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.

Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing.

Berend, D., & Kontorovich, A. (2016). A finite sample analysis of the naive Bayes classifier. Journal of Machine Learning Research.

Bergamin, F., Moreno-Muñoz, P., Hauberg, S., & Arvanitidis, G. (2023). Riemannian Laplace approximations for Bayesian neural networks. Advances in Neural Information Processing Systems.

Berger, J. O., Moreno, E., Pericchi, L. R., Bayarri, M. J., Bernardo, J. M., Cano, J. A., De la Horra, J., Martín, J., Ríos-Insúa, D., Betrò, B., et al. (1994). An overview of robust Bayesian analysis. Test, 3(1), 5–124.

Bergmann, S., Stelzer, S., & Strassburger, S. (2014). On the use of artificial neural networks in simulation-based manufacturing control. Journal of Simulation, 8(1), 76–90.

Berk, R. H. (1966). Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, 37(1), 51–58.

Bernardo, J. M., & Smith, A. F. (2009). Bayesian theory (Vol. 405). John Wiley & Sons.

Bernstein, S. N. (1946). Probability theory (4th ed.).

Bertin-Mahieux, T. (2011). Year Prediction MSD. UCI Machine Learning Repository.

Beygelzimer, A., Dasgupta, S., & Langford, J. (2009). Importance weighted active learning. Proceedings of the International Conference on Machine Learning (ICML).

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., & Schapire, R. E. (2010). Contextual bandit algorithms with supervised learning guarantees. http://arxiv.org/abs/1002.4058.

Bialek, W., Nemenman, I., & Tishby, N. (2001). Predictability, complexity, and learning. Neural Computation, 13, 2409–2463.

Bian, Y., & Chen, H. (2021). When does diversity help generalization in classification ensembles? IEEE Transactions on Cybernetics.

Bietti, A., & Mairal, J. (2019). Group invariance principles for causal generative models. The 22nd International Conference on Artificial Intelligence and Statistics, 557–566.

Bietti, A., Venturi, L., & Bruna, J. (2021). On the sample complexity of learning under geometric stability. Advances in Neural Information Processing Systems, 34.

Billingsley, P. (1995). Probability and measure (3rd ed.). John Wiley & Sons.

Bishop, C. M., & Svensen, M. (2003). Bayesian hierarchical mixtures of experts. Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Bishop, C. M. (1998). Latent variable models. In Learning in graphical models (pp. 371–403). Springer.

Bishop, C. M. (1995a). Neural networks for pattern recognition. Clarendon Press.

Bishop, C. M. (1995b). Neural networks for pattern recognition. Oxford University Press.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Bissiri, P. G., Holmes, C. C., & Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5), 1103–1130.

Blahut, R. E. (1972). Computation of channel capacity and rate distortion functions. IEEE Transactions on Information Theory, 18.

Blanchard, G., & Fleuret, F. (2007). Occam’s hammer. Proceedings of the Conference on Learning Theory (COLT).

Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1, 203–232.

Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773

Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. Proceedings of the 23rd International Conference on Machine Learning, 113–120.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Blei, D., & Lafferty, J. (2009). Topic models. In A. Srivastava & M. Sahami (Eds.), Text mining: Theory and applications. Taylor & Francis.

Blight, B., & Ott, L. (1975). A Bayesian approach to model inadequacy for polynomial regression. Biometrika, 62(1), 79–88.

Bloem-Reddy, B., & Teh, Y. W. (2020). Probabilistic symmetries and invariant neural networks. Journal of Machine Learning Research, 21(90), 1–61.

Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural networks. International Conference on Machine Learning, 1613–1622.

Bogachev, V. I. (1998). Gaussian measures (Vol. 62). American Mathematical Society.

Bogachev, V. I. (2007). Measure theory. Springer.

Bonilla, E. V., Krauth, K., & Dezfouli, A. (2018). Generic inference in latent Gaussian process models. https://arxiv.org/abs/1609.00577

Bonilla, E. V., Krauth, K., & Dezfouli, A. (2019). Generic inference in latent Gaussian process models. Journal of Machine Learning Research, 20(117).

Borchani, H., Martínez, A. M., Masegosa, A. R., Langseth, H., Nielsen, T. D., Salmerón, A., Fernández, A., Madsen, A. L., & Sáez, R. (2015). Modeling concept drift: A probabilistic graphical model based approach. International Symposium on Intelligent Data Analysis, 72–83.

Bork, P. (1992). Mobile modules and motifs. Current Opinion in Structural Biology, 2, 413–421.

Bork, P., & Koonin, E. V. (1996). Protein sequence motifs. Current Opinion in Structural Biology, 6(3), 366–376.

Botev, A., Ritter, H., & Barber, D. (2017). Practical Gauss-Newton optimisation for deep learning. International Conference on Machine Learning, 557–565.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010 (pp. 177–186). Springer.

Bottou, L., & Bousquet, O. (2011). The tradeoffs of large scale learning. In S. Sra, S. Nowozin, & S. J. Wright (Eds.), Optimization for machine learning (pp. 351–368). MIT Press.

Boucheron, S., Lugosi, G., & Bousquet, O. (2004). Concentration inequalities. In O. Bousquet, U. v. Luxburg, & G. Rätsch (Eds.), Advanced lectures in machine learning. Springer.

Boucheron, S., Lugosi, G., & Bousquet, O. (2005). Theory of classification: A survey of recent advances. ESAIM: Probability and Statistics.

Boucheron, S., Lugosi, G., & Massart, P. (2013a). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.

Boucheron, S., Lugosi, G., & Massart, P. (2013b). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research.

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2).

Breiman, L. (1996b). Bagging predictors. Machine Learning, 24(2), 123–140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Bresler, G., Mossel, E., & Sly, A. (2008). Reconstruction of Markov random fields from samples: Some easy observations and algorithms. 11th International Workshop, APPROX 2008, and 12th International Workshop, RANDOM 2008, LNCS 5171.

Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., & Jordan, M. I. (2013). Streaming variational Bayes. In Advances in neural information processing systems 26 (pp. 1727–1735). Curran Associates, Inc.

Bronstein, M. M., Bruna, J., Cohen, T., & Veličković, P. (2021). Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv Preprint arXiv:2104.13478.

Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov chain Monte Carlo. CRC Press.

Brost, B., Cox, I. J., Seldin, Y., & Lioma, C. (2016a). An improved multileaving algorithm for online ranker evaluation. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval: SIGIR ’16. Association for Computing Machinery.

Brost, B., Seldin, Y., Cox, I. J., & Lioma, C. (2016b). Multi-dueling bandits and their application to online ranker evaluation. Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM).

Brown, G. (2009). An information theoretic perspective on multiple classifier systems. International Workshop on Multiple Classifier Systems, 344–353.

Brown, G., Wyatt, J. L., & Tiňo, P. (2005). Managing diversity in regression ensembles. Journal of Machine Learning Research, 6(Sep), 1621–1650.

Brown, L. D. (1986). Fundamentals of statistical exponential families: With applications in statistical decision theory.

Bu, Y., Zou, S., & Veeravalli, V. V. (2020). Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1), 121–130. https://doi.org/10.1109/JSAIT.2020.2991139

Bubeck, S. (2010). Bandits games and clustering foundations [PhD thesis]. Université Lille.

Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5.

Bubeck, S., Li, Y., & Nagaraj, D. M. (2021). A law of robustness for two-layers neural networks. Conference on Learning Theory, 804–820.

Bubeck, S., & Sellke, M. (2023). A universal law of robustness via isoperimetry. Journal of the ACM, 70(2), 1–18.

Bubeck, S., & Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. Proceedings of the Conference on Learning Theory (COLT).

Buetler, T. M., & Eaton, D. L. (1992). Glutathione S-transferases: Amino acid sequence comparison, classification and phylogenetic relationship. Environ. Carcinogen. Ecotoxicol. Rev., C10, 181–203.

Bui, T. D., Nguyen, C. V., & Turner, R. E. (2016a). Deep Gaussian processes for regression using approximate expectation propagation. International Conference on Machine Learning (ICML), 1472–1481.

Bui, T. D., Yan, J., & Turner, R. E. (2017). A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18, 1–72.

Bui, T., Hernández-Lobato, D., Hernández-Lobato, J., Li, Y., & Turner, R. (2016b). Deep Gaussian processes for regression using approximate expectation propagation. International Conference on Machine Learning, 1472–1481.

Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2.

Buntine, W., & Jaakkola, T. (2022). Alpha–divergences, expectation propagation and Bayesian neural networks. International Conference on Artificial Intelligence and Statistics (AISTATS), 1234–1242.

Burt, D. R., & Rasmussen, C. E. (2020). Convergence of sparse variational inference in Gaussian processes. Journal of Machine Learning Research, 21(131), 1–63.

Buschjäger, S., Pfahler, L., & Morik, K. (2020). Generalized negative correlation learning for deep ensembling. arXiv Preprint arXiv:2011.02952.

Cabañas, R., Martínez, A. M., Masegosa, A. R., Ramos-López, D., Salmerón, A., Nielsen, T. D., Langseth, H., & Madsen, A. L. (2016). Financial data analysis with PGMs using AMIDST. 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 1284–1287.

Cantelli, F. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4.

Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., & Stoltz, G. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3).

Casado, I., Ortega, L. A., Masegosa, A. R., & Pérez, A. (2024). PAC-Bayes-Chernoff bounds for unbounded losses. Proceedings of the 38th Conference on Neural Information Processing Systems.

Catoni, O. (2007a). PAC-Bayesian supervised classification: The thermodynamics of statistical learning. IMS Lecture Notes Monograph Series, 56.

Catoni, O. (2007b). PAC-Bayesian supervised classification: The thermodynamics of statistical learning. arXiv Preprint arXiv:0712.0248.

Cesa-Bianchi, N., & Fischer, P. (1998). Finite-time regret bounds for the multiarmed bandit problem. Proceedings of the International Conference on Machine Learning (ICML).

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM, 44(3).

Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Cesa-Bianchi, N., & Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78.

Cesa-Bianchi, N., Lugosi, G., & Stoltz, G. (2005). Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51.

Cesa-Bianchi, N., Mansour, Y., & Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning, 66.

Cesa-Bianchi, N., & Shamir, O. (2017). Bandit regret scaling with the effective loss range. https://arxiv.org/abs/1705.05091.

Chafaï, D. (2004). Entropies, convexity, and functional inequalities, on Φ-entropies and Φ-Sobolev inequalities. Journal of Mathematics of Kyoto University, 44(2), 325–363.

Chaitin, G. J. (1966). On the length of programs for computing finite binary sequences. Journal of the Association for Computing Machinery, 13, 547–569.

Chandra, A., & Yao, X. (2004). DIVACE: Diverse and accurate ensemble learning algorithm. International Conference on Intelligent Data Engineering and Automated Learning, 619–625.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2.

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems (NeurIPS).

Chaudhari, P., & Soatto, S. (2018). Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. International Conference on Learning Representations. https://arxiv.org/abs/1710.11029

Chechik, G., Globerson, A., Tishby, N., & Weiss, Y. (2005). Gaussian information bottleneck. Journal of Machine Learning Research, 6, 165–188.

Chechik, G., & Tishby, N. (2002). Extracting relevant structures with side information. Advances in Neural Information Processing Systems (NeurIPS).

Chen, B., & Frazier, P. I. (2017). Dueling bandits with weak regret. Proceedings of the International Conference on Machine Learning (ICML).

Chen, C.-P., & Qi, F. (2003). The best lower and upper bounds of harmonic sequence.

Chen, S., Dobriban, E., & Lee, J. H. (2020). A group-theoretic framework for data augmentation. The Journal of Machine Learning Research, 21(1), 9885–9955.

Chen, S., Dobriban, E., & Lee, J. H. (2019). Invariance reduces variance: Understanding data augmentation in deep learning and beyond. arXiv Preprint arXiv:1907.10905.

Chen, T., Fox, E., & Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. International Conference on Machine Learning, 1683–1691.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939785

Chen, X., Irie, K., Banks, D., Haslinger, R., Thomas, J., & West, M. (2018). Scalable Bayesian modeling, monitoring, and analysis of dynamic network flow data. Journal of the American Statistical Association, 113(522), 519–533.

Cheng, C.-A., & Boots, B. (2016). Incremental variational sparse Gaussian process regression. Advances in Neural Information Processing Systems, 29, 4403–4411.

Cheng, C.-A., & Boots, B. (2017). Variational inference for Gaussian process models with linear complexity. Advances in Neural Information Processing Systems, 30, 5184–5194.

Cheng, Y., & Church, G. M. (2000). Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB).

Chérief-Abdellatif, B.-E., & Alquier, P. (2019). MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. arXiv Preprint arXiv:1909.13339.

Chernoff, H. (1952a). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23.

Chernoff, H. (1952b). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 493–507.

Cho, H., & Dhillon, I. S. (2008). Co-clustering of human cancer microarrays using minimum sum-squared residue co-clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 5(3).

Cho, H., Dhillon, I. S., Guan, Y., & Sra, S. (2004). Minimum sum-squared residue co-clustering of gene expression data. Proceedings of the Fourth SIAM International Conference on Data Mining.

Cho, Y., & Saul, L. (2009). Kernel methods for deep learning. Advances in Neural Information Processing Systems, 22.

Choi, T., Ramamoorthi, R., et al. (2008). Remarks on consistency of posterior distributions. In Pushing the limits of contemporary statistics: Contributions in honor of Jayanta K. Ghosh (pp. 170–186). Institute of Mathematical Statistics.

Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3), 462–467.

Chui, C. (1992). An introduction to wavelets.

Chung, F., & Lu, L. (2006). Concentration inequalities and martingale inequalities: A survey. Internet Mathematics, 3(2).

Claesen, M., Smet, F. D., Suykens, J. A. K., & Moor, B. D. (2014). EnsembleSVM: A library for ensemble learning using support vector machines. Journal of Machine Learning Research, 15(1).

Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5).

Composite Learning for Artificial Cognitive Systems (CompLACS). (n.d.).

Conway, J. B. (1990). A course in functional analysis (2nd ed.). Springer.

Corfield, D., Schölkopf, B., & Vapnik, V. N. (2009). Falsification and statistical learning theory: Comparing the Popper and Vapnik-Chervonenkis dimensions. Journal for General Philosophy of Science, 40, 51–58.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3).

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. https://archive.ics.uci.edu/ml/datasets/wine+quality

Cover, T. M. (1972). Admissibility properties of Gilbert's encoding for unknown source probabilities. IEEE Transactions on Information Theory, 18.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley & Sons.

Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Wiley Series in Telecommunications and Signal Processing.

Cover, T. M., & Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L., & Spiegelhalter, D. J. (2007). Probabilistic networks and expert systems. Exact computational methods for Bayesian networks. Springer.

Cramér, H. (1938). Sur un nouveau théorème-limite de la théorie des probabilités. Actual. Sci. Ind., 736, 5–23.

Crammer, K., Mohri, M., & Pereira, F. (2009). Gaussian margin machines. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press.

Csiszár, I. (1974). On the computation of rate distortion functions. IEEE Transactions on Information Theory, 20, 122–124.

Csiszár, I., & Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statistics & Decisions, Supplement Issue 1.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). Autoaugment: Learning augmentation strategies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 113–123.

Cunningham, P., & Carney, J. (2000). Diversity versus quality in classification ensembles based on feature selection. European Conference on Machine Learning, 109–116.

Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2017). Random feature expansions for deep Gaussian processes. International Conference on Machine Learning, 884–893.

Da Prato, G., & Zabczyk, J. (2014). Stochastic equations in infinite dimensions (2nd ed.). Cambridge University Press.

Dai, Z., Damianou, A., González, J., & Lawrence, N. (2016). Variational auto-encoded deep Gaussian processes. 4th International Conference on Learning Representations, ICLR 2016.

Damianou, A., & Lawrence, N. D. (2013). Deep Gaussian processes. Artificial Intelligence and Statistics, 207–215.

Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. Proceedings of the Conference on Learning Theory (COLT).

Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., & Ré, C. (2019). A kernel theory of modern data augmentation. International Conference on Machine Learning, 1528–1537.

Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., & Hennig, P. (2021a). Laplace Redux - Effortless Bayesian deep learning. Advances in Neural Information Processing Systems, 34, 20089–20103.

Daxberger, E., Nalisnick, E., Allingham, J. U., Antoran, J., & Hernandez-Lobato, J. M. (2021b). Bayesian deep learning via subnetwork inference. International Conference on Machine Learning, 2510–2521.

Deisenroth, M. P., Fox, D., & Rasmussen, C. E. (2015). Gaussian processes for data-efficient learning in robotics and control (Vol. 118). Springer.

Dembo, A., & Zeitouni, O. (1998). Large deviations techniques and applications (2nd ed., Vol. 38). Springer. https://doi.org/10.1007/978-1-4612-5320-3

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39(1), 1–38.

Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29, 141–142.

Deng, Z., Zhou, F., & Zhu, J. (2022). Accelerated linearized Laplace approximation for Bayesian deep learning. Advances in Neural Information Processing Systems, 35, 2695–2708.

Deng, Z., & Zhu, J. (2023). Bayesadapter: Being Bayesian, inexpensively and reliably, via Bayesian fine-tuning. Asian Conference on Machine Learning, 280–295.

Derbeko, P., El-Yaniv, R., & Meir, R. (2004). Explicit learning curves for transduction and application to clustering and compression algorithms. Journal of Artificial Intelligence Research, 22.

Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer.

Devroye, L., & Lugosi, G. (2001). Combinatorial methods in density estimation. Springer.

Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD).

Dietterich, T. G. (2000). Ensemble methods in machine learning. International Workshop on Multiple Classifier Systems, 1–15.

Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD).

Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp minima can generalize for deep nets. Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1019–1028.

Do, M. N., & Vetterli, M. (2000). Texture similarity measurement using Kullback-Leibler distance on wavelet subbands. In Proceedings of IEEE International Conference on Image Processing, ICIP-2000.

Domingues, R., Michiardi, P., Zouaoui, J., & Filippone, M. (2018). Deep Gaussian process autoencoders for novelty detection. Machine Learning, 107(8), 1363–1383.

Donoho, D., & Huo, X. (2001). Beamlets and multiscale image analysis. Lecture Notes in Computational Science and Engineering: Multiscale and Multiresolution Methods. Springer.

Donsker, M. D., & Varadhan, S. R. S. (1975a). Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics, 28.

Donsker, M. D., & Varadhan, S. R. S. (1975b). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1), 1–47.

Doshi-Velez, F., & Kim, B. (2017). Towards A Rigorous Science of Interpretable Machine Learning. arXiv e-Prints.

Doucet, A., Godsill, S., & Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3), 197–208.

Drucker, H., & Le Cun, Y. (1992). Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6), 991–997.

Du, S. S., Zhai, X., Poczos, B., & Singh, A. (2019). Gradient descent provably optimizes over-parameterized neural networks. International Conference on Learning Representations. https://openreview.net/forum?id=S1eK3i09YQ

Dua, D., & Graff, C. (2019). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml

Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. Wiley-Interscience.

Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. John Wiley & Sons.

Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., & Zoghi, M. (2015). Contextual dueling bandits. Proceedings of the Conference on Learning Theory (COLT).

Dudík, M., Phillips, S. J., & Schapire, R. E. (2007). Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8.

Dupuis, P., & Ellis, R. S. (1997). A weak convergence approach to the theory of large deviations. Wiley-Interscience.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.

Durrett, R. (2019). Probability: Theory and examples (5th ed.). Cambridge University Press.

Dusenberry, M. W., Jerfel, G., Wen, Y., Ma, Y., Snoek, J., Heller, K., Lakshminarayanan, B., & Tran, D. (2020). Efficient and scalable Bayesian neural nets with rank-1 factors. arXiv Preprint arXiv:2005.07186.

Dutordoir, V., Durrande, N., & Hensman, J. (2020). Sparse Gaussian processes with spherical harmonic features. International Conference on Machine Learning, 2793–2802.

Duvenaud, D., Rippel, O., Adams, R., & Ghahramani, Z. (2014). Avoiding pathologies in very deep networks. Artificial Intelligence and Statistics, 202–210.

Dwivedi, R., Khamaru, K., Wainwright, M. J., Jordan, M. I., et al. (2018). Theoretical guarantees for EM under misspecified Gaussian mixture models. Advances in Neural Information Processing Systems, 9681–9689.

Dziugaite, G. K., & Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv Preprint arXiv:1703.11008.

E, W., Engquist, B., Li, X., Ren, W., & Vanden-Eijnden, E. (2004). The heterogeneous multiscale method: A review.

Eckhardt, D. E., & Lee, L. D. (1985). A theoretical basis for the analysis of multiversion software subject to coincident errors. IEEE Transactions on Software Engineering, SE-11(12).

Elesedy, B. (2021). Provably strict generalisation benefit for invariance in kernel methods. Advances in Neural Information Processing Systems, 34.

Elesedy, B. (2022a). Group symmetry in PAC learning. ICLR 2022 Workshop on Geometrical and Topological Representation Learning.

Elesedy, B., & Zaidi, S. (2021a). Provably strict generalisation benefit for equivariant models. International Conference on Machine Learning, 2959–2969.

Elesedy, B. (2022b). Group symmetry in PAC learning. ICLR 2022 Workshop on Geometrical and Topological Representation Learning.

Elesedy, B., & Zaidi, S. (2021b). Provably strict generalisation benefit for equivariant models. International Conference on Machine Learning, 2959–2969.

Ellis, R. S. (2012). Entropy, large deviations, and statistical mechanics (Vol. 271). Springer Science & Business Media.

El-Yaniv, R., Fine, S., & Tishby, N. (1998). Agnostic classification of Markovian sequences. Advances in Neural Information Processing Systems (NeurIPS).

El-Yaniv, R., & Souroujon, O. (2001). Iterative double clustering for unsupervised and semi-supervised learning. Advances in Neural Information Processing Systems (NeurIPS).

Enright, A. J., Iliopoulos, I., Kyrpides, N. C., & Ouzounis, C. A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature, 402(6757), 86–90.

Erven, T. van, Kotłowski, W., & Warmuth, M. K. (2014). Follow the leader with dropout perturbations. Proceedings of the Conference on Learning Theory (COLT).

Eskin, E., Grundy, W., & Singer, Y. (2000). Protein family classification using sparse Markov transducers. ISMB2000.

Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7.

Fard, M. M., & Pineau, J. (2010). PAC-Bayesian model selection for reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS).

Feinholz, L. (1979). Estimation of the performance of partitioning algorithms in pattern classification [Master’s thesis]. Department of Mathematics, McGill University.

Feldman, V. (2020). Does learning require memorization? A short tale about a long tail. Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 954–959.

Feldman, V., & Zhang, C. (2020). What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33, 2881–2891.

Feller, W. (1971). An introduction to probability theory and its applications. Wiley-Interscience.

Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008.

Feng, J., & Kurtz, T. G. (2006). Large deviations for stochastic processes (Vol. 131). American Mathematical Society. https://doi.org/10.1090/surv/131

Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch geometric. Representation Learning on Graphs and Manifolds Workshop, ICLR. https://arxiv.org/abs/1903.02428

Filippone, M., & Engler, U. (2015). Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 546–560.

Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical Hidden Markov Model: Analysis and applications. Machine Learning, 32, 41–62.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222, 309–368.

Fisher, R. A. (1925). Theory of statistical estimation. Trans. Cambridge Philos. Soc., 22, 700.

Flam-Shepherd, D., Requeima, J., & Duvenaud, D. (2017). Mapping Gaussian process priors to Bayesian neural networks. NIPS Bayesian Deep Learning Workshop, 3.

Föll, R., & Steinwart, I. (2019). PAC-Bayesian bounds for deep Gaussian processes. arXiv Preprint arXiv:1909.09985.

Foong, A. Y., Li, Y., Hernández-Lobato, J. M., & Turner, R. E. (2019). “In-Between” Uncertainty in Bayesian neural networks. ICML Workshop on Uncertainty and Robustness in Deep Learning.

Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2021). Sharpness-aware minimization for efficiently improving generalization. International Conference on Learning Representations. https://openreview.net/forum?id=6Tm1mposlrM

Fort, S., Hu, H., & Lakshminarayanan, B. (2019). Deep ensembles: A loss landscape perspective. arXiv Preprint arXiv:1912.02757.

Fortunati, S., Gini, F., Greco, M. S., & Richmond, C. D. (2017). Performance bounds for parameter estimation under misspecified models: Fundamental findings and applications. IEEE Signal Processing Magazine, 34(6), 142–157.

Foster, D. P., & Rakhlin, A. (2012). No internal regret via neighborhood watch. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability, 3(1).

Freidlin, M. I., & Wentzell, A. D. (1984). Random perturbations of dynamical systems (Vol. 260). Springer. https://doi.org/10.1007/978-1-4612-0197-6

Freitag, D. (2004). Trained named entity recognition using distributional clusters. Proceedings of EMNLP.

Frejstrup Maibing, S., & Igel, C. (2015). Computational complexity of linear large margin classification with ramp loss. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Freund, Y., & Ron, D. (1995). Learning to model sequences generated by switching distributions. COLT ’95: Proceedings of the Eighth Annual Conference on Computational Learning Theory, 8, 41–50.

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning (ICML).

Freund, Y., Schapire, R., & Abe, N. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771–780.

Friedman, B. (1990). Principles and techniques of applied mathematics. Courier Dover Publications.

Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378.

Friedman, N. (1998). The Bayesian structural EM algorithm. Proceedings of the Conference on Uncertainty in Artificial Intelligence, 129–138.

Friedman, N., & Koller, D. (2003). Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning Journal.

Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005). Mining data streams: A review. ACM Sigmod Record, 34(2), 18–26.

Gabillon, V., Ghavamzadeh, M., & Lazaric, A. (2012). Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems (NeurIPS).

Gaillard, P., Stoltz, G., & Erven, T. van. (2014). A second-order bound with excess losses. Proceedings of the Conference on Learning Theory (COLT).

Gajane, P., Urvoy, T., & Clérot, F. (2015). A relative exponential weighing algorithm for adversarial utility-based dueling bandits. Proceedings of the International Conference on Machine Learning (ICML).

Gal, Y. (2016). Uncertainty in deep learning [PhD thesis]. University of Cambridge.

Gal, Y., & Turner, R. (2015a). Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. International Conference on Machine Learning, 655–664.

Gal, Y., & Turner, R. E. (2015b). Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. International Conference on Machine Learning (ICML), 655–664.

Gama, J., & Rodrigues, P. P. (2009). An overview on mining data streams. In Foundations of computational intelligence, Volume 6 (pp. 29–45). Springer.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 44:1–44:37.

Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., & Wilson, A. G. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems 31.

Garivier, A., & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. Proceedings of the Conference on Learning Theory (COLT).

Gastpar, M., Nachum, I., Shafer, J., & Weinberger, T. (2024b). Fantastic generalization measures are nowhere to be found. Proceedings of the 12th International Conference on Learning Representations (ICLR 2024).

Gastpar, M., Nachum, I., Shafer, J., & Weinberger, T. (2024a). Fantastic generalization measures are nowhere to be found. International Conference on Learning Representations.

Gelfand, A. E., & Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 398–409.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). CRC Press.

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.

Geng, C., Wang, J., Gao, Z., Frellsen, J., & Hauberg, S. (2021). Bounds all around: Training energy-based models with bidirectional bounds. Advances in Neural Information Processing Systems (NeurIPS) 34.

Genuer, R. (2012). Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3), 543–562.

George, T., & Merugu, S. (2005). A scalable collaborative filtering framework based on co-clustering. Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05).

Gerchinovitz, S., & Lattimore, T. (2016). Refined lower bounds for adversarial bandits. Advances in Neural Information Processing Systems (NeurIPS).

Germain, P., Bach, F., Lacoste, A., & Lacoste-Julien, S. (2016). PAC-Bayesian theory meets Bayesian inference. Advances in Neural Information Processing Systems, 1884–1892.

Germain, P., Lacasse, A., Laviolette, F., & Marchand, M. (2006). PAC-Bayes risk bounds for general loss functions. Advances in Neural Information Processing Systems (NeurIPS).

Germain, P., Lacasse, A., Laviolette, F., & Marchand, M. (2009b). PAC-Bayesian learning of linear classifiers. Proceedings of the 26th International Conference on Machine Learning (ICML 2009), 353–360.

Germain, P., Lacasse, A., Laviolette, F., & Marchand, M. (2009a). PAC-Bayesian learning of linear classifiers. Proceedings of the International Conference on Machine Learning (ICML).

Germain, P., Lacasse, A., Laviolette, F., Marchand, M., & Roy, J.-F. (2015). Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16.

Germain, P., Lacoste, A., Laviolette, F., Marchand, M., & Shanian, S. (2011). A PAC-Bayes Sample Compression Approach to Kernel Methods. Proceedings of the International Conference on Machine Learning (ICML).

Getoor, L., Friedman, N., Koller, D., Pfeffer, A., & Taskar, B. (2007). Probabilistic relational models. In L. Getoor & B. Taskar (Eds.), Introduction to statistical relational learning. MIT Press.

Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553), 452–459.

Ghahramani, Z., & Attias, H. (2000). Online variational Bayesian learning. Slides from Talk Presented at NIPS Workshop on Online Learning.

Gilbert, A. C., Zhang, Y., Lee, K., Zhang, Y., & Lee, H. (2017). Towards understanding the invertibility of convolutional neural networks. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1703–1710. https://doi.org/10.24963/ijcai.2017/236

Gilbert, E. N. (1971). Codes based on inaccurate source probabilities. IEEE Transactions on Information Theory, 17(3).

Gilks, W. R., Richardson, S., & Spiegelhalter, D. (1995). Markov chain Monte Carlo in practice. CRC Press.

Girard, A., Rasmussen, C. E., Quiñonero-Candela, J., & Murray-Smith, R. (2003). Gaussian process priors with uncertain inputs — application to multiple-step ahead time series forecasting. Advances in Neural Information Processing Systems 15 (NIPS 2002), 529–536.

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587.

Glivenko, V. (1933). Sulla determinazione empirica di probabilita. G. Inst. Ital. Attuari, 4.

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.

Goldberger, J., Greenspan, H., & Gordon, S. (2002). Unsupervised image clustering using the information bottleneck method. DAGM-Symposium, 158–165.

Goldman, S. A., & Kearns, M. J. (1995). On the complexity of teaching. Journal of Computer and System Sciences, 50.

Golowich, N., Rakhlin, A., & Shamir, O. (2018). Size-independent sample complexity of neural networks. Conference on Learning Theory, 297–299.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). The Johns Hopkins University Press.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning (Vol. 1). MIT Press.

Goodfellow, I., Lee, H., Le, Q., Saxe, A., & Ng, A. (2009). Measuring invariances in deep networks. Advances in Neural Information Processing Systems, 22.

Gouk, H., Frank, E., Pfahringer, B., & Cree, M. J. (2021). Regularisation of neural networks by enforcing Lipschitz continuity. Machine Learning, 110, 393–416.

Graepel, T., Herbrich, R., & Shawe-Taylor, J. (2005). PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1-2).

Graves, A. (2011). Practical variational inference for neural networks. Advances in Neural Information Processing Systems, 24, 2348–2356.

Gray, R. M. (2011). Entropy and information theory (2nd ed.). Springer.

Grimmett, G. R., & Stirzaker, D. R. (2001). Probability and random processes (3rd ed.). Oxford University Press.

DeGroot, M. H. (1970). Optimal statistical decisions. McGraw-Hill.

Gross, L. (1967). Abstract Wiener spaces.

Gross, L. (2011). Lectures on the Gaussian measure. Mathematical Notes.

Grünwald, P. (2007a). The minimum description length principle. MIT Press.

Grünwald, P. (2012). The safe Bayesian: Learning the learning rate via the mixability gap. International Conference on Algorithmic Learning Theory, 169–183.

Grünwald, P. (2018). Safe probability. Journal of Statistical Planning and Inference, 195, 47–63.

Grünwald, P. D. (2007b). The minimum description length principle. MIT Press.

Grünwald, P. D., & Mehta, N. A. (2016). Fast rates for general unbounded loss functions: From ERM to generalized Bayes. arXiv Preprint arXiv:1605.00252.

Grünwald, P., & Van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4), 1069–1103.

Guedj, B. (2019). A primer on PAC-Bayesian learning. arXiv Preprint arXiv:1901.05353.

Gummadi, K. P., Saroiu, S., & Gribble, S. D. (2002). King: Estimating latency between arbitrary internet end hosts. Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement (IMW-2002).

Gunasekar, S., Lee, J. D., Soudry, D., & Srebro, N. (2018). Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems, 31.

Gunsel, B., Ferman, A., & Tekalp, A. (1998). Temporal video segmentation using unsupervised clustering and semantic object tracking. Journal of Electronic Imaging, 7(3), 592–604.

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. International Conference on Machine Learning, 1321–1330.

Gupta, S., & Gupta, A. (2019). Dealing with noise problem in machine learning data-sets: A systematic review. Procedia Computer Science, 161, 466–474.

Guyon, I., Luxburg, U. von, & Williamson, R. C. (2009). Clustering: Science or art? Towards principled approaches. NIPS workshop.

Hajjo, R., Sabbah, D. A., Bardaweel, S. K., & Tropsha, A. (2021). Identification of tumor-specific MRI biomarkers using machine learning (ML). Diagnostics, 11(5), 742.

Hamilton, J. D. (1994). Time series analysis (Vol. 2). Princeton University Press.

Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10).

Hardt, M., & Ma, T. (2017). Identity matters in deep learning. International Conference on Learning Representations.

Harries, M. (1999). Splice-2 comparative evaluation: Electricity pricing (Technical Report UNSW-CSE-TR-9905). School of Computer Science and Engineering, The University of New South Wales.

Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337).

Hasenclever, L., Webb, S., Lienart, T., Vollmer, S., Lakshminarayanan, B., Blundell, C., & Teh, Y. W. (2017). Distributed Bayesian learning with stochastic natural gradient expectation propagation and the posterior server. Journal of Machine Learning Research, 18(106), 1–37.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (Second). Springer.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications.

Havasi, M., Hernández-Lobato, J. M., & Murillo-Fuentes, J. J. (2018). Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo. Advances in Neural Information Processing Systems, 31.

Hayes, J. D., & Pulford, D. J. (1995). The glutathione S-transferase supergene family: Regulation of GST and the contribution of the isoenzymes to cancer chemoprotection and drug resistance. Crit. Rev. Biochem. Mol. Biol., 30(6), 445–600.

Hazan, E., & Kale, S. (2009). Better algorithms for benign bandits. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).

Hazan, E., & Kale, S. (2011). Better algorithms for benign bandits. Journal of Machine Learning Research, 12.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision, 1026–1034.

He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity mappings in deep residual networks. European Conference on Computer Vision, 630–645.

He, X., Zemel, R., & Carreira-Perpinan, M. (2004). Multiscale conditional random fields for image labelling. CVPR-2004: IEEE Conference on Computer Vision and Pattern Recognition.

Heckerman, D., Geiger, D., & Chickering, D. M. (1995a). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197–243.

Heckerman, D., Meek, C., & Koller, D. (2007). Probabilistic entity-relationship models, PRMs, and plate models. In L. Getoor & B. Taskar (Eds.), Introduction to statistical relational learning. MIT Press.

Heckerman, D., Geiger, D., & Chickering, D. (1995b). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.

Heitz, G., Gould, S., Saxena, A., & Koller, D. (2009). Cascaded classification models: Combining models for holistic scene understanding. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems 21.

Hensman, J., Durrande, N., Solin, A., et al. (2017). Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(1), 5537–5588.

Hensman, J., Fusi, N., & Lawrence, N. D. (2013). Gaussian processes for big data. Conference on Uncertainty in Artificial Intelligence (UAI), 282–290.

Hensman, J., Matthews, A. G. de G., Filippone, M., & Ghahramani, Z. (2015). MCMC for variationally sparse Gaussian processes. https://arxiv.org/abs/1506.04000

Hensman, J., Matthews, A. G. de G., & Filippone, M. (2018). Variational inference in Gaussian process models using the differential output training conditional. arXiv Preprint arXiv:1805.07109.

Herbster, M., & Warmuth, M. K. (1998). Tracking the best expert. Machine Learning, 32, 151.

Herdegen, M. (2008). The theorem of Bahadur and Rao and large portfolio losses. Journal of Applied Mathematics, 2011.

Herlocker, J., Konstan, J., Terveen, L., & Riedl, J. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1).

Hermes, L., Zöller, T., & Buhmann, J. (2002). Parametric distributional clustering for image segmentation. European Conference on Computer Vision.

Hernández-Lobato, J. M., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., & Turner, R. (2016). Black-box alpha divergence minimization. International Conference on Machine Learning, 1511–1520.

Hernández-Lobato, D., Hernández-Lobato, J., & Dupont, P. (2011). Robust multi-class Gaussian process classification. Advances in Neural Information Processing Systems, 24.

Hernández-Lobato, J. M., & Adams, R. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. International Conference on Machine Learning, 1861–1869.

Hernández-Muñoz, G., Villacampa-Calvo, C., & Hernández-Lobato, D. (2020). Deep Gaussian processes using expectation propagation and Monte Carlo methods. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 479–494.

Higgs, M., & Shawe-Taylor, J. (2010). A PAC-Bayes bound for tailored density estimation. Proceedings of the International Conference on Algorithmic Learning Theory (ALT).

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29, 82–97.

Hoch, T. (2015). An ensemble learning approach for the Kaggle taxi travel time prediction challenge. Proceedings of the 2015th International Conference on ECML PKDD Discovery Challenge-Volume 1526, 52–62.

Hochreiter, S., & Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1), 1–42.

Hochstein, S., Barlasov, A., Hershler, O., Nitzan, A., & Shneor, S. (2004). Rapid vision is holistic. Journal of Vision, 4(5).

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 382–401.

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14, 1303–1347.

Hofmann, K., Bucher, P., Falquet, L., & Bairoch, A. (1999). The PROSITE database, its status in 1999. Nucleic Acids Research, 27(1), 215–219.

Hofmann, T. (1997). Data clustering and beyond: A deterministic annealing framework for exploratory data analysis. Shaker Verlag.

Hofmann, T. (1999a). Probabilistic latent semantic analysis. UAI-1999.

Hofmann, T. (1999b). Probabilistic latent semantic indexing. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Hofmann, T., Puzicha, J., & Buhmann, J. M. (1998). Unsupervised texture segmentation in a deterministic annealing framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 803–818.

Holland, M. (2019). PAC-Bayes under potentially heavy tails. Advances in Neural Information Processing Systems, 2711–2720.

Holmes, C. C., & Walker, S. G. (2015). Assigning a value to a power series: A Bayesian interpretation. Biometrika, 102, 497–501.

Honkela, A., & Valpola, H. (2003). On-line variational Bayesian learning. 4th International Symposium on Independent Component Analysis and Blind Signal Separation, 803–808.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.

Hossein Pishro-Nik. (2018). Introduction to probability, statistics, and random processes. https://doi.org/10.25334/Q40H8J

Hu, W., Li, Z., & Yu, D. (2020). Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. International Conference on Learning Representations.

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free. arXiv Preprint arXiv:1704.00109.

Hytönen, T., Van Neerven, J., Veraar, M., & Weis, L. (2018). Analysis in Banach spaces: Volume II: Probabilistic methods and operator theory (Vol. 67). Springer.

Ibrahim, J. G., & Chen, M.-H. (2000). Power prior distributions for regression models. Statistical Science, 46–60.

Ibrahim, J. G., Chen, M.-H., & Sinha, D. (2003). On optimality properties of the power prior. Journal of the American Statistical Association, 98(461), 204–213.

Immer, A., Korzepa, M., & Bauer, M. (2021). Improving predictions of Bayesian neural nets via local linearization. International Conference on Artificial Intelligence and Statistics, 703–711.

Insua, D. R., & Ruggeri, F. (2012). Robust Bayesian analysis (Vol. 152). Springer Science & Business Media.

Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., & Wilson, A. G. (2020). Subspace inference for Bayesian deep learning. Uncertainty in Artificial Intelligence, 1169–1179.

Jaakkola, T. S. (2001). Tutorial on variational approximation methods. In M. Opper & D. Saad (Eds.), Advanced mean field methods: Theory and practice (pp. 129–160). MIT Press.

Jaakkola, T., Diekhans, M., & Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB).

Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31.

Jaffe, A., Fetaya, E., Nadler, B., Jiang, T., & Kluger, Y. (2016). Unsupervised ensemble learning with dependent classifiers. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3).

Jain, S., Liu, G., Mueller, J., & Gifford, D. (2020). Maximizing overall diversity for improved uncertainty estimates in deep ensembles. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 4264–4271.

Jaksch, T., Ortner, R., & Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11.

Jankowiak, M., Pleiss, G., & Gardner, J. R. (2019). Parametric Gaussian process regressors. arXiv Preprint arXiv:1910.07123.

Janzing, D., & Schölkopf, B. (2009). Algorithmic Markov condition for probability-free causal inference. Proceedings of the Conference on Learning Theory (COLT).

Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106.

Jeanmougin, F., Thompson, J., Gouy, M., Higgins, D., & Gibson, T. (1998). Multiple sequence alignment with Clustal X. Trends Biochem. Sci., 23, 403–405.

Jewson, J., Smith, J. Q., & Holmes, C. (2018). Principles of Bayesian inference using general divergence criteria. Entropy, 20(6), 442.

Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., & Bengio, S. (2020a). Fantastic generalization measures and where to find them. International Conference on Learning Representations.

Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., & Bengio, S. (2020b). Fantastic generalization measures and where to find them. Proceedings of the 8th International Conference on Learning Representations (ICLR 2020).

Jiang, Z., Liu, H., Fu, B., & Wu, Z. (2017). Generalized ambiguity decompositions for classification with applications in active learning and unsupervised ensemble pruning. Proceedings of the AAAI Conference on Artificial Intelligence, 31.

Jordan, M. I. (1999). An introduction to variational methods for graphical models. Springer.

Jordan, M. I., & Jacobs, R. A. (1993). Hierarchical mixtures of experts and the EM algorithm (AIM-1440; p. 29).

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence.

Kaelbling, L. P. (1994). Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15.

Kalchbrenner, N., & Blunsom, P. (2013). Recurrent continuous translation models. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1700–1709.

Kale, S. (2014). Multiarmed bandits with limited expert advice. Proceedings of the Conference on Learning Theory (COLT).

Karnin, Z., Koren, T., & Somekh, O. (2013). Almost optimal exploration in multi-armed bandits. Proceedings of the International Conference on Machine Learning (ICML).

Kárný, M. (2014). Approximate Bayesian recursive estimation. Information Sciences, 285, 100–111.

Kathuria, T., Deshpande, A., & Singh, P. (2016). Batson–Spielman–Srivastava sparsification and detour ranking in determinantal point processes. Advances in Neural Information Processing Systems 29, 152–160.

Kaufmann, E., Korda, N., & Munos, R. (2012). Thompson sampling: An optimal finite time analysis. Proceedings of the International Conference on Algorithmic Learning Theory (ALT).

Kawaguchi, K., Kaelbling, L. P., & Bengio, Y. (2022). Generalization in deep learning. In Mathematical aspects of deep learning. Cambridge University Press. https://doi.org/10.1017/9781009025096.003

Kearns, M. J., & Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11.

Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory. The MIT Press.

Kearns, M., Mansour, Y., Ng, A., & Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27.

Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 5574–5584.

Keren, D. (n.d.). Recognizing image style and activities in video using local features and naive Bayes.

Keshet, J., McAllester, D., & Hazan, T. (2011). PAC-Bayesian approach for minimization of phoneme error rate. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Keshet, J., Shalev-Shwartz, S., Singer, Y., & Chazan, D. (2005). Phoneme alignment based on discriminative learning. 9th European Conference on Speech Communication and Technology (INTERSPEECH).

Keskar, N. S., Nocedal, J., Tang, P., Mudigere, D., & Smelyanskiy, M. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. Proceedings of ICLR 2017.

Khan, M. E. E., Immer, A., Abedi, E., & Korzepa, M. (2019). Approximate inference turns deep networks into Gaussian processes. Advances in Neural Information Processing Systems, 32, 3094–3104.

Kilbertus, N., Gomez-Rodriguez, M., Schölkopf, B., Muandet, K., & Valera, I. (2019). Improving consequential decision making under imperfect predictions.

Kim, Y.-D., & Choi, S. (2007). Nonnegative Tucker decomposition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations.

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. International Conference on Learning Representations.

Kleijn, B. J. K., & Van der Vaart, A. W. (2012). The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6, 354–381.

Klenke, A. (2013). Probability theory: A comprehensive course (2nd ed.). Springer.

Kluger, Y., Basri, R., Chang, J. T., & Gerstein, M. (2003). Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research.

Knoblauch, J. (2019). Robust deep Gaussian processes. arXiv Preprint arXiv:1904.02303.

Knoblauch, J., Jewson, J., & Damoulas, T. (2019a). Generalized variational inference. arXiv Preprint arXiv:1904.02063.

Knoblauch, J., Jewson, J., & Damoulas, T. (2019b). Generalized variational inference. arXiv Preprint arXiv:1904.02063.

Knoblauch, J., Jewson, J., & Damoulas, T. (2022). An optimization-centric view on Bayes’ rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132), 1–109.

Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT press.

Kolmogorov, A. N. (1933a). Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari, 4.

Kolmogorov, A. N. (1933b). Grundbegriffe der Wahrscheinlichkeitsrechnung. Ergebnisse der Mathematik.

Kolmogorov, A. N. (1956). Foundations of the theory of probability (2nd English ed.). Chelsea / Addison–Wesley.

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission, 1, 1–7.

Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory.

Komiyama, J., Honda, J., Kashima, H., & Nakagawa, H. (2015). Regret lower bound and optimal algorithm in dueling bandit problem. Proceedings of the Conference on Learning Theory (COLT).

Kontorovich, A., & Raginsky, M. (2017). Concentration of measure without independence: A unified approach via the martingale method. In Convexity and concentration. Springer, New York, NY.

Koolen, W. M., & Erven, T. van. (2015). Second-order quantile methods for experts and combinatorial games. Proceedings of the Conference on Learning Theory (COLT).

Koolen, W. M., Erven, T. van, & Grünwald, P. (2014). Learning the learning rate for prediction with expert advice. Advances in Neural Information Processing Systems (NeurIPS).

Krause, A., & Ong, C. S. (2011). Contextual Gaussian process bandit optimization. Advances in Neural Information Processing Systems (NeurIPS).

Krichevskiy, R. E. (1998). Laplace's law of succession and universal encoding. IEEE Transactions on Information Theory, 44(1).

Krichevsky, R. E., & Trofimov, V. K. (1981). The performance of universal coding. IEEE Transactions on Information Theory, IT-27, 199–207.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Toronto, ON, Canada.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097–1105.

Krogh, A., & Hertz, J. (1991). A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4.

Krogh, A., & Vedelsby, J. (1994). Neural network ensembles, cross validation and active learning. Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS’94, 231–238.

Krupka, E. (2008). Generalization from observed to unobserved features [PhD thesis]. The Hebrew University of Jerusalem.

Krupka, E., & Tishby, N. (2005). Generalization in clustering with unobserved features. Advances in Neural Information Processing Systems (NeurIPS).

Krupka, E., & Tishby, N. (2008). Generalization from observed to unobserved features by clustering. Journal of Machine Learning Research, 9.

Kullback, S., & Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22.

Kuncheva, L. I., & Whitaker, C. J. (2003). Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2), 181–207.

Kuo, H.-H. (2006). Gaussian measures in Banach spaces. In Gaussian measures in Banach spaces (pp. 1–109). Springer.

Kveton, B., Wen, Z., Ashkan, A., & Szepesvári, C. (2015). Tight regret bounds for stochastic combinatorial semi-bandits. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Lacasse, A., Laviolette, F., Marchand, M., Germain, P., & Usunier, N. (2007). PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. Advances in Neural Information Processing Systems (NeurIPS).

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the International Conference on Machine Learning (ICML).

Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6.

Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 6402–6413.

Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004). Stability based validation of clustering solutions. Neural Computation.

Langford, J. (2005). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6.

Langford, J., & Seeger, M. (2001). Bounds for averaging classifiers. Citeseer.

Langford, J., & Shawe-Taylor, J. (2002). PAC-Bayes & margins. Advances in Neural Information Processing Systems (NeurIPS).

Langford, J., & Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems (NeurIPS).

Lashkari, D., & Golland, P. (2009). Co-clustering with generative models. MIT-CSAIL-TR-2009-054.

Lauritzen, S. L. (1992). Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420), 1098–1108.

Lauritzen, S. L. (1996). Graphical models. Clarendon Press (Oxford University Press).

Laviolette, F., & Marchand, M. (2005). PAC-Bayes risk bounds for sample-compressed Gibbs classifiers. Proceedings of the International Conference on Machine Learning (ICML).

Laviolette, F., & Marchand, M. (2007). PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. Journal of Machine Learning Research, 8.

Laviolette, F., Marchand, M., & Roy, J.-F. (2011). From PAC-Bayes bounds to quadratic programs for majority votes. Proceedings of the International Conference on Machine Learning (ICML).

Laviolette, F., Morvant, E., Ralaivola, L., & Roy, J.-F. (2017). Risk upper bounds for general ensemble methods with an application to multiclass classification. Neurocomputing, 219.

Lawrence, N. D. (2001). Variational inference in probabilistic models [PhD thesis]. Citeseer.

Lawrence, N. D., & Moore, A. J. (2007). Hierarchical Gaussian process latent variable models. Proceedings of the 24th International Conference on Machine Learning, 481–488.

Lázaro-Gredilla, M. (2010a). Inter-domain Gaussian processes for sparse inference using inducing features. Advances in Neural Information Processing Systems 23, 1087–1095.

Lázaro-Gredilla, M. (2010b). Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11, 1865–1881.

Le, Q., Sarlós, T., & Smola, A. (2013). Fastfood—approximating kernel expansions in loglinear time. International Conference on Machine Learning, 244–252.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401.

Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems (NeurIPS).

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., & Sohl-Dickstein, J. (2018). Deep neural networks as Gaussian processes. 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Lee, J., Feng, J., Humt, M., Müller, M. G., & Triebel, R. (2022). Trust your robots! Predictive uncertainty estimation of neural networks with sparse Gaussian processes. Conference on Robot Learning, 1168–1179.

Lee, S., Purushwalkam, S., Cogswell, M., Ranjan, V., Crandall, D. J., & Batra, D. (2016). Stochastic multiple choice learning for training diverse deep ensembles. Advances in Neural Information Processing Systems (NeurIPS).

Leibig, C., Allken, V., Ayhan, M. S., Berens, P., & Wahl, S. (2017). Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports, 7, 1–14.

Letarte, G., Germain, P., Guedj, B., & Laviolette, F. (2019). Dichotomize and generalize: PAC-Bayesian binary activated deep neural networks. Advances in Neural Information Processing Systems, 6869–6879.

Letham, B., Rudin, C., McCormick, T. H., & Madigan, D. (2015). Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3), 1350–1371.

Lever, G., Laviolette, F., & Shawe-Taylor, J. (2010). Distribution-dependent PAC-Bayes priors. Proceedings of the International Conference on Algorithmic Learning Theory (ALT).

Lever, G., Laviolette, F., & Shawe-Taylor, J. (2013). Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473.

Levine, E., & Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation.

Levinson, J., Askeland, J., Becker, J., Dolson, J., Held, D., Kammel, S., Kolter, J. Z., Langer, D., Pink, O., Pratt, V., et al. (2011). Towards fully autonomous driving: Systems and algorithms. 2011 IEEE Intelligent Vehicles Symposium (IV), 163–168.

Li, H., & Abe, N. (1998). Word clustering and disambiguation based on co-occurrence data. Proceedings of the 17th International Conference on Computational Linguistics.

Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2017). Visualizing the loss landscape of neural nets. arXiv Preprint arXiv:1712.09913.

Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. Proceedings of the International Conference on World Wide Web (WWW).

Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proceedings of the ACM International Conference on Web Search and Data Mining.

Li, X., & Orabona, F. (2019). On the convergence of stochastic gradient descent with adaptive stepsizes. The 22nd International Conference on Artificial Intelligence and Statistics, 983–992.

Li, Y., & Gal, Y. (2017). Dropout inference in Bayesian neural networks with alpha-divergences. International Conference on Machine Learning, 2052–2061.

Li, Y., Hernández-Lobato, J. M., & Turner, R. E. (2015). Stochastic expectation propagation. Advances in Neural Information Processing Systems, 28.

Li, Y., & Liu, Q. (2016). Wild variational approximations. NIPS Workshop on Advances in Approximate Bayesian Inference.

Li, Y., & Turner, R. E. (2016a). Rényi divergence variational inference. Advances in Neural Information Processing Systems, 29.

Li, Y., & Turner, R. E. (2016b). Rényi divergence variational inference. Advances in Neural Information Processing Systems, 29, 1073–1081.

Liang, T., Poggio, T., Rakhlin, A., & Stokes, J. (2019). Fisher-Rao metric, geometry, and complexity of neural networks. The 22nd International Conference on Artificial Intelligence and Statistics, 888–896.

Liao, J., & Berg, A. (2019). Sharpening Jensen’s inequality. The American Statistician, 73(3), 278–281.

Lifshits, M. (2012). Gaussian random functions. Springer.

Lin, J. A., Antorán, J., Padhy, S., Janz, D., Hernández-Lobato, J. M., & Terenin, A. (2024). Sampling from Gaussian process posteriors using stochastic gradient descent. Advances in Neural Information Processing Systems, 36.

Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108.

Littlewood, B., & Miller, D. R. (1989). Conceptual modeling of coincident failures in multiversion software. IEEE Transactions on Software Engineering, 15(12), 1596–1614.

Liu, J. Z., Padhy, S., Ren, J., Lin, Z., Wen, Y., Jerfel, G., Nado, Z., Snoek, J., Tran, D., & Lakshminarayanan, B. (2023). A simple approach to improve single-model deep uncertainty via distance-awareness. Journal of Machine Learning Research, 24, 1–63.

Liu, J., Paisley, J., Kioumourtzoglou, M.-A., & Coull, B. (2019). Accurate uncertainty estimation and decomposition in ensemble learning. Advances in Neural Information Processing Systems, 8950–8961.

Liu, Q., & Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 2378–2386.

Liu, Y., & Yao, X. (1999). Ensemble learning via negative correlation. Neural Networks, 12(10), 1399–1404.

Livni, R., Shalev-Shwartz, S., & Shamir, O. (2014). On the computational efficiency of training neural networks. Advances in Neural Information Processing Systems, 27.

Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.

London, B., Huang, B., Taskar, B., & Getoor, L. (2014). PAC-Bayesian collective stability. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Lorenzen, S. S., Igel, C., & Seldin, Y. (2019a). On PAC-Bayesian bounds for random forests. Machine Learning, 108(8-9), 1503–1522.

Lorenzen, S. S., Igel, C., & Seldin, Y. (2019b). On PAC-Bayesian bounds for random forests. Machine Learning, 108(8-9).

Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv Preprint arXiv:1711.05101.

Lu, Z., Wu, X., Zhu, X., & Bongard, J. (2010). Ensemble pruning via individual contribution ordering. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 871–880.

Luo, H., & Schapire, R. E. (2015). Achieving all with no parameters: AdaNormalHedge. Proceedings of the Conference on Learning Theory (COLT).

Luxburg, U. von, & Ben-David, S. (2005). Towards a statistical theory of clustering. PASCAL Workshop on Statistics and Optimization of Clustering.

Lyddon, S., Walker, S., & Holmes, C. C. (2018). Nonparametric learning from Bayesian models with randomized objective functions. Advances in Neural Information Processing Systems, 2071–2081.

Lykouris, T., Mirrokni, V., & Leme, R. P. (2018). Stochastic bandits robust to adversarial corruptions. Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing.

Lyle, C., Wilk, M. van der, & Kabán, A. (2020a). The benefits of invariance in neural networks. arXiv Preprint arXiv:2005.00178.

Lyle, C., Wilk, M. van der, Kwiatkowska, M., Gal, Y., & Bloem-Reddy, B. (2020b). On the benefits of invariance in neural networks. arXiv Preprint arXiv:2005.00178.

Ma, C., & Hernández-Lobato, J. M. (2021). Functional variational inference based on stochastic process generators. Advances in Neural Information Processing Systems, 34, 21795–21807.

Ma, C., Li, Y., & Hernández-Lobato, J. M. (2019). Variational implicit processes. International Conference on Machine Learning (ICML), 4222–4233.

Ma, S., Bassily, R., & Belkin, M. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. International Conference on Machine Learning, 3325–3334.

Ma, Y., & Deisenroth, M. P. (2019). A variational Bayesian treatment of implicit processes. Statistics and Computing, 29, 1145–1165.

MacKay, D. J. C. (1992a). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.

MacKay, D. J. C. (1992b). Bayesian interpolation. Neural Computation, 4(3), 415–447. https://doi.org/10.1162/neco.1992.4.3.415

MacKay, D. J. C. (1992c). The evidence framework applied to classification networks. Neural Computation, 4(5), 720–736.

MacKay, D. J. C. (1995). Probable networks and plausible predictions-a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3), 469.

MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press.

Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., & Wilson, A. G. (2019). A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32, 13153–13164.

Maddox, W., Garipov, T., Izmailov, P., & Wilson, A. G. (2020). A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems 33, 13153–13164.

Madeira, S. C., & Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(1).

Magureanu, S., Combes, R., & Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. Advances in Neural Information Processing Systems (NeurIPS).

Maillard, O.-A. (2011). Apprentissage séquentiel: Bandits, statistique et renforcement [PhD thesis]. INRIA Lille.

Maillard, O.-A., Munos, R., & Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. Proceedings of the Conference on Learning Theory (COLT).

TorchVision maintainers and contributors. (2016). TorchVision: PyTorch’s computer vision library. GitHub repository. https://github.com/pytorch/vision

Mandal, M. K., Panchanathan, S., & Aboulnasr, T. (1995). Choice of wavelets for image compression. Information Theory and Applications, 239–249. citeseer.nj.nec.com/138157.html

Mandt, S., Hoffman, M. D., & Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(134), 1–35. https://jmlr.org/papers/v18/17-214.html

Mannor, S., & Shamir, O. (2011). From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems (NeurIPS).

Mannor, S., & Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5.

Mansour, Y., & McAllester, D. (2000). Generalization bounds for decision trees. Proceedings of the Conference on Learning Theory (COLT).

Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O., & Eisenberg, D. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428), 751–753.

Martens, J. (2020). New insights and perspectives on the natural gradient method. The Journal of Machine Learning Research, 21, 5776–5851.

Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. International Conference on Machine Learning, 2408–2417.

Marton, K. (1996). A measure concentration inequality for contracting Markov chains. Geometric and Functional Analysis, 6(3).

Marton, K. (1997). A measure concentration inequality for contracting Markov chains: Erratum. Geometric and Functional Analysis, 7(3).

Masegosa, A. R., Martínez, A. M., Langseth, H., Nielsen, T. D., Salmerón, A., Ramos-López, D., & Madsen, A. L. (2016a). D-VMP: Distributed variational message passing. PGM’2016. JMLR: Workshop and Conference Proceedings, 52, 321–332.

Masegosa, A. R., Martínez, A. M., & Borchani, H. (2016b). Probabilistic graphical models on multi-core CPUs using Java 8. IEEE Computational Intelligence Magazine, 11(2), 41–54.

Masegosa, A. R. (2020). Learning under model misspecification: Applications to variational and ensemble methods. Advances in Neural Information Processing Systems.

Masegosa, A. R., Cabañas, R., Langseth, H., Nielsen, T. D., & Salmerón, A. (2019). Probabilistic models with deep neural networks. arXiv Preprint arXiv:1908.03442.

Masegosa, A. R., Lorenzen, S., Igel, C., & Seldin, Y. (2020). Second order PAC-Bayesian bounds for the weighted majority vote. Advances in Neural Information Processing Systems, 33, 5263–5273.

Masegosa, A. R., Martinez, A. M., Langseth, H., Nielsen, T. D., Salmerón, A., Ramos-López, D., & Madsen, A. L. (2017a). Scaling up Bayesian variational inference using distributed computing clusters. International Journal of Approximate Reasoning, 88, 435–451.

Masegosa, A. R., Martínez, A. M., Ramos-López, D., Cabañas, R., Salmerón, A., Nielsen, T. D., Langseth, H., & Madsen, A. L. (2017b). AMIDST: A Java toolbox for scalable probabilistic machine learning. arXiv Preprint arXiv:1704.01427.

Masegosa, A. R., Nielsen, T. D., Langseth, H., Ramos-López, D., Salmerón, A., & Madsen, A. L. (2017c). Bayesian models of data streams with hierarchical power priors. International Conference on Machine Learning, 2334–2343.

Matthews, A. G. de G., Hensman, J., Turner, R. E., & Ghahramani, Z. (2017a). On the convergence and robustness of sparse variational Gaussian process regression. Advances in Neural Information Processing Systems 30, 2394–2403.

Matthews, A. G. de G., Wilk, M. van der, Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., & Hensman, J. (2017b). GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40), 1–6.

Maurer, A. (2004). A note on the PAC-Bayesian theorem. www.arxiv.org.

Maurer, A., & Pontil, M. (2009). Empirical Bernstein bounds and sample variance penalization. Proceedings of the Conference on Learning Theory (COLT).

McAllester, D. (1999a). PAC-Bayesian model averaging. Proceedings of the Conference on Learning Theory (COLT).

McAllester, D. (1999b). Some PAC-Bayesian theorems. Machine Learning, 37.

McAllester, D. (2003a). PAC-Bayesian stochastic model selection. Machine Learning, 51.

McAllester, D. (2003b). Simplified PAC-Bayesian margin bounds. Proceedings of the Conference on Learning Theory (COLT).

McAllester, D. (2007). Generalization bounds and consistency for structured labeling. In G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, & S. V. N. Vishwanathan (Eds.), Predicting structured data. MIT Press.

McAllester, D. A. (1998). Some PAC-Bayesian theorems. Proceedings of the Conference on Learning Theory (COLT).

McAllester, D. A. (1999c). PAC-Bayesian model averaging. Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 164–170.

McAllister, R., Gal, Y., Kendall, A., Van Der Wilk, M., Shah, A., Cipolla, R., & Weller, A. (2017). Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning.

McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 148–188.

McInerney, J., Ranganath, R., & Blei, D. (2015). The population posterior and Bayesian modeling on streams. In Advances in neural information processing systems 28 (pp. 1153–1161). Curran Associates, Inc.

McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions.

McMahan, H. B., & Streeter, M. (2009). Tighter bounds for multi-armed bandits with expert advice. Proceedings of the Conference on Learning Theory (COLT).

Medasani, S., & Krishnapuram, R. (2001). Categorization of image databases for efficient retrieval using robust mixture decomposition. Computer Vision and Image Understanding: CVIU, 83(3), 216–235. citeseer.nj.nec.com/231614.html

Mei, S., Misiakiewicz, T., & Montanari, A. (2021). Learning with invariances in random features and kernel models. Conference on Learning Theory, 3351–3418.

Meila, M., & Jordan, M. I. (2000). Learning with mixtures of trees. Journal of Machine Learning Research, 1, 1–48.

Mercer, J. (1909a). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Transactions Royal Soc, 209, 415–446.

Mercer, J. (1909b). Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209, 415–446.

Micchelli, C. A., & Pontil, M. (2006a). On learning vector-valued functions. Advances in Neural Information Processing Systems 17, 961–968.

Micchelli, C. A., & Pontil, M. (2006b). Universal kernels. Advances in Neural Information Processing Systems (NeurIPS), 18, 653–660.

Minka, T. P. (2000). Bayesian model averaging is not model combination. Available electronically at http://www.stat.cmu.edu/minka/papers/bma.html, 1–2.

Minka, T. P. (2005). Divergence measures and message passing (MSR-TR-2005-173; p. 17).

Minka, T. P. (2013). Expectation propagation for approximate Bayesian inference. arXiv Preprint arXiv:1301.2294.

Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference (MSR–TR–2001–41). Microsoft Research.

Minsker, S., Srivastava, S., Lin, L., & Dunson, D. B. (2017). Robust and scalable Bayes via a median of subset posterior measures. Journal of Machine Learning Research, 18(124), 1–40.

Mitchell, T. (1997). Machine learning. McGraw-Hill.

Mitzenmacher, M., & Upfal, E. (2005). Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. In NIPS deep learning workshop.

Mnih, V., Szepesvári, C., & Audibert, J.-Y. (2008). Empirical Bernstein stopping. Proceedings of the International Conference on Machine Learning (ICML).

Mohamed, S., Rosca, M., Figurnov, M., & Mnih, A. (2019). Monte Carlo gradient estimation in machine learning. arXiv Preprint arXiv:1906.10652.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of machine learning. MIT Press.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning. MIT Press.

Mucsányi, B., Kirchhof, M., & Oh, S. J. (2024). Benchmarking uncertainty disentanglement: Specialized uncertainties for specialized tasks. ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling.

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.

Nabarro, S., Ganev, S., Garriga-Alonso, A., Fortuin, V., Wilk, M. van der, & Aitchison, L. (2022). Data augmentation in Bayesian neural networks and the cold posterior effect. Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Nagarajan, V. (2021). Explaining generalization in deep learning: Progress and fundamental limits. arXiv Preprint arXiv:2110.08922.

Nagarajan, V., & Kolter, J. Z. (2017). Generalization in deep networks: The role of distance from initialization. NeurIPS Workshop on Deep Learning: Bridging Theory and Practice.

Nagarajan, V., & Kolter, J. Z. (2019a). Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. International Conference on Learning Representations.

Nagarajan, V., & Kolter, J. Z. (2019b). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 11611–11622.

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2020). Deep double descent: Where bigger models and more data hurt. arXiv Preprint arXiv:1912.02292.

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.

Nalisnick, E., Hernández-Lobato, J. M., & Smyth, P. (2019). Dropout as a structured shrinkage prior. International Conference on Machine Learning, 4712–4722.

Namkoong, H., & Duchi, J. C. (2017). Variance-based regularization with convex objectives. Advances in Neural Information Processing Systems, 30.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (CRG–TR–93–1). Department of Computer Science, University of Toronto.

Neal, R. M. (1996). Bayesian learning for neural networks (Vol. 118). Springer Science & Business Media.

Neal, R. M. (2012). Bayesian learning for neural networks (Vol. 118). Springer Science & Business Media.

Negrea, J., Dziugaite, G. K., & Roy, D. (2020). In defense of uniform convergence: Generalization via derandomization with an application to interpolating predictors. International Conference on Machine Learning, 7263–7272.

Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course. Springer.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning. http://ufldl.stanford.edu/housenumbers

Neu, G. (2015). Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems (NeurIPS).

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231.

Neyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro, N. (2017a). Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30.

Neyshabur, B., Bhojanapalli, S., & Srebro, N. (2017b). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. International Conference on Learning Representations.

Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., & Srebro, N. (2019). The role of over-parametrization in generalization of neural networks. International Conference on Learning Representations. https://openreview.net/forum?id=BygfghAcYX

Neyshabur, B., Salakhutdinov, R. R., & Srebro, N. (2015a). Path-SGD: Path-normalized optimization in deep neural networks. Advances in Neural Information Processing Systems, 28.

Neyshabur, B., Tomioka, R., & Srebro, N. (2015b). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv Preprint arXiv:1412.6614.

Neyshabur, B., Tomioka, R., & Srebro, N. (2015c). Norm-based capacity control in neural networks. Proceedings of the 28th International Conference on Learning Theory (COLT), 1376–1401.

Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems (NeurIPS).

Nitzan, S., & Paroush, J. (1982). Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, 23(2).

Nocedal, J., & Wright, S. J. (2006). Numerical optimization (2nd ed.). Springer.

Norgeot, B., Glicksberg, B. S., & Butte, A. J. (2019). A call for deep-learning healthcare. Nature Medicine, 25(1), 14–15.

Novak, R., Sohl-Dickstein, J., & Schoenholz, S. S. (2022). Fast finite width neural tangent kernel. International Conference on Machine Learning, 17018–17044.

Novikov, A., & Izmailov, P. (2018). Tensor train kernel trick. Neural Networks, 104, 1–19.

Olesen, K. G., Lauritzen, S. L., & Jensen, F. V. (1992). AHUGIN: A system creating adaptive causal probabilistic networks. Proceedings of the Eighth International Conference on Uncertainty in Artificial Intelligence, 223–229.

Opper, M., & Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation, 21(3), 786–792. https://doi.org/10.1162/neco.2008.06-08-804

Ortega, L. A., Rodriguez-Santana, S., & Hernández-Lobato, D. (2024a). Variational linearized Laplace approximation for Bayesian deep learning. International Conference on Machine Learning, 38815–38836.

Ortega, L. A., Rodriguez-Santana, S., & Hernández-Lobato, D. (2023). Deep variational implicit processes. International Conference on Learning Representations.

Ortega, L. A., Rodríguez-Santana, S., & Hernández-Lobato, D. (2024b). Fixed-mean Gaussian processes for post-hoc Bayesian deep learning. arXiv Preprint arXiv:2412.04177.

Ortner, R. (2013). Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Annals of Operations Research.

Osawa, K., Swaroop, S., Khan, M. E. E., Jain, A., Eschenhagen, R., Turner, R. E., & Yokota, R. (2019). Practical deep learning with Bayesian principles. Advances in Neural Information Processing Systems, 4289–4301.

Ottucsák, G., & György, A. (2006). The combination of the label efficient and the multi-armed bandit problem in adversarial setting. http://citeseerx.ist.psu.edu/viewdoc/versions?doi=10.1.1.126.1228.

Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., & Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 13969–13980.

Ozkan, E., Smidl, V., Saha, S., Lundquist, C., & Gustafsson, F. (2013). Marginalized adaptive particle filtering for nonlinear models with unknown time-varying noise parameters. Automatica, 49(6), 1566–1575.

Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D., & Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11(12), 4241–4257.

Pang, T., Xu, K., Du, C., Chen, N., & Zhu, J. (2019). Improving adversarial robustness via promoting ensemble diversity. International Conference on Machine Learning, 4970–4979.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation.

Paninski, L. (2004). Variational minimax estimation of discrete distributions under KL loss. Advances in Neural Information Processing Systems (NeurIPS).

Papadimitriou, S., Sun, J., & Faloutsos, C. (2005). Streaming pattern discovery in multiple time-series. Proceedings of the 31st International Conference on Very Large Data Bases, 697–708.

Patel, A. B., Nguyen, M. T., & Baraniuk, R. (2016). A probabilistic framework for deep learning. Advances in Neural Information Processing Systems, 29.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann Publishers.

Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., & Szepesvári, C. (2020). Tighter risk certificates for neural networks. arXiv Preprint arXiv:2007.12911.

Perrone, V., Jenkins, P. A., Spano, D., & Teh, Y. W. (2017). Poisson random fields for dynamic feature models. Journal of Machine Learning Research, 18(127), 1–45.

Petersen, P., & Voigtlaender, F. (2020). Equivalence of approximation by convolutional neural networks and fully-connected networks. Proceedings of the American Mathematical Society, 148(4), 1567–1581.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., & Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. Advances in Neural Information Processing Systems, 29.

Popper, K. (1934). Logik der Forschung.

Puurula, A., Read, J., & Bifet, A. (2014). Kaggle LSHTC4 winning solution. arXiv Preprint arXiv:1405.0546.

Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6, 1939–1959.

Rabiner, L. R. (1989). Tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.

Rabiner, L. R., & Juang, B.-H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4–16.

Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems 20, 1177–1184.

Ralaivola, L., Szafranski, M., & Stempfel, G. (2010). Chromatic PAC-Bayes bounds for non-IID data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research.

Kontorovich, L., & Ramanan, K. (2008). Concentration inequalities for dependent random variables via the martingale method. Annals of Probability, 36(6).

Rasmussen, C. E. (2003). Gaussian processes in machine learning. Summer School on Machine Learning, 63–71.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.

Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of Adam and beyond. arXiv Preprint arXiv:1904.09237.

Reed, M., & Simon, B. (1980). Methods of modern mathematical physics. Vol. I: Functional analysis. Academic Press.

Reed, R., Oh, S., Marks, R., et al. (1992). Regularization using jittered training data. International Joint Conference on Neural Networks, 3, 147–152.

Rényi, A. (1961). On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 4, 547–562.

Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. arXiv Preprint arXiv:1505.05770.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Ritter, H., Botev, A., & Barber, D. (2018). A scalable Laplace approximation for neural networks. International Conference on Learning Representations, 6.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society.

Roca, J. (1995). The mechanisms of DNA topoisomerases. Trends in Biochemical Sciences, 20, 156–160.

Rockafellar, R. T. (1970). Convex analysis. Princeton University Press. https://doi.org/10.1515/9781400873173

Rodriguez-Santana, S., & Hernández-Lobato, D. (2022). Adversarial α-divergence minimization for Bayesian approximate inference. Neurocomputing, 513, 410–421. https://doi.org/10.1016/j.neucom.2022.10.052

Rodríguez Santana, S., Zaldivar, B., & Hernández-Lobato, D. (2021). Sparse implicit processes for approximate inference. arXiv e-Prints, arXiv–2110.

Rodríguez-Santana, S., & Hernández-Lobato, D. (2022). Adversarial α-divergence minimization for Bayesian approximate inference. Neurocomputing, 471, 260–274.

Rohwer, R., & Freitag, D. (2004). Towards full automation of lexicon construction. In D. Moldovan & R. Girju (Eds.), HLT-NAACL 2004: Workshop on computational lexical semantics.

Roli, F., Giacinto, G., & Vernazza, G. (2001). Methods for designing multiple classifier systems. International Workshop on Multiple Classifier Systems, 78–87.

Ron, D., Singer, Y., & Tishby, N. (1995). On the learnability and usage of acyclic probabilistic finite automata. Proc. 8th Annu. Conf. on Comput. Learning Theory, 31–40.

Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Rooij, S. de, Erven, T. van, Grünwald, P. D., & Koolen, W. M. (2014). Follow the leader if you can, hedge if you must. Journal of Machine Learning Research.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression and related optimization problems. Proceedings of the IEEE, 86(11), 2210–2239.

Ross, A., & Doshi-Velez, F. (2018). Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. Proceedings of the AAAI Conference on Artificial Intelligence, 32.

Rubin, D., & Stein, M. (2016). Spatially adaptive Bayesian covariance tapering. International Conference on Artificial Intelligence and Statistics (AISTATS), 650–658.

Ruddigkeit, L., Van Deursen, R., Blum, L. C., & Reymond, J.-L. (2012). Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11), 2864–2875.

Rudin, W. (1991). Functional analysis (2nd ed.). McGraw–Hill.

Rui, Y., Huang, T., & Chang, S. (1999). Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10(4), 39–62.

Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parametrized bandits. Mathematics of Operations Research, 35.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y

Sabato, S., & Shalev-Shwartz, S. (2007). Prediction by categorical features: Generalization properties and application to feature ranking. Proceedings of the Conference on Learning Theory (COLT).

Sabato, S., & Shalev-Shwartz, S. (2008). Ranking categorical features using generalization properties. Journal of Machine Learning Research, 9.

Sajda, P. (2006). Machine learning for detection and diagnosis of disease. Annu. Rev. Biomed. Eng., 8, 537–565.

Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. Proceedings of the International Conference on Machine Learning (ICML).

Salimbeni, H., & Deisenroth, M. (2017a). Doubly stochastic variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems, 30.

Salimbeni, H., & Deisenroth, M. P. (2017b). Doubly stochastic variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems 30, 4588–4599.

Samson, P.-M. (2000). Concentration of measure inequalities for Markov chains and Φ-mixing processes. The Annals of Probability, 28(1).

Sannai, A., Polyanskiy, Y., & Watanabe, Y. (2019). Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels. 2019 IEEE International Symposium on Information Theory (ISIT), 447–451.

Santana, S. R., Zaldivar, B., & Hernández-Lobato, D. (2021). Sparse implicit processes for approximate inference. arXiv Preprint arXiv:2110.07618.

Sato, M. A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7), 1649–1681.

Saul, L. K., & Jordan, M. I. (1999). Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37(1), 75–87.

Scannell, A., Mereu, R., Chang, P., Tamir, E., Pajarinen, J., & Solin, A. (2024). Function-space parameterization of neural networks for sequential learning. International Conference on Learning Representations.

Schaeffer, S. E. (2007). Graph clustering. Computer Science Review.

Schoenholz, S. S., Gilmer, J., Ganguli, S., & Sohl-Dickstein, J. (2017). Deep information propagation. International Conference on Learning Representations. https://openreview.net/forum?id=H1W1UN9gg

Schölkopf, B., & Smola, A. (2002). Learning with kernels. Support vector machines, regularization, optimization and beyond. MIT Press.

Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3, 233–269.

Seeger, M. (2003a). Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. University of Edinburgh.

Seeger, M. (2003b). Bayesian Gaussian process models: PAC-Bayesian generalization error bounds and sparse approximations [PhD thesis]. University of Edinburgh.

Seeger, M. (2003c). Fast forward selection to speed up sparse Gaussian process regression. Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics.

Segal, E., Pe’er, D., Regev, A., Koller, D., & Friedman, N. (2005). Learning module networks. Journal of Machine Learning Research.

Seldin, Y. (2001). On unsupervised learning of mixtures of Markovian sources [Master’s thesis]. The Hebrew University of Jerusalem.

Seldin, Y. (2005a). 3D-3R: 3D Content Rating, Ranking, and Recording (White Paper IA-R611). NDS Technologies Ltd.

Seldin, Y. (2005b). Personalized navigation in digital TV world. Content filtering based on personalized ratings [Unpublished manuscript].

Seldin, Y. (2009). A PAC-Bayesian approach to structure learning [PhD thesis]. The Hebrew University of Jerusalem.

Seldin, Y. (2010). A PAC-Bayesian analysis of graph clustering and pairwise clustering. http://arxiv.org/abs/1009.0499.

Seldin, Y. (2015). The space of online learning problems. ECML-PKDD Tutorial. https://sites.google.com/site/spaceofonlinelearningproblems/.

Seldin, Y., Auer, P., Abbasi-Yadkori, Y., & Szepesvári, C. (2012a). Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. Proceedings of the European Workshop on Reinforcement Learning (EWRL).

Seldin, Y., Auer, P., Laviolette, F., Shawe-Taylor, J., & Ortner, R. (2011a). PAC-Bayesian analysis of contextual bandits. Advances in Neural Information Processing Systems (NeurIPS).

Seldin, Y., Bartlett, P. L., & Crammer, K. (2013a). Advice-efficient prediction with expert advice. http://arxiv.org/abs/1304.3708.

Seldin, Y., Bartlett, P. L., Crammer, K., & Abbasi-Yadkori, Y. (2014). Prediction with limited advice and multiarmed bandits with paid observations. Proceedings of the International Conference on Machine Learning (ICML).

Seldin, Y., Bejerano, G., & Tishby, N. (2001a). Unsupervised segmentation and classification of mixtures of Markovian sources. Proceedings of the 33rd Symposium on the Interface of Computing Science and Statistics.

Seldin, Y., Bejerano, G., & Tishby, N. (2001b). Unsupervised sequence segmentation by a mixture of switching variable memory Markov sources. Proceedings of the International Conference on Machine Learning (ICML).

Seldin, Y., Cesa-Bianchi, N., Auer, P., Laviolette, F., & Shawe-Taylor, J. (2012b). PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. Journal of Machine Learning Research Workshop and Conference Proceedings, 26.

Seldin, Y., Cesa-Bianchi, N., Laviolette, F., Auer, P., Shawe-Taylor, J., & Peters, J. (2011b). PAC-Bayesian analysis of the exploration-exploitation trade-off. Online Trading of Exploration and Exploitation 2, ICML Workshop.

Seldin, Y., Crammer, K., & Bartlett, P. L. (2013b). Open problem: Adversarial multiarmed bandits with limited advice. Proceedings of the Conference on Learning Theory (COLT).

Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., & Auer, P. (2012c). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58.

Seldin, Y., Laviolette, F., Shawe-Taylor, J., Peters, J., & Auer, P. (2011c). PAC-Bayesian analysis of martingales and multiarmed bandits. http://arxiv.org/abs/1105.2416.

Seldin, Y., & Lugosi, G. (2016). A lower bound for multi-armed bandits with expert advice. Proceedings of the European Workshop on Reinforcement Learning (EWRL).

Seldin, Y., & Lugosi, G. (2017). An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. Proceedings of the Conference on Learning Theory (COLT).

Seldin, Y., & Schölkopf, B. (2013). On the relations and differences between Popper dimension, exclusion dimension and VC-dimension. In B. Schölkopf, Z. Luo, & V. Vovk (Eds.), Empirical inference – Festschrift in honor of Vladimir N. Vapnik. Springer.

Seldin, Y., & Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. Proceedings of the International Conference on Machine Learning (ICML).

Seldin, Y., Slonim, N., & Tishby, N. (2007). Information bottleneck for non co-occurrence data. Advances in Neural Information Processing Systems (NeurIPS).

Seldin, Y., Starik, S., & Werman, M. (2003). Unsupervised clustering of images using their joint segmentation. The 3rd International Workshop on Statistical and Computational Theories of Vision (SCTV).

Seldin, Y., Szepesvári, C., Auer, P., & Abbasi-Yadkori, Y. (2013c). Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. Journal of Machine Learning Research Workshop and Conference Proceedings, 24 (EWRL).

Seldin, Y., & Tishby, N. (2008). Multi-classification by categorical features via clustering. Proceedings of the International Conference on Machine Learning (ICML).

Seldin, Y., & Tishby, N. (2009a). PAC-Bayesian generalization bound for density estimation with application to co-clustering. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Seldin, Y., & Tishby, N. (2009b). PAC-Bayesian generalization bound for density estimation with application to co-clustering. Artificial Intelligence and Statistics, 472–479.

Seldin, Y., & Tishby, N. (2010). PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11.

Shafiei, M. M., & Milios, E. E. (2006). Model-based overlapping co-clustering. Proceedings of the SIAM Conference on Data Mining.

Shalaeva, V., Esfahani, A. F., Germain, P., & Petreczky, M. (2019). Improved PAC-Bayesian bounds for linear regression. arXiv Preprint arXiv:1912.03036.

Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2).

Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.

Shalev-Shwartz, S., & Birnbaum, A. (2012). Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. Advances in Neural Information Processing Systems (NeurIPS).

Shalev-Shwartz, S., Shamir, O., & Tromer, E. (2012). Using more data to speed-up training time. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Shalev-Shwartz, S., & Srebro, N. (2008). SVM optimization: Inverse dependence on training set size. Proceedings of the International Conference on Machine Learning (ICML).

Shamir, O., Sabato, S., & Tishby, N. (2008). Learning and generalization with the information bottleneck. Proceedings of the International Symposium on AI and Mathematics (ISAIM).

Shamir, O., & Tishby, N. (2008a). Cluster stability for finite samples. Advances in Neural Information Processing Systems (NeurIPS).

Shamir, O., & Tishby, N. (2008b). Model selection and stability in k-means clustering. Proceedings of the Conference on Learning Theory (COLT).

Shamir, O., & Tishby, N. (2009). On the reliability of clustering stability in the large sample regime. Advances in Neural Information Processing Systems (NeurIPS).

Shan, H., & Banerjee, A. (2008). Bayesian co-clustering. IEEE International Conference on Data Mining (ICDM).

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. Journal, 27, 379–423, 623–656.

Shashua, A., Zass, R., & Hazan, T. (2006). Multi-way clustering using super-symmetric non-negative tensor factorization. European Conference on Computer Vision (ECCV).

Shawe-Taylor, J., Archambeau, C., Higgs, M., & Opper, M. (2009). PAC-Bayes analysis of Bayesian inference.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998a). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5).

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998b). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.

Shawe-Taylor, J., Cristianini, N., et al. (2004). Kernel methods for pattern analysis. Cambridge University Press.

Shawe-Taylor, J., & Dolia, A. (2007). A framework for probability density estimation. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Shawe-Taylor, J., & Hardoon, D. (2009). PAC-Bayes analysis of maximum entropy classification. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Shawe-Taylor, J., & Williamson, R. C. (1997). A PAC analysis of a Bayesian estimator. Proceedings of the Conference on Learning Theory (COLT).

Shen, R., Bubeck, S., & Gunasekar, S. (2022). Data augmentation as feature manipulation: A story of desert cows and grass cows. arXiv Preprint arXiv:2203.01572.

Sheth, R., & Khardon, R. (2020). Pseudo-Bayesian learning via direct loss minimization with applications to sparse Gaussian process models. Symposium on Advances in Approximate Bayesian Inference, 1–18.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8).

Shi, J., Sun, S., & Zhu, J. (2018). A spectral approach to gradient estimation for implicit distributions. International Conference on Machine Learning, 4644–4653.

Shi, T., & Zhu, J. (2014). Online Bayesian passive-aggressive learning. Proceedings of the International Conference on Machine Learning (ICML), 378–386.

Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48.

Silva, P. R. da. (2006). An introduction to measure theory. Springer.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Driessche, G. van den, Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G. van den, Graepel, T., & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550.

Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. ICDAR, 3, 958–962.

Simon, B. (2005). Trace ideals and their applications. American Mathematical Soc.

Singer, Y. (1997). Adaptive mixtures of probabilistic transducers. Neural Computation, 9(8), 1711–1733.

Singer, Y., & Tishby, N. (1993). Decoding cursive scripts. Advances in Neural Information Processing Systems (NeurIPS).

Singh, P. N. (2021). Better application of Bayesian deep learning to diagnose disease. 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), 928–934.

Krishnamachari, S., & Abdel-Mottaleb, M. (1999). Hierarchical clustering algorithm for fast image retrieval. IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases VII, 427–435.

Slonim, N. (2002). The information bottleneck: Theory and applications [PhD thesis]. The Hebrew University of Jerusalem.

Slonim, N., Atwal, G. S., Tkačik, G., & Bialek, W. (2005). Information-based clustering. Proceedings of the National Academy of Sciences, 102(51).

Slonim, N., Fine, S., & Tishby, N. (2001, January). Discriminative variable memory Markov model for feature selection. Submitted to ICML 2001.

Slonim, N., Friedman, N., & Tishby, N. (2002). Unsupervised document classification using sequential information maximization. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Slonim, N., Friedman, N., & Tishby, N. (2006). Multivariate information bottleneck. Neural Computation, 18.

Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Slonim, N., & Weiss, Y. (2002). Maximum likelihood and the information bottleneck. Advances in Neural Information Processing Systems (NeurIPS).

Smith, S. L., & Le, Q. V. (2017). A Bayesian perspective on generalization and stochastic gradient descent. https://arxiv.org/abs/1710.06451

Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proceedings of the 17th International Conference on Machine Learning, 911–918.

Smolkin, M., & Ghosh, D. (2003). Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics, 36(4).

Snelson, E., & Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems 18, 1257–1264.

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 25, 2951–2959.

Snoek, J., Ovadia, Y., Fertig, E., Lakshminarayanan, B., Nowozin, S., Sculley, D., Dillon, J., Ren, J., & Nado, Z. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 13969–13980.

Sokolic, J., Giryes, R., Sapiro, G., & Rodrigues, M. (2017). Generalization error of invariant classifiers. Artificial Intelligence and Statistics, 1094–1103.

Solomonoff, R. J. (1960). A preliminary report on a general theory of inductive inference. Zator Company, Cambridge, MA.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1), 2822–2878.

Sprinzak, E. (2004). Studying interacting proteins by computational approaches [PhD thesis]. The Hebrew University of Jerusalem.

Srebro, N. (2004). Learning with matrix factorizations [PhD thesis]. MIT.

Srebro, N., Alon, N., & Jaakkola, T. S. (2005a). Generalization error bounds for collaborative prediction with low-rank matrices. Advances in Neural Information Processing Systems (NeurIPS).

Srebro, N., Rennie, J., & Jaakkola, T. (2005b). Maximum margin matrix factorization. Advances in Neural Information Processing Systems (NeurIPS).

Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. http://arxiv.org/abs/0912.3995.

Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. Proceedings of the International Conference on Machine Learning (ICML).

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.

Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32, 323–332.

Stein, M. L. (1999). Interpolation of spatial data: Some theory for kriging. Springer.

Steinwart, I., & Christmann, A. (2008). Support vector machines.

Steyvers, M., & Griffiths, T. (2006). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Latent semantic analysis: A road to meaning. Lawrence Erlbaum.

Stoltz, G. (2005). Incomplete information and internal regret in prediction of individual sequences [PhD thesis]. Université Paris-Sud.

Strang, G. (2009). Introduction to linear algebra (4th ed.). Wellesley-Cambridge Press.

Strehl, A. L., Li, L., & Littman, M. L. (2009). Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research.

Strehl, A. L., Mesterharm, C., Littman, M. L., & Hirsh, H. (2006). Experience-efficient learning in associative bandit problems. Proceedings of the International Conference on Machine Learning (ICML).

Stroock, D. W. (2010). Probability theory: An analytic view. Cambridge University Press.

Stuart, E. T., Kioussi, C., & Gruss, P. (1994). Mammalian Pax genes. Annu. Rev. Genet., 28, 219–236.

Subramanian, V., Arya, R., & Sahai, A. (2022). Generalization for multiclass classification with overparameterized linear models. Advances in Neural Information Processing Systems, 35, 23479–23494.

Sun, S., Zhang, G., Shi, J., & Grosse, R. (2019). Functional variational Bayesian neural networks. International Conference on Learning Representations.

Sutskever, I., & Hinton, G. E. (2008). Deep, narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11), 2629–2636.

Sutskever, I., Salakhutdinov, R., & Tenenbaum, J. B. (2009). Modelling relational data using Bayesian clustered tensor factorization. Advances in Neural Information Processing Systems (NeurIPS).

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv Preprint arXiv:1312.6199.

Takamura, H., & Matsumoto, Y. (2003). Co-clustering for text categorization. Information Processing Society of Japan Journal.

Tang, E. K., Suganthan, P. N., & Yao, X. (2006). An analysis of diversity measures. Machine Learning, 65(1), 247–271.

Tang, L., Hanka, R., Ip, H., Cheung, K., & Lam, R. (2000). Semantic query processing and annotation generation for content-based retrieval of histological images. SPIE EMedical Imaging 2000, Document Recognition and Retrieval IX.

Taskar, B., Abbeel, P., Wong, M.-F., & Koller, D. (2007). Relational Markov networks. In L. Getoor & B. Taskar (Eds.), Introduction to statistical relational learning. MIT Press.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Advances in Neural Information Processing Systems (NeurIPS).

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2004). Hierarchical Dirichlet processes (No. 653). Department of Statistics, University of California, Berkeley.

Teh, Y. W., Jordan, M., Beal, M., & Blei, D. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101.

Thiemann, N. (2016). PAC-Bayesian ensemble learning [Master’s thesis]. University of Copenhagen.

Thiemann, N., Igel, C., & Seldin, Y. (2016). PAC-Bayesian aggregation without cross-validation. http://arxiv.org/abs/1608.05610.

Thiemann, N., Igel, C., Wintenberger, O., & Seldin, Y. (2017a). A strongly quasiconvex PAC-Bayesian bound. Proceedings of the International Conference on Algorithmic Learning Theory (ALT).

Thiemann, N., Igel, C., Wintenberger, O., & Seldin, Y. (2017b). A strongly quasiconvex PAC-Bayesian bound. International Conference on Algorithmic Learning Theory, 466–492.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25.

Thune, T. S., & Seldin, Y. (2018). Adaptation to easy data in prediction with limited advice. Advances in Neural Information Processing Systems (NeurIPS).

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.

Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. Allerton Conference on Communication, Control and Computation.

Tishby, N., & Polani, D. (2010). Information theory of decisions and actions. In V. Cutsuridis, A. Hussain, J. G. Taylor, & D. Polani (Eds.), Perception-reason-action cycle: Models, algorithms and systems. Springer.

Tishby, N., & Slonim, N. (2000). Data clustering by Markovian relaxation and the information bottleneck method. Advances in Neural Information Processing Systems (NeurIPS).

Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. 2015 IEEE Information Theory Workshop (ITW), 1–5.

Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. Artificial Intelligence and Statistics, 567–574.

Titsias, M., & Lawrence, N. D. (2010). Bayesian Gaussian process latent variable model. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 844–851.

Tolstikhin, I. O., & Seldin, Y. (2013a). PAC-Bayes-Empirical-Bernstein inequality. Advances in Neural Information Processing Systems, 109–117.

Tolstikhin, I., & Seldin, Y. (2013b). PAC-Bayes-Empirical-Bernstein inequality. Advances in Neural Information Processing Systems (NeurIPS).

Touchette, H. (2009). The large deviation approach to statistical mechanics. Physics Reports, 478(1-3), 1–69.

Tran-Thanh, L., Stavrogiannis, L., Naroditskiy, V., Robu, V., Jennings, N. R., & Key, P. (2014). Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Triantafyllopoulos, K. (2009). Inference of dynamic generalized linear models: On-line computation and appraisal. International Statistical Review, 77(3), 430–450.

Vakhania, N. N., Tarieladze, V. I., & Chobanyan, S. A. (1987). Probability distributions on Banach spaces. Reidel.

Valentini, G., & Dietterich, T. G. (2003). Low bias bagged support vector machines. Proceedings of the International Conference on Machine Learning (ICML).

Valiant, L. G. (1984). A theory of the learnable. Communications of the Association for Computing Machinery, 27.

Van Neerven, J. et al. (2010). γ-radonifying operators—a survey. The AMSI-ANU Workshop on Spectral Theory and Harmonic Analysis, 44, 1–61.

Vapnik, V. (1992). Principles of risk minimization for learning theory. Advances in Neural Information Processing Systems, 831–838.

Vapnik, V. N. (1995). The nature of statistical learning theory. Springer-Verlag New York, Inc.

Vapnik, V. N. (1998b). Statistical learning theory. Wiley.

Vapnik, V. N. (1998a). Statistical learning theory. John Wiley & Sons.

Vapnik, V. N. (1998c). Statistical learning theory. Wiley.

Vapnik, V. N., & Chervonenkis, A. Y. (2015). On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity: Festschrift for Alexey Chervonenkis (pp. 11–30). Springer.

Vapnik, V. N., & Chervonenkis, A. Ya. (1968). On the uniform convergence of relative frequencies of events to their probabilities. Soviet Math. Dokl., 9.

Vapnik, V. N., & Chervonenkis, A. Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2).

Vapnik, V. N., & Chervonenkis, A. Ya. (1974). Theory of pattern recognition. Nauka, Moscow (in Russian).

Vapnik, V. N., & Chervonenkis, A. Ya. (1981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and Its Applications, 26(3), 532–553.

Varga, D., Csiszárik, A., & Zombori, Z. (2017). Gradient regularization improves accuracy of discriminative models. arXiv Preprint arXiv:1712.09936.

Varshney, K. R., & Alemzadeh, H. (2017). On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big Data, 5(3), 246–255.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008.

Vidal, R., Bruna, J., Giryes, R., & Soatto, S. (2020). Mathematics of deep learning. Mexican Conference on Pattern Recognition.

Villacampa-Calvo, C., & Hernández-Lobato, D. (2020). Alpha divergence minimization in multi-class Gaussian process classification. Neurocomputing, 378, 210–227.

Virmaux, A., & Scaman, K. (2018). Lipschitz regularity of deep neural networks: Analysis and efficient estimation. Advances in Neural Information Processing Systems, 31.

Vovk, V. (1990). Aggregating strategies. Proceedings of the Conference on Learning Theory (COLT).

Wainwright, M. J., Jordan, M. I., et al. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1–305.

Walker, S. G. (2013). Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10), 1621–1633.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. The Computer Journal, 11(2), 185–195.

Wang, C., & Blei, D. M. (2018). A general method for robust Bayesian modeling. Bayesian Analysis, 13(4), 1163–1191.

Wang, J. Z., Wiederhold, G., Firschein, O., & Wei, S. X. (1997). Content-based image indexing and searching using Daubechies’ wavelets. Int. J. on Digital Libraries, 1(4), 311–328. citeseer.nj.nec.com/wang98contentbased.html

Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1290–1297.

Wang, P., Domeniconi, C., & Laskey, K. B. (2009). Latent Dirichlet Bayesian co-clustering. Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD).

Wang, Y., & Blei, D. (2019). Variational Bayes under model misspecification. Advances in Neural Information Processing Systems, 13357–13367.

Wang, Y., Kucukelbir, A., & Blei, D. M. (2017). Robust probabilistic modeling with Bayesian data reweighting. International Conference on Machine Learning, 3646–3655.

Wang, Y., Sonthalia, R., & Hu, W. (2024). Near-interpolators: Rapid norm growth and the trade-off between interpolation and generalization. International Conference on Artificial Intelligence and Statistics, 4483–4491.

Wei, C.-Y., & Luo, H. (2018). More adaptive algorithms for adversarial bandits. Proceedings of the Conference on Learning Theory (COLT).

Wei, Y., Sheth, R., & Khardon, R. (2020). Direct loss minimization for sparse Gaussian processes. arXiv Preprint arXiv:2004.03083.

Wen, Y., Tran, D., & Ba, J. (2019). BatchEnsemble: An alternative approach to efficient ensemble and lifelong learning. International Conference on Learning Representations.

Wen, Y., Vicol, P., Ba, J., Tran, D., & Grosse, R. (2018). Flipout: Efficient pseudo-independent weight perturbations on mini-batches. International Conference on Learning Representations. https://openreview.net/forum?id=rJNpifWAb

Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., & Nowozin, S. (2020a). How good is the Bayes posterior in deep neural networks really? International Conference on Machine Learning, 10248–10259.

Wenzel, F., Snoek, J., Tran, D., & Jenatton, R. (2020b). Hyperparameter ensembles for robustness and uncertainty quantification. arXiv Preprint arXiv:2006.13570.

Wiatowski, T., & Bölcskei, H. (2017). A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64(3), 1845–1866.

Willems, F. M. J. (1998). The context-tree weighting method: Extensions. IEEE Transactions on Information Theory, 792–798.

Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J. (1994). Context weighting for general finite context sources. IEEE Transactions on Information Theory.

Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J. (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3).

Williamson, S., Orbanz, P., & Ghahramani, Z. (2010a). Dependent Indian buffet processes. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 924–931.

Williamson, S., Wang, C., Heller, K., & Blei, D. (2010b). The IBP compound Dirichlet process and its application to focused topic modeling. Proceedings of ICML.

Wilson, A. G., & Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML), 1775–1784.

Wilson, A. G. (2020). The case for Bayesian deep learning. arXiv Preprint arXiv:2001.10995.

Wilson, A. G., & Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. arXiv Preprint arXiv:2002.08791.

Winn, J. M., & Bishop, C. M. (2005). Variational message passing. Journal of Machine Learning Research, 6, 661–694.

Wintenberger, O. (2017). Optimal learning with Bernstein online aggregation. Machine Learning, 106.

Witten, D. M., & Tibshirani, R. (2009). Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 615–636.

Wood, J., & Shawe-Taylor, J. (1996). Representation theory and invariant neural networks. Discrete Applied Mathematics, 69(1-2), 33–60.

Wu, H., & Liu, X. (2016). Double Thompson sampling for dueling bandits. Advances in Neural Information Processing Systems (NeurIPS).

Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv Preprint arXiv:1708.07747.

Xie, C., Ye, H., Chen, F., Liu, Y., Sun, R., & Li, Z. (2020). Risk variance penalization. arXiv Preprint arXiv:2006.07544.

Xu, A., & Raginsky, M. (2017). Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems, 30.

Xu, R., & Wunsch, D., II. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3).

Yakowitz, S. J., & Spragins, J. D. (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39, 209–214.

Yang, J., Sun, S., & Roy, D. M. (2019). Fast-rate PAC-Bayes generalization bounds via shifted Rademacher processes. Advances in Neural Information Processing Systems, 10802–10812.

Yao, L., Mimno, D., & McCallum, A. (2009). Efficient methods for topic model inference on streaming document collections. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 937–946.

Ykhlef, H., & Bouchaffra, D. (2017). An efficient ensemble pruning approach based on simple coalitional games. Information Fusion, 34, 28–42.

Yom-Tov, E., & Slonim, N. (2009). Parallel pairwise clustering. SIAM International Conference on Data Mining (SDM).

Yona, G. (1999). Methods for global organization of all known protein sequences [PhD thesis]. The Hebrew University of Jerusalem.

Yoo, J., & Choi, S. (2009a). Probabilistic matrix tri-factorization. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Yoo, J., & Choi, S. (2009b). Weighted nonnegative matrix co-tri-factorization for collaborative prediction. Proceedings of the Asian Conference on Machine Learning (ACML).

Yu, H., Chen, Y., Low, B. K. H., Jaillet, P., & Dai, Z. (2019). Implicit posterior variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems, 32, 14475–14486.

Yu, Y., Li, Y.-F., & Zhou, Z.-H. (2011). Diversity regularized machine. Twenty-Second International Joint Conference on Artificial Intelligence.

Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2012). The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78.

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. European Conference on Computer Vision, 818–833.

Zhang, C., Butepage, J., Kjellstrom, H., & Mandt, S. (2018). Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. Proceedings of the International Conference on Learning Representations (ICLR).

Zhang, R., Li, C., Zhang, J., Chen, C., & Wilson, A. G. (2019). Cyclical stochastic gradient MCMC for Bayesian deep learning. International Conference on Learning Representations.

Zhang, T. et al. (2006). From epsilon-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5), 2180–2210.

Zhang, T. (2006). Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4), 1307–1321.

Zheng, Y., Li, Q., Chen, Y., Xie, X., & Ma, W.-Y. (2008). Understanding mobility based on GPS data. Proceedings of the 10th International Conference on Ubiquitous Computing, UbiComp ’08, 312–321. https://doi.org/10.1145/1409635.1409677

Zheng, Y., Xie, X., & Ma, W.-Y. (2010). GeoLife: A collaborative social networking service among user, location and trajectory. IEEE Data Eng. Bull., 33(2), 32–39.

Zheng, Y., Zhang, L., Xie, X., & Ma, W.-Y. (2009). Mining interesting locations and travel sequences from GPS trajectories. Proceedings of the 18th International Conference on World Wide Web, WWW ’09, 791–800. https://doi.org/10.1145/1526709.1526816

Zhou, W., Veitch, V., Austern, M., Adams, R. P., & Orbanz, P. (2019). Non-vacuous generalization bounds at the ImageNet scale: A PAC-Bayesian compression approach. International Conference on Learning Representations. https://openreview.net/forum?id=BJgqqsAct7

Zhou, X., Xie, L., Zhang, P., & Zhang, Y. (2014). An ensemble of deep neural networks for object tracking. 2014 IEEE International Conference on Image Processing (ICIP), 843–847.

Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms. CRC Press.

Zhou, Z.-H., & Li, N. (2010). Multi-information ensemble diversity. International Workshop on Multiple Classifier Systems, 134–144.

Zhu, H., & Rohwer, R. (1995a). Information geometric measurements of generalisation.

Zhu, M. (2015). Use of majority votes in statistical learning. WIREs Computational Statistics, 7.

Zhu, S., & Rohwer, R. (1995b). Information geometry and prior construction. Entropy, 1, 3–22.

Zhu, S., An, B., & Huang, F. (2021a). Understanding the generalization benefit of model invariance from a data perspective. Advances in Neural Information Processing Systems, 34, 4328–4341.

Zhu, Z. A., Liu, Y., Li, Y., Li, M., Lin, W., Hong, M., & Jordan, M. I. (2021b). A geometric perspective on the transferability of adversarial directions. Advances in Neural Information Processing Systems, 34.

Zimmert, J., Luo, H., & Wei, C.-Y. (2019). Beating stochastic and adversarial semi-bandits optimally and simultaneously. Proceedings of the International Conference on Machine Learning (ICML).

Zimmert, J., & Seldin, Y. (2019). An optimal algorithm for stochastic and adversarial bandits. Proceedings on the International Conference on Artificial Intelligence and Statistics (AISTATS).

Zoghi, M., Karnin, Z., Whiteson, S., & Rijke, M. de. (2015). Copeland dueling bandits. Advances in Neural Information Processing Systems (NeurIPS).

Zoghi, M., Whiteson, S., Munos, R., & Rijke, M. de. (2014). Relative upper confidence bound for the K-armed dueling bandit problem. Proceedings of the International Conference on Machine Learning (ICML).

Zolghadr, N., Bartók, G., Greiner, R., György, A., & Szepesvári, C. (2013). Online learning with costly features and labels. Advances in Neural Information Processing Systems (NeurIPS).

Zou, D., Wu, J., Braverman, V., Gu, Q., Foster, D. P., & Kakade, S. (2021). The benefits of implicit regularization from SGD in least squares problems. Advances in Neural Information Processing Systems.