Bibliography
Abbasi-Yadkori, Y., Bartlett, P. L., Kanade, V., Seldin, Y.,
& Szepesvári, C. (2013). Online learning in Markov decision
processes with adversarially chosen transition probability
distributions. Advances in Neural Information Processing
Systems (NeurIPS).
Abbasi-Yadkori, Y., Bartlett, P., Gabillon, V., Malek, A., &
Valko, M. (2018). Best of both worlds: Stochastic &
adversarial best-arm identification. Proceedings of the
Conference on Learning Theory (COLT).
Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011).
Improved algorithms for linear stochastic bandits. Advances
in Neural Information Processing Systems (NeurIPS).
Abbasi-Yadkori, Y., & Szepesvári, C. (2011). Regret bounds
for the adaptive control of linear quadratic systems.
Proceedings of the Conference on Learning Theory
(COLT).
Abbeel, P., Koller, D., & Ng, A. Y. (2006). Learning factor
graphs in polynomial time and sample complexity. Journal of
Machine Learning Research.
Abernethy, J., Hazan, E., & Rakhlin, A. (2008). Competing in
the dark: An efficient algorithm for bandit linear optimization.
Proceedings of the Conference on Learning Theory
(COLT).
Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012).
Learning from data. AMLbook.
Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2015).
Learning from data. Dynamic e-chapters. AMLbook.
Adams, R. P., & MacKay, D. J. C. (2007).
Bayesian online changepoint detection. arXiv
Preprint arXiv:0710.3742.
Adi, Y., Schwing, A., & Hazan, T. (2020).
PAC-Bayesian neural network
bounds. https://openreview.net/forum?id=HkgR8erKwB
Agarwal, A., Dudík, M., Kale, S., Langford, J., & Schapire,
R. E. (2012). Contextual bandit learning with predictable
rewards. Proceedings of the International Conference on
Artificial Intelligence and Statistics (AISTATS).
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., &
Schapire, R. E. (2014). Taming the monster: A fast and simple
algorithm for contextual bandits. Proceedings of the
International Conference on Machine Learning (ICML).
Agarwal, A., Krishnamurthy, A., Langford, J., Luo, H., &
Schapire, R. E. (2017a). Open problem: First-order regret bounds
for contextual bandits. Proceedings of the Conference on
Learning Theory (COLT).
Agarwal, A., Luo, H., Neyshabur, B., & Schapire, R. E.
(2017b). Corralling a band of bandit algorithms. Proceedings
of the Conference on Learning Theory (COLT).
Aggarwal, C. C. (2007). Data streams: Models and
algorithms (Vol. 31). Springer Science &
Business Media.
Aggarwal, C. C. (2013). Managing and mining sensor
data. Springer Science & Business Media.
Ahissar, M., & Hochstein, S. (2004). The reverse hierarchy
theory of visual perceptual learning. Trends in Cognitive
Sciences, 8(10), 457–464.
Ahmed, A., Ho, Q., Teo, C. H., Eisenstein, J., Smola, A. J.,
& Xing, E. P. (2011). Online inference for the infinite
topic-cluster model: Storylines from streaming text.
Proceedings of the International Conference on Artificial
Intelligence and Statistics (AISTATS), 101–109.
Ailon, N., Karnin, Z., & Joachims, T. (2014). Reducing
dueling bandits to cardinal bandits. Proceedings of the
International Conference on Machine Learning (ICML).
Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., &
Watson, J. D. (1994). Molecular biology of the cell
(3rd ed.). Garland Publishing.
Allenby, G. M., & Rossi, P. E. (1998). Marketing models of
consumer heterogeneity. Journal of Econometrics,
89(1-2), 57–78.
Alon, N., Cesa-Bianchi, N., Gentile, C., & Mansour, Y.
(2013). From bandits to experts: A tale of domination and
independence. Advances in Neural Information Processing
Systems (NeurIPS).
Alquier, P. (2024). User-friendly introduction to PAC-Bayes
bounds. Foundations and Trends in Machine Learning,
17(2), 174–303. https://doi.org/10.1561/2200000100
Alquier, P., & Guedj, B. (2018). Simpler
PAC-Bayesian bounds for hostile data. Machine
Learning, 107(5), 887–902.
Alquier, P., Ridgway, J., & Chopin, N. (2016). On the
properties of variational approximations of Gibbs
posteriors. The Journal of Machine Learning Research,
17(1), 8374–8414.
Alter, O., Brown, P. O., & Botstein, D. (2003). Generalized
singular value decomposition for comparative analysis of
genome-scale expression data sets of two different organisms.
Proceedings of the National Academy of Sciences.
Álvarez, M. A., & Lawrence, N. D. (2011). Computationally
efficient convolved multiple output Gaussian
processes. Journal of Machine Learning Research,
12, 1459–1500.
Ambroladze, A., Parrado-Hernández, E., & Shawe-Taylor, J.
(2007). Tighter PAC-Bayes bounds. Advances in
Neural Information Processing Systems (NeurIPS).
Aminikhanghahi, S., & Cook, D. J. (2017). A survey of
methods for time series change point detection. Knowledge
and Information Systems, 51(2), 339–367.
Angluin, D. (2004). Queries revisited. Theoretical Computer
Science, 313.
Anjos, O., Iglesias, C., Peres, F., Martínez, J., Garcia, A.,
& Taboada, J. (2015). Neural networks applied to
discriminate botanical origin of honeys. Food
Chemistry, 175, 128–136.
Anthony, M., & Bartlett, P. L. (1999). Neural network
learning: Theoretical foundations. Cambridge University
Press.
Antorán, J., Padhy, S., Barbano, R., Nalisnick, E. T., Janz, D.,
& Hernández-Lobato, J. M. (2023). Sampling-based inference
for large linear models, with application to linearised
Laplace. International Conference on Learning
Representations.
Antos, A., & Kontoyiannis, I. (2001). Convergence properties
of functional estimates for discrete distributions. Random
Structures and Algorithms, 19(3-4).
Apostolico, A., & Bejerano, G. (2000). Optimal amnesic
probabilistic automata or how to learn and classify proteins in
linear time and space. Journal of Computational Biology, 7(3), 381–393.
Arimoto, S. (1972). An algorithm for computing the capacity of
discrete memoryless channel. IEEE Transactions on
Information Theory, 18.
Aronszajn, N. (1950). Theory of reproducing kernels.
Transactions of the American Mathematical Society,
68(3), 337–404.
Arora, S., Cohen, N., & Hazan, E. (2018a). On the
optimization of deep networks: Implicit acceleration by
overparameterization. International Conference on Machine
Learning, 244–253.
Arora, S., Ge, R., Neyshabur, B., & Zhang, Y. (2018b).
Stronger generalization bounds for deep nets via a compression
approach. International Conference on Machine Learning,
254–263.
Arora, S., Liang, Y., & Ma, T. (2016). Why are deep nets
reversible: A simple theory, with implications for
training. https://arxiv.org/abs/1511.05653
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler,
H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S.,
Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L.,
Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E.,
Ringwald, M., Rubin, G. M., & Sherlock, G. (2000). Gene
ontology: Tool for the unification of biology. Nature
Genetics, 25, 25–29.
Asmuth, J., Li, L., Littman, M. L., Nouri, A., & Wingate, D.
(2009). A Bayesian sampling approach to exploration
in reinforcement learning. Proceedings of the Conference on
Uncertainty in Artificial Intelligence.
Asuncion, A., & Newman, D. J. (2007a). UCI
machine learning repository.
Asuncion, A., & Newman, D. J. (2007b). UCI
machine learning repository. University of California,
Irvine, School of Information; Computer Sciences. www.ics.uci.edu/~mlearn/MLRepository.html
Athreya, K. B., & Lahiri, S. N. (2006). Measure theory
and probability theory. Springer.
Attwood, T. K., Croning, M. D., Flower, D. R., Lewis, A. P.,
Mabey, J. E., Scordis, P., Selley, J. N., & Wright, W.
(2000). PRINTS-S: The database formerly known as
PRINTS. Nucleic Acids Research,
28(1), 225–227.
Audibert, J.-Y., Munos, R., & Szepesvári, C. (2009).
Exploration-exploitation trade-off using variance estimates in
multi-armed bandits. Theoretical Computer Science.
Audibert, J.-Y., & Bousquet, O. (2007). Combining
PAC-Bayesian and generic chaining bounds.
Journal of Machine Learning Research.
Audibert, J.-Y., & Bubeck, S. (2009). Minimax policies for
adversarial and stochastic bandits. Proceedings of the
Conference on Learning Theory (COLT).
Audibert, J.-Y., & Bubeck, S. (2010). Regret bounds and
minimax policies under partial monitoring. Journal of
Machine Learning Research, 11.
Audibert, J.-Y., Bubeck, S., & Munos, R. (2010). Best arm
identification in multi-armed bandits. Proceedings of the
Conference on Learning Theory (COLT).
Auer, P. (2002). Using confidence bounds for
exploration-exploitation trade-offs. Journal of Machine
Learning Research, 3.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a).
Finite-time analysis of the multiarmed bandit problem.
Machine Learning, 47.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E.
(1995). Gambling in a rigged casino: The adversarial multi-armed
bandit problem. Annual IEEE Symposium on Foundations of
Computer Science.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E.
(2002b). The nonstochastic multiarmed bandit problem. SIAM
Journal on Computing, 32(1).
Auer, P., Cesa-Bianchi, N., & Gentile, C. (2002c). Adaptive
and self-confident on-line learning algorithms. Journal of
Computer and System Sciences, 64.
Auer, P., & Chiang, C.-K. (2016). An algorithm with nearly
optimal pseudo-regret for both stochastic and adversarial
bandits. Proceedings of the Conference on Learning Theory
(COLT).
Auer, P., & Ortner, R. (2010). UCB revisited:
Improved regret bounds for the stochastic multi-armed bandit
problem. Periodica Mathematica Hungarica,
61(1-2), 55–65.
Avner, O., Mannor, S., & Shamir, O. (2012). Decoupling
exploration and exploitation in multi-armed bandits.
Proceedings of the International Conference on Machine
Learning (ICML).
Azuma, K. (1967). Weighted sums of certain dependent random
variables. Tôhoku Mathematical Journal, 19(3).
Bach, F. (2017). On the equivalence between quadrature rules and
random features. Advances in Neural Information Processing
Systems 30, 456–467.
Badanidiyuru, A., Kleinberg, R., & Slivkins, A. (2013).
Bandits with knapsacks. Annual IEEE Symposium on Foundations
of Computer Science.
Bairoch, A., & Apweiler, R. (2000). The
SWISS-PROT protein sequence database and its
supplement TrEMBL in 2000. Nucleic
Acids Research, 28(1), 45–48.
Baldi, P., Sadowski, P., & Whiteson, D. (2014). Searching
for exotic particles in high-energy physics with deep learning.
Nature Communications, 5(1), 1–9.
Balog, M., Salakhutdinov, R., & Ghahramani, Z. (2016).
Mondrian forests for large-scale regression when uncertainty
matters. International Conference on Artificial Intelligence
and Statistics (AISTATS), 1119–1127.
Banerjee, A. (2006a). On Bayesian bounds.
Proceedings of the International Conference on Machine
Learning (ICML).
Banerjee, A. (2006b). On Bayesian bounds.
International Conference on Machine Learning, 81–88.
Banerjee, A., Dhillon, I. S., Ghosh, J., Merugu, S., &
Modha, D. S. (2007). A generalized maximum entropy approach to
Bregman co-clustering and matrix approximation.
Journal of Machine Learning Research, 8.
Barber, D. (2012). Bayesian reasoning and
machine learning. Cambridge University Press.
Barnard, K., Duygulu, P., & Forsyth, D. (2002). Modeling the
statistics of image features and associated text. SPIE
Electronic Imaging 2002, Document Recognition and Retrieval
IX.
Barndorff-Nielsen, O. (2014). Information and exponential
families: In statistical theory. John Wiley & Sons.
Barron, A., & Cover, T. (1991). Minimum complexity density
estimation. IEEE Transactions on Information Theory.
Barron, A., Rissanen, J., & Yu, B. (1998). The minimum
description length principle in coding and modeling. IEEE
Transactions on Information Theory, 44, 2743–2760.
Bartlett, P. L., Boucheron, S., & Lugosi, G. (2001). Model
selection and error estimation. Machine Learning.
Bartlett, P. L., Collins, M., Taskar, B., & McAllester, D.
(2005). Exponentiated gradient algorithms for large-margin
structured classification. Advances in Neural Information
Processing Systems (NeurIPS).
Bartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017).
Spectrally-normalized margin bounds for neural networks.
Advances in Neural Information Processing Systems,
30.
Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A.
(2019). Nearly-tight VC-dimension and
pseudodimension bounds for piecewise linear neural networks.
The Journal of Machine Learning Research,
20(1), 2285–2301.
Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A.
(2020). Benign overfitting in linear regression. Proceedings
of the National Academy of Sciences, 117(48),
30063–30070.
Bartlett, P. L., & Mendelson, S. (2001). Rademacher and
Gaussian complexities: Risk bounds and structural
results. Proceedings of the Conference on Learning Theory
(COLT).
Bartlett, P. L., Montanari, A., & Rakhlin, A. (2021). Deep
learning: A statistical viewpoint. Acta Numerica,
30, 87–201.
Bartlett, P. L., & Tewari, A. (2009). REGAL: A
regularization based algorithm for reinforcement learning in
weakly communicating MDPs. Proceedings of the
Conference on Uncertainty in Artificial Intelligence.
Bartlett, P., Maiorov, V., & Meir, R. (1998). Almost linear
VC dimension bounds for piecewise polynomial networks.
Advances in Neural Information Processing Systems,
11.
Bartók, G., Foster, D., Pál, D., Rakhlin, A., & Szepesvári,
C. (2014). Partial monitoring – classification, regret bounds,
and algorithms. Mathematics of Operations Research,
39(4).
Bartók, G., Pál, D., & Szepesvári, C. (2011). Minimax regret
of finite partial-monitoring games in stochastic environments.
Proceedings of the Conference on Learning Theory
(COLT).
Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L.,
& Sonnhammer, E. L. (2000). The Pfam protein
families database. Nucleic Acids Research,
28(1), 263–266.
Bauer, M., Wilk, M. van der, & Rasmussen, C. E. (2016).
Understanding probabilistic sparse Gaussian process
approximations. Advances in Neural Information Processing
Systems, 29, 1533–1541.
Beal, M. J. (2003). Variational algorithms for approximate
Bayesian inference [PhD thesis]. Gatsby Computational Neuroscience Unit,
University College London.
Becker, R. A. (2012). The variance drain and
Jensen’s inequality. CAEPR Working Paper No.
2012-004. http://dx.doi.org/10.2139/ssrn.2027471
Behboodi, A., Cesa, G., & Cohen, T. S. (2022). A
PAC-Bayesian generalization bound for equivariant
networks. Advances in Neural Information Processing
Systems, 35, 5654–5668.
Bejerano, G., Seldin, Y., Margalit, H., & Tishby, N. (2001).
Markovian domain fingerprinting: Statistical segmentation of
protein sequences. Bioinformatics, 17(10),
927–934.
Bejerano, G., & Yona, G. (1999). Modeling protein families
using probabilistic suffix trees. In S. Istrail, P. Pevzner,
& M. Waterman (Eds.), RECOMB99: 3rd international
conference on computational molecular biology (pp. 15–24).
Bejerano, G., & Yona, G. (2001). Variations on probabilistic
suffix trees: Statistical modeling and prediction of protein
families. Bioinformatics, 17(1), 23–43.
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019).
Reconciling modern machine-learning practice and the classical
bias–variance trade-off. Proceedings of the National Academy
of Sciences, 116(32), 15849–15854.
Ben-David, S., & Luxburg, U. von. (2008). Relating
clustering stability to properties of cluster boundaries.
Proceedings of the Conference on Learning Theory
(COLT).
Ben-David, S., Luxburg, U. von, & Pál, D. (2006). A sober
look on clustering stability. Advances in Neural Information
Processing Systems (NeurIPS).
Ben-David, S., Pál, D., & Simon, H.-U. (2007). Stability of
k-means clustering.
Proceedings of the Conference on Learning Theory
(COLT).
Bengio, Y. (2009). Learning deep architectures for
AI. Foundations and Trends in Machine
Learning.
Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability
based method for discovering structure in clustered data.
Pacific Symposium on Biocomputing.
Berend, D., & Kontorovich, A. (2016). A finite sample
analysis of the naive Bayes classifier. Journal
of Machine Learning Research.
Bergamin, F., Moreno-Muñoz, P., Hauberg, S., & Arvanitidis,
G. (2023). Riemannian Laplace approximations for
Bayesian neural networks. Advances in Neural
Information Processing Systems.
Berger, J. O., Moreno, E., Pericchi, L. R.,
Bayarri, M. J., Bernardo, J. M., Cano, J. A., De la Horra, J.,
Martín, J., Ríos-Insúa, D., Betrò, B., et al. (1994). An
overview of robust Bayesian analysis.
Test, 3(1), 5–124.
Bergmann, S., Stelzer, S., & Strassburger, S. (2014). On the
use of artificial neural networks in simulation-based
manufacturing control. Journal of Simulation,
8(1), 76–90.
Berk, R. H. (1966). Limiting
behavior of posterior distributions when the model is incorrect.
The Annals of Mathematical Statistics, 37(1),
51–58.
Bernardo, J. M., & Smith, A. F. (2009).
Bayesian theory (Vol. 405). John Wiley
& Sons.
Bernstein, S. N. (1946). Probability theory (4th ed.).
Bertin-Mahieux, T. (2011). Year Prediction
MSD. UCI Machine Learning Repository.
Beygelzimer, A., Dasgupta, S., & Langford, J. (2009).
Importance weighted active learning. Proceedings of the
International Conference on Machine Learning (ICML).
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., &
Schapire, R. (2011). Contextual bandit algorithms with
supervised learning guarantees. Proceedings of the
International Conference on Artificial Intelligence and
Statistics (AISTATS).
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., &
Schapire, R. E. (2010). Contextual bandit algorithms with
supervised learning guarantees.
http://arxiv.org/abs/1002.4058.
Bialek, W., Nemenman, I., & Tishby, N. (2001).
Predictability, complexity, and learning. Neural
Computation, 13, 2409–2463.
Bian, Y., & Chen, H. (2021). When does diversity help
generalization in classification ensembles? IEEE
Transactions on Cybernetics.
Bietti, A., & Mairal, J. (2019). Group invariance principles
for causal generative models. The 22nd International
Conference on Artificial Intelligence and Statistics,
557–566.
Bietti, A., Venturi, L., & Bruna, J. (2021). On the sample
complexity of learning under geometric stability. Advances
in Neural Information Processing Systems, 34.
Billingsley, P. (1995). Probability and measure (3rd
ed.). John Wiley & Sons.
Bishop, C. M., & Svensen, M. (2003). Bayesian hierarchical
mixtures of experts. Proceedings of the Conference on
Uncertainty in Artificial Intelligence.
Bishop, C. M. (1998). Latent variable models. In Learning in
graphical models (pp. 371–403). Springer.
Bishop, C. M. (1995a). Neural networks for pattern
recognition. Clarendon Press.
Bishop, C. M. (1995b). Neural networks for pattern
recognition. Oxford University Press.
Bishop, C. M. (2006). Pattern recognition and machine
learning. Springer.
Bissiri, P. G., Holmes, C. C., & Walker, S. G. (2016). A
general framework for updating belief distributions. Journal
of the Royal Statistical Society: Series B (Statistical
Methodology), 78(5), 1103–1130.
Blahut, R. E. (1972). Computation of channel capacity and rate
distortion functions. IEEE Transactions on Information
Theory, 18.
Blanchard, G., & Fleuret, F. (2007). Occam’s hammer.
Proceedings of the Conference on Learning Theory
(COLT).
Blei, D. M. (2014). Build, compute, critique, repeat:
Data analysis with latent variable models.
Annual Review of Statistics and Its Application,
1, 203–232.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017).
Variational inference: A review for statisticians. Journal
of the American Statistical Association, 112(518),
859–877. https://doi.org/10.1080/01621459.2017.1285773
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models.
Proceedings of the 23rd International Conference on Machine
Learning, 113–120.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent
Dirichlet allocation. Journal of Machine
Learning Research, 3, 993–1022.
Blei, D., & Lafferty, J. (2009). Topic models. In A.
Srivastava & M. Sahami (Eds.), Text mining: Theory and
applications. Taylor & Francis.
Blight, B., & Ott, L. (1975). A Bayesian
approach to model inadequacy for polynomial regression.
Biometrika, 62(1), 79–88.
Bloem-Reddy, B., & Teh, Y. W. (2020). Probabilistic
symmetries and invariant neural networks. Journal of Machine
Learning Research, 21(90), 1–61.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D.
(2015). Weight uncertainty in neural networks. International
Conference on Machine Learning, 1613–1622.
Bogachev, V. I. (1998). Gaussian measures
(Vol. 62). American Mathematical Society.
Bogachev, V. I. (2007). Measure theory. Springer.
Bonilla, E. V., Krauth, K., & Dezfouli, A. (2018).
Generic inference in latent Gaussian process
models. https://arxiv.org/abs/1609.00577
Bonilla, E. V., Krauth, K., & Dezfouli, A. (2019). Generic
inference in latent Gaussian process models. Journal of
Machine Learning Research, 20(117), 1–63.
Borchani, H., Martínez, A. M., Masegosa, A. R., Langseth, H.,
Nielsen, T. D., Salmerón, A., Fernández, A., Madsen, A. L.,
& Sáez, R. (2015). Modeling concept drift: A probabilistic
graphical model based approach. International Symposium on
Intelligent Data Analysis, 72–83.
Bork, P. (1992). Mobile modules and motifs.
Curr. Opin. Struct. Biol., 2, 413–421.
Bork, P., & Koonin, E. V. (1996). Protein sequence motifs.
Curr. Opin. Struct. Biol., 6(3), 366–376.
Botev, A., Ritter, H., & Barber, D. (2017). Practical
Gauss-Newton optimisation for deep
learning. International Conference on Machine Learning,
557–565.
Bottou, L. (2010). Large-scale machine learning with stochastic
gradient descent. In Proceedings of COMPSTAT’2010 (pp.
177–186). Springer.
Bottou, L., & Bousquet, O. (2011). The tradeoffs of large
scale learning. In S. Sra, S. Nowozin, & S. J. Wright
(Eds.), Optimization for machine learning (pp.
351–368). MIT Press.
Boucheron, S., Lugosi, G., & Bousquet, O. (2004).
Concentration inequalities. In O. Bousquet, U. v. Luxburg, &
G. Rätsch (Eds.), Advanced lectures in machine
learning. Springer.
Boucheron, S., Lugosi, G., & Bousquet, O. (2005). Theory of
classification: A survey of recent advances. ESAIM:
Probability and Statistics.
Boucheron, S., Lugosi, G., & Massart, P. (2013a).
Concentration inequalities: A nonasymptotic
theory of independence. Oxford University Press.
Boucheron, S., Lugosi, G., & Massart, P. (2013b).
Concentration inequalities: A nonasymptotic theory of
independence. Oxford University Press.
Bousquet, O., & Elisseeff, A. (2002). Stability and
generalization. Journal of Machine Learning Research.
Box, G. E. (1976). Science and statistics. Journal of the
American Statistical Association, 71(356),
791–799.
Boyd, S., & Vandenberghe, L. (2004). Convex
optimization. Cambridge University Press.
Breiman, L. (1996a). Bagging predictors. Machine
Learning, 24(2).
Breiman, L. (1996b). Bagging predictors. Machine
Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning,
45(1), 5–32.
Bresler, G., Mossel, E., & Sly, A. (2008). Reconstruction of
Markov random fields from samples: Some easy
observations and algorithms. 11th
International Workshop, APPROX 2008, and 12th
International Workshop, RANDOM 2008, LNCS 5171.
Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., &
Jordan, M. I. (2013). Streaming variational Bayes.
In Advances in neural information processing systems 26
(pp. 1727–1735). Curran Associates, Inc.
Bronstein, M. M., Bruna, J., Cohen, T., & Veličković, P.
(2021). Geometric deep learning: Grids, groups, graphs,
geodesics, and gauges. arXiv Preprint arXiv:2104.13478.
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011).
Handbook of Markov chain Monte Carlo. CRC Press.
Brost, B., Cox, I. J., Seldin, Y., & Lioma, C. (2016a). An
improved multileaving algorithm for online ranker evaluation.
Proceedings of the 39th International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR
’16). Association for Computing Machinery.
Brost, B., Seldin, Y., Cox, I. J., & Lioma, C. (2016b).
Multi-dueling bandits and their application to online ranker
evaluation. Proceedings of the 25th ACM International
Conference on Information and Knowledge Management (CIKM).
Brown, G. (2009). An information theoretic perspective on
multiple classifier systems. International Workshop on
Multiple Classifier Systems, 344–353.
Brown, G., Wyatt, J. L., & Tiňo, P. (2005). Managing
diversity in regression ensembles. Journal of Machine
Learning Research, 6(Sep), 1621–1650.
Brown, L. D. (1986). Fundamentals of statistical exponential
families: With applications in statistical decision theory.
Bu, Y., Zou, S., & Veeravalli, V. V. (2020). Tightening
mutual information-based bounds on generalization error.
IEEE Journal on Selected Areas in Information Theory,
1(1), 121–130. https://doi.org/10.1109/JSAIT.2020.2991139
Bubeck, S. (2010). Bandits games and clustering
foundations [PhD thesis]. Université Lille.
Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of
stochastic and nonstochastic multi-armed bandit problems.
Foundations and Trends in Machine Learning, 5.
Bubeck, S., Li, Y., & Nagaraj, D. M. (2021). A law of
robustness for two-layers neural networks. Conference on
Learning Theory, 804–820.
Bubeck, S., & Sellke, M. (2023). A universal law of
robustness via isoperimetry. Journal of the ACM,
70(2), 1–18.
Bubeck, S., & Slivkins, A. (2012). The best of both worlds:
Stochastic and adversarial bandits. Proceedings of the
Conference on Learning Theory (COLT).
Buetler, T. M., & Eaton, D. L. (1992). Glutathione
S-transferases: Amino acid sequence comparison,
classification and phylogenetic relationship.
Environ. Carcinogen. Ecotoxicol. Rev., C10,
181–203.
Bui, T. D., Nguyen, C. V., & Turner, R. E. (2016a). Deep
Gaussian processes for regression using approximate
expectation propagation. International Conference on Machine
Learning (ICML), 1472–1481.
Bui, T. D., Yan, J., & Turner, R. E. (2017). A unifying
framework for Gaussian process pseudo-point
approximations using power expectation propagation. Journal
of Machine Learning Research, 18, 1–72.
Bui, T., Hernández-Lobato, D., Hernández-Lobato, J., Li, Y.,
& Turner, R. (2016b). Deep Gaussian processes
for regression using approximate expectation propagation.
International Conference on Machine Learning,
1472–1481.
Buntine, W. (1994). Operations for learning with graphical
models. Journal of Artificial Intelligence Research,
2.
Buntine, W., & Jaakkola, T. (2022). Alpha–divergences,
expectation propagation and Bayesian neural
networks. International Conference on Artificial
Intelligence and Statistics (AISTATS), 1234–1242.
Burt, D. R., & Rasmussen, C. E. (2020). Convergence of
sparse variational inference in Gaussian processes.
Journal of Machine Learning Research, 21(131),
1–63.
Buschjäger, S., Pfahler, L., & Morik, K. (2020). Generalized
negative correlation learning for deep ensembling. arXiv
Preprint arXiv:2011.02952.
Cabañas, R., Martínez, A. M., Masegosa, A. R., Ramos-López, D.,
Salmerón, A., Nielsen, T. D., Langseth, H., & Madsen, A. L.
(2016). Financial data analysis with PGMs using
AMIDST. Data Mining Workshops (ICDMW), 2016
IEEE 16th International Conference On, 1284–1287.
Cantelli, F. (1933). Sulla determinazione empirica delle leggi
di probabilità. Giornale dell’Istituto Italiano degli Attuari, 4.
Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., &
Stoltz, G. (2013). Kullback–Leibler upper
confidence bounds for optimal sequential allocation. The
Annals of Statistics, 41(3).
Casado, I., Ortega, L. A., Masegosa, A. R., & Pérez, A.
(2024). PAC-Bayes-Chernoff bounds for
unbounded losses. Proceedings of the 38th Conference on
Neural Information Processing Systems.
Catoni, O. (2007a). PAC-Bayesian supervised
classification: The thermodynamics of statistical learning.
IMS Lecture Notes Monograph Series, 56.
Catoni, O. (2007b). PAC-Bayesian
supervised classification: The thermodynamics of statistical
learning. arXiv Preprint arXiv:0712.0248.
Cesa-Bianchi, N., & Fischer, P. (1998). Finite-time regret
bounds for the multiarmed bandit problem. Proceedings of the
International Conference on Machine Learning (ICML).
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P.,
Schapire, R. E., & Warmuth, M. K. (1997). How to use expert
advice. Journal of the ACM, 44(3).
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction,
learning, and games. Cambridge University Press.
Cesa-Bianchi, N., & Lugosi, G. (2012). Combinatorial
bandits. Journal of Computer and System Sciences,
78.
Cesa-Bianchi, N., Lugosi, G., & Stoltz, G. (2005).
Minimizing regret with label efficient prediction. IEEE
Transactions on Information Theory, 51.
Cesa-Bianchi, N., Mansour, Y., & Stoltz, G. (2007). Improved
second-order bounds for prediction with expert advice.
Machine Learning, 66.
Cesa-Bianchi, N., & Shamir, O. (2017). Bandit regret
scaling with the effective loss range.
https://arxiv.org/abs/1705.05091.
Chafaï, D. (2004). Entropies, convexity, and functional
inequalities, on Φ-entropies and Φ-Sobolev
inequalities. Journal of Mathematics of Kyoto
University, 44(2), 325–363.
Chaitin, G. J. (1966). On the length of programs for computing
finite binary sequences. Journal of the Association for
Computing Machinery, 13, 547–569.
Chandra, A., & Yao, X. (2004). DIVACE: Diverse and accurate
ensemble learning algorithm. International Conference on
Intelligent Data Engineering and Automated Learning,
619–625.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A
library for support vector machines. ACM Transactions on
Intelligent Systems and Technology, 2.
Chapelle, O., & Li, L. (2011). An empirical evaluation of
Thompson sampling. Advances in Neural Information Processing
Systems (NeurIPS).
Chaudhari, P., & Soatto, S. (2018). Stochastic gradient
descent performs variational inference, converges to limit
cycles for deep networks. International Conference on
Learning Representations. https://arxiv.org/abs/1710.11029
Chechik, G., Globerson, A., Tishby, N., & Weiss, Y. (2005).
Gaussian information bottleneck. Journal of
Machine Learning Research, 6, 165–188.
Chechik, G., & Tishby, N. (2002). Extracting relevant
structures with side information. Advances in Neural
Information Processing Systems (NeurIPS).
Chen, B., & Frazier, P. I. (2017). Dueling bandits with weak
regret. Proceedings of the International Conference on
Machine Learning (ICML).
Chen, C.-P., & Qi, F. (2003). The best lower and upper
bounds of harmonic sequence.
Chen, S., Dobriban, E., & Lee, J. H. (2020). A
group-theoretic framework for data augmentation. The Journal
of Machine Learning Research, 21(1), 9885–9955.
Chen, S., Dobriban, E., & Lee, J. H. (2019). Invariance
reduces variance: Understanding data augmentation in deep
learning and beyond. arXiv Preprint arXiv:1907.10905.
Chen, T., Fox, E., & Guestrin, C. (2014). Stochastic
gradient Hamiltonian Monte Carlo. International Conference
on Machine Learning, 1683–1691.
Chen, T., & Guestrin, C. (2016). XGBoost. Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. https://doi.org/10.1145/2939672.2939785
Chen, X., Irie, K., Banks, D., Haslinger, R., Thomas, J., &
West, M. (2018). Scalable Bayesian modeling,
monitoring, and analysis of dynamic network flow data.
Journal of the American Statistical Association,
113(522), 519–533.
Cheng, C.-A., & Boots, B. (2016). Incremental variational
sparse Gaussian process regression. Advances in
Neural Information Processing Systems, 29,
4403–4411.
Cheng, C.-A., & Boots, B. (2017). Variational inference for
Gaussian process models with linear complexity.
Advances in Neural Information Processing Systems,
30, 5184–5194.
Cheng, Y., & Church, G. M. (2000). Biclustering of
expression data. Proceedings of the 8th
International Conference on Intelligent Systems for Molecular
Biology (ISMB).
Chérief-Abdellatif, B.-E., & Alquier, P. (2019).
MMD-Bayes: Robust Bayesian estimation
via maximum mean discrepancy. arXiv Preprint
arXiv:1909.13339.
Chernoff, H. (1952a). A measure of asymptotic efficiency for
tests of a hypothesis based on the sum of observations.
Annals of Mathematical Statistics, 23.
Chernoff, H. (1952b). A measure of asymptotic efficiency for
tests of a hypothesis based on the sum of observations. The
Annals of Mathematical Statistics, 493–507.
Cho, H., & Dhillon, I. S. (2008). Co-clustering of human
cancer microarrays using minimum sum-squared residue
co-clustering. IEEE/ACM Transactions on Computational
Biology and Bioinformatics (TCBB), 5(3).
Cho, H., Dhillon, I. S., Guan, Y., & Sra, S. (2004). Minimum
sum-squared residue co-clustering of gene expression data.
Proceedings of the Fourth SIAM International Conference on
Data Mining.
Cho, Y., & Saul, L. (2009). Kernel methods for deep
learning. Advances in Neural Information Processing
Systems, 22.
Choi, T., Ramamoorthi, R., et al.
(2008). Remarks on consistency of posterior distributions. In
Pushing the limits of contemporary statistics: Contributions
in honor of Jayanta K. Ghosh (pp. 170–186). Institute of
Mathematical Statistics.
Chow, C. K., & Liu, C. N. (1968). Approximating discrete
probability distributions with dependence trees. IEEE
Transactions on Information Theory, IT-14(3),
462–467.
Chui, C. (1992). An introduction to wavelets.
Chung, F., & Lu, L. (2006). Concentration inequalities and
martingale inequalities: A survey. Internet
Mathematics, 3(2).
Claesen, M., De Smet, F., Suykens, J. A. K., & De Moor, B.
(2014). EnsembleSVM: A library for ensemble
learning using support vector machines. Journal of Machine
Learning Research, 15(1).
Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel
mixture of SVMs for very large scale problems.
Neural Computation, 14(5).
Composite Learning for Artificial
Cognitive Systems (CompLACS). (n.d.).
Conway, J. B. (1990). A course in functional analysis
(2nd ed.). Springer.
Corfield, D., Schölkopf, B., & Vapnik, V. N. (2009).
Falsification and statistical learning theory: Comparing the
Popper and Vapnik-Chervonenkis
dimensions. Journal for General Philosophy of Science,
40, 51–58.
Cortes, C., & Vapnik, V. (1995). Support-vector networks.
Machine Learning, 20(3).
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J.
(2009). Modeling wine preferences by data mining from
physicochemical properties. https://archive.ics.uci.edu/ml/datasets/wine+quality
Cover, T. M. (1972). Admissibility properties of Gilbert's encoding for unknown source
probabilities. IEEE Transactions on Information Theory,
18.
Cover, T. M., & Thomas, J. A. (1991). Elements of
information theory. John Wiley & Sons.
Cover, T. M., & Thomas, J. A. (2006). Elements of
information theory (2nd ed.). Wiley Series in
Telecommunications; Signal Processing.
Cover, T. M., & Thomas, J. A. (2012). Elements of
information theory. John Wiley & Sons.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., &
Spiegelhalter, D. J. (2007). Probabilistic networks and
expert systems. Exact computational methods for Bayesian
networks. Springer.
Cramér, H. (1938). Sur un nouveau théorème-limite
de la théorie des probabilités.
Actual. Sci. Ind., 736, 5–23.
Crammer, K., Mohri, M., & Pereira, F. (2009).
Gaussian margin machines. Proceedings on the
International Conference on Artificial Intelligence and
Statistics (AISTATS).
Cristianini, N., & Shawe-Taylor, J. (2000). An
introduction to support vector machines. Cambridge
University Press.
Csiszár, I. (1974). On the computation of rate distortion
functions. IEEE Transactions on Information Theory,
20, 122–124.
Csiszár, I., & Tusnády, G. (1984). Information geometry and
alternating minimization procedures. Statistics &
Decisions, Supplement Issue 1.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V.
(2019). AutoAugment: Learning augmentation strategies from data.
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 113–123.
Cunningham, P., & Carney, J. (2000). Diversity versus
quality in classification ensembles based on feature selection.
European Conference on Machine Learning, 109–116.
Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M.
(2017). Random feature expansions for deep Gaussian
processes. International Conference on Machine
Learning, 884–893.
Da Prato, G., & Zabczyk, J. (2014). Stochastic equations
in infinite dimensions (2nd ed.). Cambridge University
Press.
Dai, Z., Damianou, A., González, J., & Lawrence, N. (2016).
Variational auto-encoded deep Gaussian processes.
4th International Conference on Learning Representations,
ICLR 2016.
Damianou, A., & Lawrence, N. D. (2013). Deep
Gaussian processes. Artificial Intelligence and
Statistics, 207–215.
Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic
linear optimization under bandit feedback. Proceedings of
the Conference on Learning Theory (COLT).
Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., & Ré, C.
(2019). A kernel theory of modern data augmentation.
International Conference on Machine Learning,
1528–1537.
Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer,
M., & Hennig, P. (2021a). Laplace Redux -
Effortless Bayesian deep learning.
Advances in Neural Information Processing Systems,
34, 20089–20103.
Daxberger, E., Nalisnick, E., Allingham, J. U., Antoran, J.,
& Hernandez-Lobato, J. M. (2021b). Bayesian
deep learning via subnetwork inference. International
Conference on Machine Learning, 2510–2521.
Deisenroth, M. P., Fox, D., & Rasmussen, C. E. (2015).
Gaussian processes for data-efficient learning
in robotics and control (Vol. 118). Springer.
Dembo, A., & Zeitouni, O. (1998). Large deviations
techniques and applications (2nd ed., Vol. 38). Springer.
https://doi.org/10.1007/978-1-4612-5320-3
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum
likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, B,
39(1), 1–38.
Deng, L. (2012). The MNIST database of handwritten digit images
for machine learning research. IEEE Signal Processing
Magazine, 29, 141–142.
Deng, Z., Zhou, F., & Zhu, J. (2022). Accelerated linearized
Laplace approximation for Bayesian
deep learning. Advances in Neural Information Processing
Systems, 35, 2695–2708.
Deng, Z., & Zhu, J. (2023). BayesAdapter: Being
Bayesian, inexpensively and reliably, via
Bayesian fine-tuning. Asian Conference on
Machine Learning, 280–295.
Derbeko, P., El-Yaniv, R., & Meir, R. (2004). Explicit
learning curves for transduction and application to clustering
and compression algorithms. Journal of Artificial
Intelligence Research, 22.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A
probabilistic theory of pattern recognition. Springer.
Devroye, L., & Lugosi, G. (2001). Combinatorial methods
in density estimation. Springer.
Dhillon, I. S., Mallela, S., & Modha, D. S. (2003).
Information-theoretic co-clustering. Proceedings of the
International Conference on Knowledge Discovery and Data Mining
(ACM SIGKDD).
Dietterich, T. G. (2000). Ensemble methods in machine learning.
International Workshop on Multiple Classifier Systems,
1–15.
Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal
nonnegative matrix tri-factorizations for clustering.
Proceedings of the International Conference on Knowledge
Discovery and Data Mining (ACM SIGKDD).
Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017).
Sharp minima can generalize for deep nets. Proceedings of
the 34th International Conference on Machine Learning-Volume
70, 1019–1028.
Do, M. N., & Vetterli, M. (2000). Texture similarity
measurement using Kullback-Leibler distance on wavelet subbands.
In Proceedings of IEEE International Conference on Image
Processing, ICIP-2000.
Domingues, R., Michiardi, P., Zouaoui, J., & Filippone, M.
(2018). Deep Gaussian process autoencoders for
novelty detection. Machine Learning, 107(8),
1363–1383.
Donoho, D., & Huo, X. (2001). Beamlets and multiscale image
analysis. Lecture Notes in Computational Science and
Engineering: Multiscale and Multiresolution Methods.
Springer.
Donsker, M. D., & Varadhan, S. R. S. (1975a). Asymptotic
evaluation of certain Markov process expectations
for large time. Communications on Pure and Applied
Mathematics, 28.
Donsker, M. D., & Varadhan, S. R. S. (1975b). Asymptotic
evaluation of certain Markov process expectations for large
time, I. Communications on Pure and Applied
Mathematics, 28(1), 1–47.
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable
machine learning. arXiv e-Prints.
Doucet, A., Godsill, S., & Andrieu, C. (2000). On sequential
Monte Carlo sampling methods for
Bayesian filtering. Statistics and
Computing, 10(3), 197–208.
Drucker, H., & Le Cun, Y. (1992). Improving generalization
performance using double backpropagation. IEEE Transactions
on Neural Networks, 3(6), 991–997.
Du, S. S., Zhai, X., Poczos, B., & Singh, A. (2019).
Gradient descent provably optimizes over-parameterized neural
networks. International Conference on Learning
Representations. https://openreview.net/forum?id=S1eK3i09YQ
Dua, D., & Graff, C. (2019). UCI machine
learning repository. University of California, Irvine,
School of Information; Computer Sciences. http://archive.ics.uci.edu/ml
Duda, R., & Hart, P. (1973). Pattern classification and
scene analysis. Wiley-Interscience.
Duda, R., Hart, P., & Stork, D. (2001). Pattern
classification. John Wiley & Sons.
Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., &
Zoghi, M. (2015). Contextual dueling bandits. Proceedings of
the Conference on Learning Theory (COLT).
Dudík, M., Phillips, S. J., & Schapire, R. E. (2007).
Maximum entropy density estimation with generalized
regularization and an application to species distribution
modeling. Journal of Machine Learning Research,
8.
Dupuis, P., & Ellis, R. S. (1997). A weak convergence
approach to the theory of large deviations.
Wiley-Interscience.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998).
Biological sequence analysis: Probabilistic models of
proteins and nucleic acids. Cambridge University Press.
Durrett, R. (2019). Probability: Theory and examples
(5th ed.). Cambridge University Press.
Dusenberry, M. W., Jerfel, G., Wen, Y., Ma, Y., Snoek, J.,
Heller, K., Lakshminarayanan, B., & Tran, D. (2020).
Efficient and scalable Bayesian neural nets with
rank-1 factors. arXiv Preprint arXiv:2005.07186.
Dutordoir, V., Durrande, N., & Hensman, J. (2020). Sparse
Gaussian processes with spherical harmonic
features. International Conference on Machine Learning,
2793–2802.
Duvenaud, D., Rippel, O., Adams, R., & Ghahramani, Z.
(2014). Avoiding pathologies in very deep networks.
Artificial Intelligence and Statistics, 202–210.
Dwivedi, R., Khamaru, K., Wainwright, M.
J., Jordan, M. I., et al. (2018). Theoretical guarantees
for EM under misspecified Gaussian mixture models.
Advances in Neural Information Processing Systems,
9681–9689.
Dziugaite, G. K., & Roy, D. M. (2017). Computing nonvacuous
generalization bounds for deep (stochastic) neural networks with
many more parameters than training data. arXiv Preprint
arXiv:1703.11008.
E, W., Engquist, B., Li, X., Ren, W., & Vanden-Eijnden, E.
(2004). The heterogeneous multiscale method: A review.
Eckhardt, D. E., & Lee, L. D. (1985). A theoretical basis
for the analysis of multiversion software subject to coincident
errors. IEEE Transactions on Software Engineering,
SE-11(12).
Elesedy, B. (2021). Provably strict generalisation benefit for
invariance in kernel methods. Advances in Neural Information
Processing Systems, 34.
Elesedy, B. (2022a). Group symmetry in PAC
learning. ICLR 2022 Workshop on Geometrical and Topological
Representation Learning.
Elesedy, B., & Zaidi, S. (2021a). Provably strict
generalisation benefit for equivariant models. International
Conference on Machine Learning, 2959–2969.
Ellis, R. S. (2012). Entropy, large deviations, and
statistical mechanics (Vol. 271). Springer Science &
Business Media.
El-Yaniv, R., Fine, S., & Tishby, N. (1998). Agnostic
classification of Markovian sequences. Advances in Neural
Information Processing Systems (NeurIPS).
El-Yaniv, R., & Souroujon, O. (2001). Iterative double
clustering for unsupervised and semi-supervised learning.
Advances in Neural Information Processing Systems
(NeurIPS).
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., & Ouzounis,
C. A. (1999). Protein interaction maps for complete genomes
based on gene fusion events. Nature,
402(6757), 86–90.
Erven, T. van, Kotłowski, W., & Warmuth, M. K. (2014).
Follow the leader with dropout perturbations. Proceedings of
the Conference on Learning Theory (COLT).
Eskin, E., Grundy, W., & Singer, Y. (2000). Protein family
classification using sparse Markov transducers.
ISMB2000.
Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action
elimination and stopping conditions for the multi-armed bandit
and reinforcement learning problems. Journal of Machine
Learning Research, 7.
Fard, M. M., & Pineau, J. (2010). PAC-Bayesian
model selection for reinforcement learning. Advances in
Neural Information Processing Systems (NeurIPS).
Feinholz, L. (1979). Estimation of the performance of
partitioning algorithms in pattern classification [Master’s
thesis]. Department of Mathematics, McGill University.
Feldman, V. (2020). Does learning require memorization? A short
tale about a long tail. Proceedings of the 52nd Annual ACM
SIGACT Symposium on Theory of Computing, 954–959.
Feldman, V., & Zhang, C. (2020). What neural networks
memorize and why: Discovering the long tail via influence
estimation. Advances in Neural Information Processing
Systems, 33, 2881–2891.
Feller, W. (1971). An introduction to probability theory and
its applications. Wiley-Interscience.
Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A
discriminatively trained, multiscale, deformable part model.
IEEE Conference on Computer Vision and Pattern Recognition,
2008. CVPR 2008.
Feng, J., & Kurtz, T. G. (2006). Large deviations for
stochastic processes (Vol. 131). American Mathematical
Society. https://doi.org/10.1090/surv/131
Fey, M., & Lenssen, J. E. (2019). Fast graph
representation learning with PyTorch geometric.
Representation Learning on Graphs and Manifolds Workshop, ICLR.
https://arxiv.org/abs/1903.02428
Filippone, M., & Engler, U. (2015). Pseudo-marginal
Bayesian inference for Gaussian
processes. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 37(3), 546–560.
Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical
Hidden Markov Model:
Analysis and applications. Machine Learning,
32, 41–62.
Fisher, R. A. (1922). On the mathematical foundations of
theoretical statistics. Philosophical Transactions of the
Royal Society of London, Series A 222, 309–368.
Fisher, R. A. (1925). Theory of statistical estimation.
Trans. Cambridge Philos. Soc., 22, 700.
Flam-Shepherd, D., Requeima, J., & Duvenaud, D. (2017).
Mapping Gaussian process priors to
Bayesian neural networks. NIPS
Bayesian Deep Learning Workshop, 3.
Föll, R., & Steinwart, I. (2019).
PAC-Bayesian bounds for deep
Gaussian processes. arXiv Preprint
arXiv:1909.09985.
Foong, A. Y., Li, Y., Hernández-Lobato, J. M., & Turner, R.
E. (2019). “In-Between”
Uncertainty in Bayesian neural
networks. ICML Workshop on Uncertainty and Robustness in
Deep Learning.
Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2021).
Sharpness-aware minimization for efficiently improving
generalization. International Conference on Learning
Representations. https://openreview.net/forum?id=6Tm1mposlrM
Fort, S., Hu, H., & Lakshminarayanan, B. (2019). Deep
ensembles: A loss landscape perspective. arXiv Preprint
arXiv:1912.02757.
Fortunati, S., Gini, F., Greco, M. S., & Richmond, C. D.
(2017). Performance bounds for parameter estimation under
misspecified models: Fundamental findings and applications.
IEEE Signal Processing Magazine, 34(6),
142–157.
Foster, D. P., & Rakhlin, A. (2012). No internal regret via
neighborhood watch. Proceedings on the International
Conference on Artificial Intelligence and Statistics
(AISTATS).
Freedman, D. A. (1975). On tail probabilities for martingales.
The Annals of Probability, 3(1).
Freidlin, M. I., & Wentzell, A. D. (1984). Random
perturbations of dynamical systems (Vol. 260). Springer. https://doi.org/10.1007/978-1-4612-0197-6
Freitag, D. (2004). Trained named entity recognition using
distributional clusters. Proceedings of EMNLP.
Frejstrup Maibing, S., & Igel, C. (2015). Computational
complexity of linear large margin classification with ramp loss.
Proceedings on the International Conference on Artificial
Intelligence and Statistics (AISTATS).
Freund, Y., & Ron, D. (1995). Learning to model sequences
generated by switching distributions. COLT ’95: Proceedings
of the Eighth Annual Conference on Computational Learning
Theory, 8, 41–50.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new
boosting algorithm. Proceedings of the International
Conference on Machine Learning (ICML).
Freund, Y., Schapire, R., & Abe, N. (1999). A short
introduction to boosting. Journal-Japanese Society For
Artificial Intelligence, 14(5), 771–780.
Friedman, B. (1990). Principles and techniques of applied
mathematics. Courier Dover Publications.
Friedman, J. H. (2002). Stochastic gradient boosting.
Computational Statistics & Data Analysis,
38(4), 367–378.
Friedman, N. (1998). The Bayesian structural
EM algorithm. Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 129–138.
Friedman, N., & Koller, D. (2003). Being Bayesian about
network structure: A Bayesian approach to structure discovery in
Bayesian networks. Machine Learning Journal.
Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005).
Mining data streams: A review. ACM Sigmod
Record, 34(2), 18–26.
Gabillon, V., Ghavamzadeh, M., & Lazaric, A. (2012). Best
arm identification: A unified approach to fixed budget and fixed
confidence. Advances in Neural Information Processing
Systems (NeurIPS).
Gaillard, P., Stoltz, G., & Erven, T. van. (2014). A
second-order bound with excess losses. Proceedings of the
Conference on Learning Theory (COLT).
Gajane, P., Urvoy, T., & Clérot, F. (2015). A relative
exponential weighing algorithm for adversarial utility-based
dueling bandits. Proceedings of the International Conference
on Machine Learning (ICML).
Gal, Y. (2016). Uncertainty in deep learning [PhD
thesis]. University of Cambridge.
Gal, Y., & Turner, R. (2015a). Improving the
Gaussian process sparse spectrum approximation by
representing uncertainty in frequency inputs. International
Conference on Machine Learning, 655–664.
Gal, Y., & Turner, R. E. (2015b). Improving the
Gaussian process sparse spectrum approximation by
representing uncertainty in frequency inputs. International
Conference on Machine Learning (ICML), 655–664.
Gama, J., & Rodrigues, P. P. (2009). An overview on mining
data streams. In Foundations of computational
intelligence, Volume 6 (pp. 29–45). Springer.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., &
Bouchachia, A. (2014). A survey on concept drift adaptation.
ACM Computing Surveys, 46(4),
44:1–44:37.
Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., &
Wilson, A. G. (2018). GPyTorch: Blackbox
matrix-matrix Gaussian process inference with
GPU acceleration. Advances in Neural
Information Processing Systems 31.
Garivier, A., & Cappé, O. (2011). The KL-UCB
algorithm for bounded stochastic bandits and beyond.
Proceedings of the Conference on Learning Theory
(COLT).
Gastpar, M., Nachum, I., Shafer, J., & Weinberger, T.
(2024b). Fantastic generalization measures are nowhere to be
found. Proceedings of the 12th International Conference on
Learning Representations (ICLR 2024).
Gastpar, M., Nachum, I., Shafer, J., & Weinberger, T.
(2024a). Fantastic generalization measures are nowhere to be
found. International Conference on Learning
Representations.
Gelfand, A. E., & Smith, A. F. (1990). Sampling-based
approaches to calculating marginal densities. Journal of the
American Statistical Association, 85(410),
398–409.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari,
A., & Rubin, D. B. (2013). Bayesian data
analysis (3rd ed.). CRC Press.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural
networks and the bias/variance dilemma. Neural
Computation, 4(1), 1–58.
Geng, C., Wang, J., Gao, Z., Frellsen, J., & Hauberg, S.
(2021). Bounds all around: Training energy-based models with
bidirectional bounds. Advances in Neural Information
Processing Systems (NeurIPS) 34.
Genuer, R. (2012). Variance reduction in purely random forests.
Journal of Nonparametric Statistics, 24(3),
543–562.
George, T., & Merugu, S. (2005). A scalable collaborative
filtering framework based on co-clustering. Proceedings of
the Fifth IEEE International Conference on Data Mining
(ICDM'05).
Gerchinovitz, S., & Lattimore, T. (2016). Refined lower
bounds for adversarial bandits. Advances in Neural
Information Processing Systems (NeurIPS).
Germain, P., Bach, F., Lacoste, A., & Lacoste-Julien, S.
(2016). PAC-Bayesian theory meets
Bayesian inference. Advances in Neural
Information Processing Systems, 1884–1892.
Germain, P., Lacasse, A., Laviolette, F., & Marchand, M.
(2006). PAC-Bayes risk bounds for
general loss functions. Advances in Neural Information
Processing Systems (NeurIPS).
Germain, P., Lacasse, A., Laviolette, F., & Marchand, M.
(2009b). PAC-Bayesian learning of
linear classifiers. Proceedings of the 26th International
Conference on Machine Learning (ICML 2009), 353–360.
Germain, P., Lacasse, A., Laviolette, F., & Marchand, M.
(2009a). PAC-Bayesian learning of linear
classifiers. Proceedings of the International Conference on
Machine Learning (ICML).
Germain, P., Lacasse, A., Laviolette, F., Marchand, M., &
Roy, J.-F. (2015). Risk bounds for the majority vote: From a
PAC-Bayesian analysis to a learning algorithm.
Journal of Machine Learning Research, 16.
Germain, P., Lacoste, A., Laviolette, F., Marchand, M., &
Shanian, S. (2011). A PAC-Bayes Sample
Compression Approach to Kernel Methods. Proceedings
of the International Conference on Machine Learning (ICML).
Getoor, L., Friedman, N., Koller, D., Pfeffer, A., & Taskar,
B. (2007). Probabilistic relational models. In L. Getoor &
B. Taskar (Eds.), Introduction to statistical relational
learning. MIT Press.
Ghahramani, Z. (2015). Probabilistic machine learning and
artificial intelligence. Nature, 521(7553),
452–459.
Ghahramani, Z., & Attias, H. (2000). Online variational
Bayesian learning. Slides from Talk Presented
at NIPS Workshop on Online Learning.
Gilbert, A. C., Zhang, Y., Lee, K., Zhang, Y., & Lee, H.
(2017). Towards understanding the invertibility of
convolutional neural networks. Proceedings of the International
Joint Conference on Artificial Intelligence (IJCAI), 1703–1710. https://doi.org/10.24963/ijcai.2017/236
Gilbert, E. N. (1971). Codes based on inaccurate source
probabilities. IEEE Transactions on Information Theory,
17(3).
Gilks, W. R., Richardson, S., & Spiegelhalter, D. (1995).
Markov chain Monte Carlo in practice. CRC Press.
Girard, A., Rasmussen, C. E., Quiñonero-Candela, J., &
Murray-Smith, R. (2003). Gaussian process priors
with uncertain inputs — application to multiple-step ahead time
series forecasting. Advances in Neural Information
Processing Systems 15 (NIPS 2002), 529–536.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014).
Rich feature hierarchies for accurate object detection and
semantic segmentation. Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 580–587.
Glivenko, V. (1933). Sulla determinazione empirica di
probabilità. G. Inst. Ital. Attuari, 4.
Gneiting, T., & Raftery, A. E. (2007). Strictly proper
scoring rules, prediction, and estimation. Journal of the
American Statistical Association, 102, 359–378.
Goldberger, J., Greenspan, H., & Gordon, S. (2002).
Unsupervised image clustering using the information bottleneck
method. DAGM-Symposium, 158–165.
Goldman, S. A., & Kearns, M. J. (1995). On the complexity of
teaching. Journal of Computer and System Sciences,
50.
Golowich, N., Rakhlin, A., & Shamir, O. (2018).
Size-independent sample complexity of neural networks.
Conference on Learning Theory, 297–299.
Golub, G. H., & Van Loan, C. F. (1996). Matrix
computations (3rd ed.).
The Johns Hopkins University Press.
Goodfellow, I., Bengio, Y., & Courville, A.
(2016). Deep learning (Vol. 1). MIT Press.
Goodfellow, I., Lee, H., Le, Q., Saxe, A., & Ng, A. (2009).
Measuring invariances in deep networks. Advances in Neural
Information Processing Systems, 22.
Gouk, H., Frank, E., Pfahringer, B., & Cree, M. J. (2021).
Regularisation of neural networks by enforcing Lipschitz
continuity. Machine Learning, 110, 393–416.
Graepel, T., Herbrich, R., & Shawe-Taylor, J. (2005).
PAC-Bayesian compression bounds on the prediction
error of learning algorithms for classification. Machine
Learning, 59(1-2).
Graves, A. (2011). Practical variational inference for neural
networks. Advances in Neural Information Processing
Systems, 24, 2348–2356.
Gray, R. M. (2011). Entropy and information theory (2nd
ed.). Springer.
Grimmett, G. R., & Stirzaker, D. R. (2001). Probability
and random processes (3rd ed.). Oxford University Press.
DeGroot, M. H. (1970). Optimal statistical decisions.
McGraw-Hill.
Gross, L. (1967). Abstract Wiener spaces.
Gross, L. (2011). Lectures on the Gaussian measure.
Mathematical Notes.
Grünwald, P. (2007a). The minimum description length
principle. MIT Press.
Grünwald, P. (2012). The safe Bayesian: Learning
the learning rate via the mixability gap. International
Conference on Algorithmic Learning Theory, 169–183.
Grünwald, P. (2018). Safe probability. Journal of
Statistical Planning and Inference, 195, 47–63.
Grünwald, P. D. (2007b). The minimum description length
principle. MIT Press.
Grünwald, P. D., & Mehta, N. A. (2016). Fast rates for
general unbounded loss functions: From ERM to generalized
Bayes. arXiv Preprint arXiv:1605.00252.
Grünwald, P., & Van Ommen, T. (2017). Inconsistency of
Bayesian inference for misspecified linear models,
and a proposal for repairing it. Bayesian
Analysis, 12(4), 1069–1103.
Guedj, B. (2019). A primer on
PAC-Bayesian learning. arXiv
Preprint arXiv:1901.05353.
Gummadi, K. P., Saroiu, S., & Gribble, S. D. (2002). King:
Estimating latency between arbitrary internet end hosts.
Proceedings of the 2nd ACM
SIGCOMM Workshop on Internet Measurement (IMW-2002).
Gunasekar, S., Lee, J. D., Soudry, D., & Srebro, N. (2018).
Implicit bias of gradient descent on linear convolutional
networks. Advances in Neural Information Processing
Systems, 31.
Gunsel, B., Ferman, A., & Tekalp, A. (1998). Temporal video
segmentation using unsupervised clustering and semantic object
tracking. Journal of Electronic Imaging, 7(3),
592–604.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On
calibration of modern neural networks. International
Conference on Machine Learning, 1321–1330.
Gupta, S., & Gupta, A. (2019). Dealing with noise problem in
machine learning data-sets: A systematic review. Procedia
Computer Science, 161, 466–474.
Guyon, I., Luxburg, U. von, & Williamson, R. C. (2009).
Clustering: Science or art? Towards principled approaches.
NIPS workshop.
Hajjo, R., Sabbah, D. A., Bardaweel, S. K., & Tropsha, A.
(2021). Identification of tumor-specific MRI biomarkers using
machine learning (ML). Diagnostics, 11(5),
742.
Hamilton, J. D. (1994). Time series analysis (Vol. 2).
Princeton University Press.
Hansen, L. K., & Salamon, P. (1990). Neural network
ensembles. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 12(10).
Hardt, M., & Ma, T. (2017). Identity matters in deep
learning. International Conference on Learning
Representations.
Harries, M. (1999). Splice-2 comparative evaluation:
Electricity pricing (UNSW-CSE-TR-9905). School of
Computer Science and Engineering, The University of New South Wales.
Hartigan, J. A. (1972). Direct clustering of a data matrix.
Journal of the American Statistical Association,
67(337).
Hasenclever, L., Webb, S., Lienart, T., Vollmer, S.,
Lakshminarayanan, B., Blundell, C., & Teh, Y. W. (2017).
Distributed Bayesian learning with stochastic
natural gradient expectation propagation and the posterior
server. Journal of Machine Learning Research,
18(106), 1–37.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The
elements of statistical learning: Data mining, inference, and
prediction. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The
elements of statistical learning: Data mining, inference, and
prediction (2nd ed.). Springer.
Hastings, W. K. (1970). Monte Carlo sampling methods using
Markov chains and their applications.
Havasi, M., Hernández-Lobato, J. M., & Murillo-Fuentes, J.
J. (2018). Inference in deep Gaussian processes
using stochastic gradient Hamiltonian
Monte Carlo. Advances in Neural
Information Processing Systems, 31.
Hayes, J. D., & Pulford, D. J. (1995). The glutathione
S-transferase supergene family: Regulation of
GST and the contribution of the isoenzymes to
cancer chemoprotection and drug resistance.
Crit. Rev. Biochem. Mol. Biol., 30(6), 445–600.
Hazan, E., & Kale, S. (2009). Better algorithms for benign
bandits. Proceedings of the Annual ACM-SIAM Symposium on
Discrete Algorithms (SODA).
Hazan, E., & Kale, S. (2011). Better algorithms for benign
bandits. Journal of Machine Learning Research,
12.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep
into rectifiers: Surpassing human-level performance on ImageNet
classification. Proceedings of the IEEE International
Conference on Computer Vision, 1026–1034.
He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep residual
learning for image recognition. Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
770–778.
He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity
mappings in deep residual networks. European Conference on
Computer Vision, 630–645.
He, X., Zemel, R., & Carreira-Perpinan, M. (2004).
Multiscale conditional random fields for image labelling.
CVPR-2004: IEEE Conference on Computer Vision and Pattern
Recognition.
Heckerman, D., Geiger, D., & Chickering, D. M. (1995a).
Learning Bayesian networks: The combination of
knowledge and statistical data. Machine Learning,
20(3), 197–243.
Heckerman, D., Meek, C., & Koller, D. (2007). Probabilistic
entity-relationship models, PRMs, and plate models.
In L. Getoor & B. Taskar (Eds.), Introduction to
statistical relational learning. MIT Press.
Heckerman, D., Geiger, D., & Chickering, D. (1995b).
Learning Bayesian networks: The combination of knowledge and
statistical data. Machine Learning, 20,
197–243.
Heitz, G., Gould, S., Saxena, A., & Koller, D. (2009).
Cascaded classification models: Combining models for holistic
scene understanding. In D. Koller, D. Schuurmans, Y. Bengio,
& L. Bottou (Eds.), Advances in neural information
processing systems 21.
Hensman, J., Durrande, N., Solin, A., et
al. (2017). Variational Fourier features for
Gaussian processes. Journal of Machine Learning
Research, 18(1), 5537–5588.
Hensman, J., Fusi, N., & Lawrence, N. D. (2013).
Gaussian processes for big data. Conference on
Uncertainty in Artificial Intelligence (UAI), 282–290.
Hensman, J., Matthews, A. G. de G., Filippone, M., &
Ghahramani, Z. (2015). MCMC for variationally sparse
Gaussian processes. https://arxiv.org/abs/1506.04000
Hensman, J., Matthews, A. G. de G., & Filippone, M. (2018).
Variational inference in Gaussian process models
using the differential output training conditional. arXiv
Preprint arXiv:1805.07109.
Herbster, M., & Warmuth, M. K. (1998). Tracking the best
expert. Machine Learning, 32, 151.
Herdegen, M. (2008). The theorem of Bahadur and Rao and large
portfolio losses. Journal of Applied Mathematics,
2011.
Herlocker, J., Konstan, J., Terveen, L., & Riedl, J. (2004).
Evaluating collaborative filtering recommender systems. ACM
Transactions on Information Systems, 22(1).
Hermes, L., Zöller, T., & Buhmann, J. (2002). Parametric
distributional clustering for image segmentation. European
Conference on Computer Vision.
Hernandez-Lobato, J. M., Li, Y., Rowland, M., Bui, T.,
Hernández-Lobato, D., & Turner, R. (2016). Black-box alpha
divergence minimization. International Conference on Machine
Learning, 1511–1520.
Hernández-Lobato, D., Hernández-Lobato, J., & Dupont, P.
(2011). Robust multi-class Gaussian process
classification. Advances in Neural Information Processing
Systems, 24.
Hernández-Lobato, J. M., & Adams, R. (2015). Probabilistic
backpropagation for scalable learning of Bayesian
neural networks. International Conference on Machine
Learning, 1861–1869.
Hernández-Muñoz, G., Villacampa-Calvo, C., &
Hernández-Lobato, D. (2020). Deep Gaussian
processes using expectation propagation and Monte Carlo methods.
Joint European Conference on Machine Learning and Knowledge
Discovery in Databases, 479–494.
Higgs, M., & Shawe-Taylor, J. (2010). A
PAC-Bayes bound for tailored density estimation.
Proceedings of the International Conference on Algorithmic
Learning Theory (ALT).
Hinton, G., Deng, L., Yu, D., Dahl, G. E.,
Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,
Sainath, T. N., et al. (2012). Deep neural networks for
acoustic modeling in speech recognition: The shared views of
four research groups. IEEE Signal Processing Magazine,
29, 82–97.
Hoch, T. (2015). An ensemble learning approach for the Kaggle
taxi travel time prediction challenge. Proceedings of the
2015th International Conference on ECML PKDD Discovery
Challenge-Volume 1526, 52–62.
Hochreiter, S., & Schmidhuber, J. (1997). Flat minima.
Neural Computation, 9(1), 1–42.
Hochstein, S., Barlasov, A., Hershler, O., Nitzan, A., &
Shneor, S. (2004). Rapid vision is holistic. Journal of
Vision, 4(5).
Hoeffding, W. (1963). Probability inequalities for sums of
bounded random variables. Journal of the American
Statistical Association, 58(301), 13–30.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C.
T. (1999). Bayesian model averaging: A tutorial.
Statistical Science, 382–401.
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013).
Stochastic variational inference. Journal of Machine
Learning Research, 14, 1303–1347.
Hofmann, K., Bucher, P., Falquet, L., & Bairoch, A. (1999).
The PROSITE database, its status in 1999.
Nucleic Acids Research, 27(1), 215–219.
Hofmann, T. (1997). Data clustering and beyond: A
deterministic annealing framework for exploratory data
analysis. Shaker Verlag.
Hofmann, T. (1999a). Probabilistic latent semantic analysis.
UAI-1999.
Hofmann, T. (1999b). Probabilistic latent semantic indexing.
Proceedings of the Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval.
Hofmann, T., Puzicha, J., & Buhmann, J. M. (1998).
Unsupervised texture segmentation in a deterministic annealing
framework. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 20(8), 803–818.
Holland, M. (2019). PAC-Bayes under
potentially heavy tails. Advances in Neural Information
Processing Systems, 2711–2720.
Holmes, C. C., & Walker, S. G. (2015). Assigning a value to
a power series: A Bayesian interpretation.
Biometrika, 102, 497–501.
Honkela, A., & Valpola, H. (2003). On-line variational
Bayesian learning. 4th International Symposium
on Independent Component Analysis and Blind Signal
Separation, 803–808.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer
feedforward networks are universal approximators. Neural
Networks, 2(5), 359–366.
Pishro-Nik, H. (2018). Introduction to probability,
statistics, and random processes. https://doi.org/10.25334/Q40H8J
Hu, W., Li, Z., & Yu, D. (2020). Simple and effective
regularization methods for training on noisily labeled data with
generalization guarantee. International Conference on
Learning Representations.
Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., &
Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for
free. arXiv Preprint arXiv:1704.00109.
Hytönen, T., Van Neerven, J., Veraar, M., & Weis, L. (2018).
Analysis in Banach spaces: Volume II: Probabilistic methods
and operator theory (Vol. 67). Springer.
Ibrahim, J. G., & Chen, M.-H. (2000). Power prior
distributions for regression models. Statistical
Science, 46–60.
Ibrahim, J. G., Chen, M.-H., & Sinha, D. (2003). On
optimality properties of the power prior. Journal of the
American Statistical Association, 98(461),
204–213.
Immer, A., Korzepa, M., & Bauer, M. (2021). Improving
predictions of Bayesian neural nets via local
linearization. International Conference on Artificial
Intelligence and Statistics, 703–711.
Insua, D. R., & Ruggeri, F. (2012). Robust
Bayesian analysis (Vol. 152). Springer Science
& Business Media.
Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T.,
Vetrov, D., & Wilson, A. G. (2020). Subspace inference for
Bayesian deep learning. Uncertainty in
Artificial Intelligence, 1169–1179.
Jaakkola, T. S. (2001). Tutorial on variational approximation
methods. In M. Opper & D. Saad (Eds.), Advanced mean
field methods: Theory and practice (pp. 129–160). MIT
Press.
Jaakkola, T., Diekhans, M., & Haussler, D. (1999). Using the
Fisher kernel method to detect remote protein homologies. In
Proceedings of the Seventh International Conference on
Intelligent Systems for Molecular Biology (ISMB).
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent
kernel: Convergence and generalization in neural networks.
Advances in Neural Information Processing Systems,
31.
Jaffe, A., Fetaya, E., Nadler, B., Jiang, T., & Kluger, Y.
(2016). Unsupervised ensemble learning with dependent
classifiers. Proceedings on the International Conference on
Artificial Intelligence and Statistics (AISTATS).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data
clustering: A review. ACM Computing Surveys,
31(3).
Jain, S., Liu, G., Mueller, J., & Gifford, D. (2020).
Maximizing overall diversity for improved uncertainty estimates
in deep ensembles. Proceedings of the AAAI Conference on
Artificial Intelligence, 34, 4264–4271.
Jaksch, T., Ortner, R., & Auer, P. (2010). Near-optimal
regret bounds for reinforcement learning. Journal of Machine
Learning Research, 11.
Jankowiak, M., Pleiss, G., & Gardner, J. R. (2019).
Parametric Gaussian process regressors. arXiv
Preprint arXiv:1910.07123.
Janzing, D., & Schölkopf, B. (2009). Algorithmic
Markov condition for probability-free causal
inference. Proceedings of the Conference on Learning Theory
(COLT).
Jaynes, E. T. (1957). Information theory and statistical
mechanics. Physical Review, 106.
Jeanmougin, F., Thompson, J., Gouy, M., Higgins, D., &
Gibson, T. (1998). Multiple sequence alignment with Clustal
X. Trends Biochem. Sci., 23,
403–405.
Jewson, J., Smith, J. Q., & Holmes, C. (2018). Principles of
Bayesian inference using general divergence
criteria. Entropy, 20(6), 442.
Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., &
Bengio, S. (2020b). Fantastic generalization measures and where
to find them. Proceedings of the 8th International
Conference on Learning Representations (ICLR 2020).
Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., &
Bengio, S. (2020a). Fantastic generalization measures and where
to find them. International Conference on Learning
Representations.
Jiang, Z., Liu, H., Fu, B., & Wu, Z. (2017). Generalized
ambiguity decompositions for classification with applications in
active learning and unsupervised ensemble pruning.
Proceedings of the AAAI Conference on Artificial
Intelligence, 31.
Jordan, M. I. (1999). An introduction to variational methods
for graphical models. Springer.
Jordan, M. I., & Jacobs, R. A. (1993). Hierarchical
mixtures of experts and the EM algorithm
(AIM-1440; p. 29).
Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation
for approximate inference: An empirical study. Proceedings
of the Fifteenth Conference on Uncertainty in Artificial
Intelligence.
Kaelbling, L. P. (1994). Associative reinforcement learning:
Functions in k-DNF.
Machine Learning, 15.
Kalchbrenner, N., & Blunsom, P. (2013). Recurrent continuous
translation models. Proceedings of the 2013 Conference on
Empirical Methods in Natural Language Processing,
1700–1709.
Kale, S. (2014). Multiarmed bandits with limited expert advice.
Proceedings of the Conference on Learning Theory
(COLT).
Karnin, Z., Koren, T., & Somekh, O. (2013). Almost optimal
exploration in multi-armed bandits. Proceedings of the
International Conference on Machine Learning (ICML).
Kárný, M. (2014). Approximate Bayesian recursive
estimation. Information Sciences, 285,
100–111.
Kathuria, T., Deshpande, A., & Singh, P. (2016).
Batson–Spielman–Srivastava sparsification and detour ranking in
determinantal point processes. Advances in Neural
Information Processing Systems 29, 152–160.
Kaufmann, E., Korda, N., & Munos, R. (2012). Thompson
sampling: An optimal finite time analysis. Proceedings of
the International Conference on Algorithmic Learning Theory
(ALT).
Kawaguchi, K., Kaelbling, L. P., & Bengio, Y. (2022).
Generalization in deep learning. In Mathematical aspects of
deep learning. Cambridge University Press. https://doi.org/10.1017/9781009025096.003
Kearns, M. J., & Ron, D. (1999). Algorithmic stability and
sanity-check bounds for leave-one-out cross-validation.
Neural Computation, 11.
Kearns, M. J., & Vazirani, U. V. (1994). An introduction
to computational learning theory. The MIT
Press.
Kearns, M., Mansour, Y., Ng, A., & Ron, D. (1997). An
experimental and theoretical comparison of model selection
methods. Machine Learning, 27.
Kendall, A., & Gal, Y. (2017). What uncertainties do we need
in Bayesian deep learning for computer vision?
Advances in Neural Information Processing Systems,
5574–5584.
Keren, D. (n.d.). Recognizing image style and activities in
video using local features and naive Bayes.
Keshet, J., McAllester, D., & Hazan, T. (2011).
PAC-Bayesian approach for minimization
of phoneme error rate. IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP).
Keshet, J., Shalev-Shwartz, S., Singer, Y., & Chazan, D.
(2005). Phoneme alignment based on discriminative learning.
9th European Conference on Speech Communication and Technology
(INTERSPEECH).
Keskar, N. S., Nocedal, J., Tang, P., Mudigere, D., &
Smelyanskiy, M. (2017). On large-batch training for deep
learning: Generalization gap and sharp minima. Proceedings
of ICLR 2017.
Khan, M. E. E., Immer, A., Abedi, E., & Korzepa, M. (2019).
Approximate inference turns deep networks into
Gaussian processes. Advances in Neural
Information Processing Systems, 32, 3094–3104.
Kilbertus, N., Gomez-Rodriguez, M., Schölkopf, B., Muandet, K.,
& Valera, I. (2019). Improving consequential decision
making under imperfect predictions.
Kim, Y.-D., & Choi, S. (2007). Nonnegative
Tucker decomposition. Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR).
Kingma, D. P., & Ba, J. (2015). Adam: A method
for stochastic optimization. International Conference on
Learning Representations.
Kingma, D. P., & Welling, M. (2014). Auto-encoding
variational Bayes. International Conference on
Learning Representations.
Kleijn, B. J. K., Van der Vaart, A. W., et al. (2012). The
Bernstein-von-Mises theorem under misspecification.
Electronic Journal of Statistics, 6, 354–381.
Klenke, A. (2013). Probability theory: A comprehensive
course (2nd ed.). Springer.
Kluger, Y., Basri, R., Chang, J. T., & Gerstein, M. (2003).
Spectral biclustering of microarray data: Coclustering genes and
conditions. Genome Research.
Knoblauch, J. (2019). Robust deep Gaussian
processes. arXiv Preprint arXiv:1904.02303.
Knoblauch, J., Jewson, J., & Damoulas, T. (2019b).
Generalized variational inference. arXiv Preprint
arXiv:1904.02063.
Knoblauch, J., Jewson, J., & Damoulas, T. (2019a).
Generalized variational inference. arXiv Preprint
arXiv:1904.02063.
Knoblauch, J., Jewson, J., & Damoulas, T. (2022). An
optimization-centric view on Bayes’ rule: Reviewing
and generalizing variational inference. Journal of Machine
Learning Research, 23(132), 1–109.
Koller, D., & Friedman, N. (2009). Probabilistic
graphical models: Principles and techniques.
MIT Press.
Kolmogorov, A. N. (1933a). Sulla determinazione empirica di una
legge di distribuzione. G. Ist. Ital. Attuari,
4.
Kolmogorov, A. N. (1965). Three approaches to the quantitative
definition of information. Problems of Information
Transmission, 1, 1–7.
Kolmogorov, A. N. (1933b). Grundbegriffe der
Wahrscheinlichkeitsrechnung. Ergebnisse der Mathematik.
Kolmogorov, A. N. (1956). Foundations of the theory of
probability (2nd English ed.). Chelsea / Addison–Wesley.
Koltchinskii, V. (2001). Rademacher penalties and structural
risk minimization. IEEE Transactions on Information
Theory.
Komiyama, J., Honda, J., Kashima, H., & Nakagawa, H. (2015).
Regret lower bound and optimal algorithm in dueling bandit
problem. Proceedings of the Conference on Learning Theory
(COLT).
Kontorovich, A., & Raginsky, M. (2017). Concentration of
measure without independence: A unified approach via the
martingale method. In Convexity and concentration.
Springer, New York, NY.
Koolen, W. M., & Erven, T. van. (2015). Second-order
quantile methods for experts and combinatorial games.
Proceedings of the Conference on Learning Theory
(COLT).
Koolen, W. M., Erven, T. van, & Grünwald, P. (2014).
Learning the learning rate for prediction with expert advice.
Advances in Neural Information Processing Systems
(NeurIPS).
Krause, A., & Ong, C. S. (2011). Contextual Gaussian process bandit
optimization. Advances in Neural Information Processing
Systems (NeurIPS).
Krichevskiy, R. E. (1998). Laplace's law of succession and
universal encoding. IEEE Transactions on Information
Theory, 44(1).
Krichevsky, R. E., & Trofimov, V. K. (1981). The performance
of universal coding. IEEE Transactions on Information
Theory, IT-27, 199–207.
Krizhevsky, A., Hinton, G., et al.
(2009). Learning multiple layers of features from tiny
images. Toronto, ON, Canada.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012).
ImageNet classification with deep convolutional neural networks.
Advances in Neural Information Processing Systems,
1097–1105.
Krogh, A., & Hertz, J. (1991). A simple weight decay can
improve generalization. Advances in Neural Information
Processing Systems, 4.
Krogh, A., & Vedelsby, J. (1994). Neural network ensembles,
cross validation and active learning. Proceedings of the 7th
International Conference on Neural Information Processing
Systems, NIPS’94, 231–238.
Krupka, E. (2008). Generalization from observed to
unobserved features [PhD thesis]. The Hebrew University of
Jerusalem.
Krupka, E., & Tishby, N. (2005). Generalization in
clustering with unobserved features. Advances in Neural
Information Processing Systems (NeurIPS).
Krupka, E., & Tishby, N. (2008). Generalization from
observed to unobserved features by clustering. Journal of
Machine Learning Research, 9.
Kullback, S., & Leibler, R. (1951). On information and
sufficiency. Annals of Mathematical Statistics,
22.
Kuncheva, L. I., & Whitaker, C. J. (2003). Measures of
diversity in classifier ensembles and their relationship with
the ensemble accuracy. Machine Learning,
51(2), 181–207.
Kuo, H.-H. (2006). Gaussian measures in Banach
spaces. In Gaussian measures in Banach
spaces (pp. 1–109). Springer.
Kveton, B., Wen, Z., Ashkan, A., & Szepesvári, C. (2015).
Tight regret bounds for stochastic combinatorial semi-bandits.
Proceedings on the International Conference on Artificial
Intelligence and Statistics (AISTATS).
Lacasse, A., Laviolette, F., Marchand, M., Germain, P., &
Usunier, N. (2007). PAC-Bayes bounds
for the risk of the majority vote and the variance of the
Gibbs classifier. Advances in Neural
Information Processing Systems (NeurIPS).
Lafferty, J., McCallum, A., & Pereira, F. (2001).
Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. Proceedings of the
International Conference on Machine Learning (ICML).
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient
adaptive allocation rules. Advances in Applied
Mathematics, 6.
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017).
Simple and scalable predictive uncertainty estimation using deep
ensembles. Advances in Neural Information Processing
Systems, 6402–6413.
Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004).
Stability based validation of clustering solutions. Neural
Computation.
Langford, J. (2005). Tutorial on practical prediction theory for
classification. Journal of Machine Learning Research,
6.
Langford, J., & Seeger, M. (2001). Bounds for averaging
classifiers. Citeseer.
Langford, J., & Shawe-Taylor, J. (2002).
PAC-Bayes & margins. Advances in Neural
Information Processing Systems (NeurIPS).
Langford, J., & Zhang, T. (2007). The epoch-greedy algorithm
for contextual multi-armed bandits. Advances in Neural
Information Processing Systems (NeurIPS).
Lashkari, D., & Golland, P. (2009). Co-clustering with
generative models. MIT-CSAIL-TR-2009-054.
Lauritzen, S. L. (1992). Propagation of probabilities, means,
and variances in mixed graphical association models. Journal
of the American Statistical Association, 87(420),
1098–1108.
Lauritzen, S. L. (1996). Graphical models. Clarendon
Press (Oxford University Press).
Laviolette, F., & Marchand, M. (2005).
PAC-Bayes risk bounds for
sample-compressed Gibbs classifiers.
Proceedings of the International Conference on Machine
Learning (ICML).
Laviolette, F., & Marchand, M. (2007).
PAC-Bayes risk bounds for stochastic averages and
majority votes of sample-compressed classifiers. Journal of
Machine Learning Research, 8.
Laviolette, F., Marchand, M., & Roy, J.-F. (2011). From
PAC-Bayes bounds to quadratic programs for majority
votes. Proceedings of the International Conference on
Machine Learning (ICML).
Laviolette, F., Morvant, E., Ralaivola, L., & Roy, J.-F.
(2017). Risk upper bounds for general ensemble methods with an
application to multiclass classification.
Neurocomputing, 219.
Lawrence, N. D. (2001). Variational inference in
probabilistic models [PhD thesis]. Citeseer.
Lawrence, N. D., & Moore, A. J. (2007). Hierarchical
Gaussian process latent variable models.
Proceedings of the 24th International Conference on Machine
Learning, 481–488.
Lázaro-Gredilla, M. (2010a). Inter-domain Gaussian
processes for sparse inference using inducing features.
Advances in Neural Information Processing Systems 23,
1087–1095.
Lázaro-Gredilla, M. (2010b). Sparse spectrum
Gaussian process regression. Journal of Machine
Learning Research, 11, 1865–1881.
Le, Q., Sarlós, T., & Smola, A. (2013).
Fastfood—approximating kernel expansions in loglinear time.
International Conference on Machine Learning, 244–252.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R.
E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation
applied to handwritten zip code recognition. Neural
Computation, 1(4), 541–551.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998).
Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11), 2278–2324.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of
objects by non-negative matrix factorization. Nature,
401.
Lee, D. D., & Seung, H. S. (2001). Algorithms for
non-negative matrix factorization. Advances in Neural
Information Processing Systems (NeurIPS).
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington,
J., & Sohl-Dickstein, J. (2018). Deep neural networks as
Gaussian processes. 6th International
Conference on Learning Representations, ICLR 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
Proceedings.
Lee, J., Feng, J., Humt, M., Müller, M. G., & Triebel, R.
(2022). Trust your robots! Predictive uncertainty estimation of
neural networks with sparse Gaussian processes.
Conference on Robot Learning, 1168–1179.
Lee, S., Purushwalkam, S., Cogswell, M., Ranjan, V., Crandall,
D. J., & Batra, D. (2016). Stochastic multiple choice
learning for training diverse deep ensembles. Advances in
Neural Information Processing Systems (NeurIPS).
Leibig, C., Allken, V., Ayhan, M. S., Berens, P., & Wahl, S.
(2017). Leveraging uncertainty information from deep neural
networks for disease detection. Scientific Reports,
7, 1–14.
Letarte, G., Germain, P., Guedj, B., & Laviolette, F.
(2019). Dichotomize and generalize: PAC-Bayesian
binary activated deep neural networks. Advances in Neural
Information Processing Systems, 6869–6879.
Letham, B., Rudin, C., McCormick, T. H., & Madigan, D.
(2015). Interpretable classifiers using rules and
Bayesian analysis: Building a better stroke
prediction model. The Annals of Applied Statistics,
9(3), 1350–1371.
Lever, G., Laviolette, F., & Shawe-Taylor, J. (2010).
Distribution-dependent PAC-Bayes priors.
Proceedings of the International Conference on Algorithmic
Learning Theory (ALT).
Lever, G., Laviolette, F., & Shawe-Taylor, J. (2013).
Tighter PAC-Bayes bounds through
distribution-dependent priors. Theoretical Computer
Science, 473.
Levine, E., & Domany, E. (2001). Resampling method for
unsupervised estimation of cluster validity. Neural
Computation.
Levinson, J., Askeland, J., Becker, J.,
Dolson, J., Held, D., Kammel, S., Kolter, J. Z., Langer, D.,
Pink, O., Pratt, V., et al. (2011). Towards fully
autonomous driving: Systems and algorithms. 2011 IEEE
Intelligent Vehicles Symposium (IV), 163–168.
Li, H., & Abe, N. (1998). Word clustering and disambiguation
based on co-occurrence data. Proceedings of the 17th
International Conference on Computational Linguistics.
Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T.
(2017). Visualizing the loss landscape of neural nets. arXiv
Preprint arXiv:1712.09913.
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A
contextual-bandit approach to personalized news article
recommendation. Proceedings of the International Conference
on World Wide Web (WWW).
Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased
offline evaluation of contextual-bandit-based news article
recommendation algorithms. Proceedings of the ACM
International Conference on Web Search and Data Mining.
Li, X., & Orabona, F. (2019). On the convergence of
stochastic gradient descent with adaptive stepsizes. The
22nd International Conference on Artificial Intelligence and
Statistics, 983–992.
Li, Y., & Gal, Y. (2017). Dropout inference in
Bayesian neural networks with alpha-divergences.
International Conference on Machine Learning,
2052–2061.
Li, Y., Hernández-Lobato, J. M., & Turner, R. E. (2015).
Stochastic expectation propagation. Advances in Neural
Information Processing Systems, 28.
Li, Y., & Liu, Q. (2016). Wild variational approximations.
NIPS Workshop on Advances in Approximate
Bayesian Inference.
Li, Y., & Turner, R. E. (2016a). Rényi divergence
variational inference. Advances in Neural Information
Processing Systems, 29.
Li, Y., & Turner, R. E. (2016b). Rényi divergence
variational inference. Advances in Neural Information
Processing Systems 28, 1073–1081.
Liang, T., Poggio, T., Rakhlin, A., & Stokes, J. (2019).
Fisher-Rao metric, geometry, and complexity of neural networks.
The 22nd International Conference on Artificial Intelligence
and Statistics, 888–896.
Liao, J., & Berg, A. (2019). Sharpening
Jensen’s inequality. The American
Statistician, 73(3), 278–281.
Lifshits, M. (2012). Gaussian random
functions. Springer.
Lin, J. A., Antorán, J., Padhy, S., Janz, D., Hernández-Lobato,
J. M., & Terenin, A. (2024). Sampling from
Gaussian process posteriors using stochastic
gradient descent. Advances in Neural Information Processing
Systems, 36.
Littlestone, N., & Warmuth, M. K. (1994). The weighted
majority algorithm. Information and Computation,
108.
Littlewood, B., & Miller, D. R. (1989). Conceptual modeling
of coincident failures in multiversion software. IEEE
Transactions on Software Engineering, 15(12),
1596–1614.
Liu, J. Z., Padhy, S., Ren, J., Lin, Z., Wen, Y., Jerfel, G.,
Nado, Z., Snoek, J., Tran, D., & Lakshminarayanan, B.
(2023). A simple approach to improve single-model deep
uncertainty via distance-awareness. Journal of Machine
Learning Research, 24, 1–63.
Liu, J., Paisley, J., Kioumourtzoglou, M.-A., & Coull, B.
(2019). Accurate uncertainty estimation and decomposition in
ensemble learning. Advances in Neural Information Processing
Systems, 8950–8961.
Liu, Q., & Wang, D. (2016). Stein variational gradient
descent: A general purpose Bayesian inference
algorithm. Advances in Neural Information Processing
Systems, 2378–2386.
Liu, Y., & Yao, X. (1999). Ensemble learning via negative
correlation. Neural Networks, 12(10),
1399–1404.
Livni, R., Shalev-Shwartz, S., & Shamir, O. (2014). On the
computational efficiency of training neural networks.
Advances in Neural Information Processing Systems,
27.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE
Transactions on Information Theory, 28(2),
129–137.
London, B., Huang, B., Taskar, B., & Getoor, L. (2014).
PAC-Bayesian collective stability. Proceedings
on the International Conference on Artificial Intelligence and
Statistics (AISTATS).
Lorenzen, S. S., Igel, C., & Seldin, Y. (2019a). On
PAC-Bayesian bounds for random
forests. Machine Learning, 108(8-9),
1503–1522.
Lorenzen, S. S., Igel, C., & Seldin, Y. (2019b). On
PAC-Bayesian bounds for random forests. Machine
Learning, 108(8-9).
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay
regularization. arXiv Preprint arXiv:1711.05101.
Lu, Z., Wu, X., Zhu, X., & Bongard, J. (2010). Ensemble
pruning via individual contribution ordering. Proceedings of
the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 871–880.
Luo, H., & Schapire, R. E. (2015). Achieving all with no
parameters: AdaNormalHedge. Proceedings of the Conference on
Learning Theory (COLT).
Luxburg, U. von, & Ben-David, S. (2005). Towards a
statistical theory of clustering. PASCAL Workshop on
Statistics and Optimization of Clustering.
Lyddon, S., Walker, S., & Holmes, C. C. (2018).
Nonparametric learning from Bayesian models with
randomized objective functions. Advances in Neural
Information Processing Systems, 2071–2081.
Lykouris, T., Mirrokni, V., & Leme, R. P. (2018). Stochastic
bandits robust to adversarial corruptions. Proceedings of
the Annual ACM SIGACT Symposium on Theory of Computing.
Lyle, C., Wilk, M. van der, & Kabán, A. (2020a). The
benefits of invariance in neural networks. arXiv Preprint
arXiv:2005.00178.
Lyle, C., Wilk, M. van der, Kwiatkowska, M., Gal, Y., &
Bloem-Reddy, B. (2020b). On the benefits of invariance in neural
networks. arXiv Preprint arXiv:2005.00178.
Ma, C., & Hernández-Lobato, J. M. (2021). Functional
variational inference based on stochastic process generators.
Advances in Neural Information Processing Systems,
34, 21795–21807.
Ma, C., Li, Y., & Hernández-Lobato, J. M. (2019).
Variational implicit processes. International Conference on
Machine Learning (ICML), 4222–4233.
Ma, S., Bassily, R., & Belkin, M. (2018). The power of
interpolation: Understanding the effectiveness of SGD in modern
over-parametrized learning. International Conference on
Machine Learning, 3325–3334.
Ma, Y., & Deisenroth, M. P. (2019). A variational
Bayesian treatment of implicit processes.
Statistics and Computing, 29, 1145–1165.
MacKay, D. J. C. (1992a). A practical Bayesian
framework for backpropagation networks. Neural
Computation, 4, 448–472.
MacKay, D. J. C. (1992b). Bayesian interpolation.
Neural Computation, 4(3), 415–447. https://doi.org/10.1162/neco.1992.4.3.415
MacKay, D. J. C. (1992c). The evidence framework applied to
classification networks. Neural Computation,
4(5), 720–736.
MacKay, D. J. C. (1995). Probable networks and plausible
predictions-a review of practical Bayesian methods
for supervised neural networks. Network: Computation in
Neural Systems, 6(3), 469.
MacKay, D. J. C. (2003). Information theory, inference, and
learning algorithms. Cambridge University Press.
Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., &
Wilson, A. G. (2019). A simple baseline for
Bayesian uncertainty in deep learning. Advances
in Neural Information Processing Systems, 32,
13153–13164.
Maddox, W., Garipov, T., Izmailov, P., & Wilson, A. G.
(2020). A simple baseline for Bayesian uncertainty
in deep learning. Advances in Neural Information Processing
Systems 33, 13153–13164.
Madeira, S. C., & Oliveira, A. L. (2004). Biclustering
algorithms for biological data analysis: A survey. IEEE/ACM
Transactions on Computational Biology and Bioinformatics
(TCBB), 1(1).
Magureanu, S., Combes, R., & Proutiere, A. (2017). Minimal
exploration in structured stochastic bandits. Advances in
Neural Information Processing Systems (NeurIPS).
Maillard, O.-A. (2011). Apprentissage
séquentiel: Bandits, statistique et
renforcement [PhD thesis]. INRIA Lille.
Maillard, O.-A., Munos, R., & Stoltz, G. (2011). A
finite-time analysis of multi-armed bandits problems with
Kullback-Leibler divergences. Proceedings of
the Conference on Learning Theory (COLT).
TorchVision maintainers and contributors. (2016). TorchVision:
PyTorch’s computer vision library. GitHub
repository. https://github.com/pytorch/vision
Mandal, M. K., Panchanathan, S., & Aboulnasr, T. (1995).
Choice of wavelets for image compression. Information Theory
and Applications, 239–249. citeseer.nj.nec.com/138157.html
Mandt, S., Hoffman, M. D., & Blei, D. M. (2017). Stochastic
gradient descent as approximate Bayesian inference.
Journal of Machine Learning Research, 18(134),
1–35. https://jmlr.org/papers/v18/17-214.html
Mannor, S., & Shamir, O. (2011). From bandits to experts: On
the value of side-observations. Advances in Neural
Information Processing Systems (NeurIPS).
Mannor, S., & Tsitsiklis, J. N. (2004). The sample
complexity of exploration in the multi-armed bandit problem.
Journal of Machine Learning Research, 5.
Mansour, Y., & McAllester, D. (2000). Generalization bounds
for decision trees. Proceedings of the Conference on
Learning Theory (COLT).
Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates,
T. O., & Eisenberg, D. (1999). Detecting protein function
and protein-protein interactions from genome sequences.
Science, 285(5428), 751–753.
Martens, J. (2020). New insights and perspectives on the natural
gradient method. The Journal of Machine Learning
Research, 21, 5776–5851.
Martens, J., & Grosse, R. (2015). Optimizing neural networks
with Kronecker-factored approximate curvature. International
Conference on Machine Learning, 2408–2417.
Marton, K. (1996). A measure concentration inequality for
contracting Markov chains. Geometric and
Functional Analysis, 6(3).
Marton, K. (1997). A measure concentration inequality for
contracting Markov chains Erratum.
Geometric and Functional Analysis, 7(3).
Masegosa, A. R., Martínez, A. M., Langseth, H., Nielsen, T. D.,
Salmerón, A., Ramos-López, D., & Madsen, A. L. (2016a).
D-VMP: Distributed variational message passing.
PGM’2016. JMLR: Workshop and Conference Proceedings,
52, 321–332.
Masegosa, A. R., Martinez, A. M., & Borchani, H. (2016b).
Probabilistic graphical models on multi-core CPUs
using Java 8. IEEE Computational Intelligence
Magazine, 11(2), 41–54.
Masegosa, A. R. (2020). Learning under model misspecification:
Applications to variational and ensemble methods. Advances
in Neural Information Processing Systems.
Masegosa, A. R., Cabañas, R., Langseth, H., Nielsen, T. D.,
& Salmerón, A. (2019). Probabilistic models with deep neural
networks. arXiv Preprint arXiv:1908.03442.
Masegosa, A. R., Lorenzen, S., Igel, C., & Seldin, Y.
(2020). Second order PAC-Bayesian bounds for the
weighted majority vote. Advances in Neural Information
Processing Systems, 33, 5263–5273.
Masegosa, A. R., Martinez, A. M., Langseth, H., Nielsen, T. D.,
Salmerón, A., Ramos-López, D., & Madsen, A. L. (2017a).
Scaling up Bayesian variational inference using
distributed computing clusters. International Journal of
Approximate Reasoning, 88, 435–451.
Masegosa, A. R., Martínez, A. M., Ramos-López, D., Cabañas, R.,
Salmerón, A., Nielsen, T. D., Langseth, H., & Madsen, A. L.
(2017b). AMIDST: A Java toolbox for
scalable probabilistic machine learning. arXiv Preprint
arXiv:1704.01427.
Masegosa, A. R., Nielsen, T. D., Langseth, H., Ramos-López, D.,
Salmerón, A., & Madsen, A. L. (2017c). Bayesian
models of data streams with hierarchical power priors.
International Conference on Machine Learning,
2334–2343.
Matthews, A. G. de G., Hensman, J., Turner, R. E., &
Ghahramani, Z. (2017a). On the convergence and robustness of
sparse variational Gaussian process regression.
Advances in Neural Information Processing Systems 30,
2394–2403.
Matthews, A. G. de G., Wilk, M. van der, Nickson, T., Fujii, K.,
Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., &
Hensman, J. (2017b). GPflow: A
Gaussian process library using TensorFlow.
Journal of Machine Learning Research, 18(40),
1–6.
Maurer, A. (2004). A note on the
PAC-Bayesian theorem.
www.arxiv.org.
Maurer, A., & Pontil, M. (2009). Empirical
Bernstein bounds and sample variance penalization.
Proceedings of the Conference on Learning Theory
(COLT).
McAllester, D. (1999a). PAC-Bayesian model
averaging. Proceedings of the Conference on Learning Theory
(COLT).
McAllester, D. (1999b). Some PAC-Bayesian theorems.
Machine Learning, 37.
McAllester, D. (2003a). PAC-Bayesian stochastic
model selection. Machine Learning, 51.
McAllester, D. (2003b). Simplified PAC-Bayesian
margin bounds. Proceedings of the Conference on Learning
Theory (COLT).
McAllester, D. (2007). Generalization bounds and consistency for
structured labeling. In G. Bakir, T. Hofmann, B. Schölkopf, A.
Smola, B. Taskar, & S. V. N. Vishwanathan (Eds.),
Predicting structured data. MIT Press.
McAllester, D. A. (1998). Some PAC-Bayesian
theorems. Proceedings of the Conference on Learning Theory
(COLT).
McAllester, D. A. (1999c). PAC-Bayesian model
averaging. Proceedings of the Twelfth Annual Conference on
Computational Learning Theory, 164–170.
McAllister, R., Gal, Y., Kendall, A., Van Der Wilk, M., Shah,
A., Cipolla, R., & Weller, A. (2017). Concrete problems
for autonomous vehicle safety: Advantages of
Bayesian deep learning.
McDiarmid, C. (1989). On the method of bounded differences.
Surveys in Combinatorics, 148–188.
McInerney, J., Ranganath, R., & Blei, D. (2015). The
population posterior and Bayesian modeling on
streams. In Advances in neural information processing
systems 28 (pp. 1153–1161). Curran Associates, Inc.
McLachlan, G., & Krishnan, T. (1997). The EM algorithm
and extensions.
McMahan, H. B., & Streeter, M. (2009). Tighter bounds for
multi-armed bandits with expert advice. Proceedings of the
Conference on Learning Theory (COLT).
Medasani, S., & Krishnapuram, R. (2001). Categorization of
image databases for efficient retrieval using robust mixture
decomposition. Computer Vision and Image Understanding:
CVIU, 83(3), 216–235. citeseer.nj.nec.com/231614.html
Mei, S., Misiakiewicz, T., & Montanari, A. (2021). Learning
with invariances in random features and kernel models.
Conference on Learning Theory, 3351–3418.
Meila, M., & Jordan, M. I. (2000). Learning with mixtures of
trees. Journal of Machine Learning Research,
1, 1–48.
Mercer, J. (1909a). Functions of positive and negative type and
their connection with the theory of integral equations.
Philosophical Transactions of the Royal Society, 209, 415–446.
Mercer, J. (1909b). Functions of positive and negative type, and
their connection with the theory of integral equations.
Philosophical Transactions of the Royal Society A,
209, 415–446.
Micchelli, C. A., & Pontil, M. (2006a). On learning
vector-valued functions. Advances in Neural Information
Processing Systems 17, 961–968.
Micchelli, C. A., & Pontil, M. (2006b). Universal kernels.
Advances in Neural Information Processing Systems
(NeurIPS), 18, 653–660.
Minka, T. P. (2000). Bayesian model averaging is
not model combination. Available electronically at
http://www.stat.cmu.edu/minka/papers/bma.html, 1–2.
Minka, T. P. (2005). Divergence measures and message
passing (MSR-TR-2005-173; p. 17).
Minka, T. P. (2013). Expectation propagation for approximate
Bayesian inference. arXiv Preprint
arXiv:1301.2294.
Minka, T. P. (2001). Expectation propagation for approximate
Bayesian inference (MSR–TR–2001–41). Microsoft
Research.
Minsker, S., Srivastava, S., Lin, L., & Dunson, D. B.
(2017). Robust and scalable Bayes via a median of
subset posterior measures. Journal of Machine Learning
Research, 18(124), 1–40.
Mitchell, T. (1997). Machine learning. McGraw-Hill.
Mitzenmacher, M., & Upfal, E. (2005). Probability and
computing: Randomized algorithms and probabilistic
analysis. Cambridge University Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou,
I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari
with deep reinforcement learning. In NIPS deep learning
workshop.
Mnih, V., Szepesvári, C., & Audibert, J.-Y. (2008).
Empirical Bernstein stopping. Proceedings of
the International Conference on Machine Learning (ICML).
Mohamed, S., Rosca, M., Figurnov, M., & Mnih, A. (2019).
Monte Carlo gradient estimation in machine
learning. arXiv Preprint arXiv:1906.10652.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012).
Foundations of machine learning. MIT Press.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018).
Foundations of machine learning. MIT Press.
Mucsányi, B., Kirchhof, M., & Oh, S. J. (2024). Benchmarking
uncertainty disentanglement: Specialized uncertainties for
specialized tasks. ICML 2024 Workshop on Structured
Probabilistic Inference & Generative
Modeling.
Murphy, K. P. (2012). Machine learning: A probabilistic
perspective. MIT Press.
Nabarro, S., Ganev, S., Garriga-Alonso, A., Fortuin, V., Wilk,
M. van der, & Aitchison, L. (2022). Data augmentation in
Bayesian neural networks and the cold posterior effect.
Proceedings of the Conference on Uncertainty in Artificial
Intelligence.
Nagarajan, V. (2021). Explaining generalization in deep
learning: Progress and fundamental limits. arXiv Preprint
arXiv:2110.08922.
Nagarajan, V., & Kolter, J. Z. (2017). Generalization in
deep networks: The role of distance from initialization.
NeurIPS Workshop on Deep Learning: Bridging Theory and
Practice.
Nagarajan, V., & Kolter, J. Z. (2019a). Deterministic
PAC-Bayesian generalization bounds for
deep networks via generalizing noise-resilience.
International Conference on Learning Representations.
Nagarajan, V., & Kolter, J. Z. (2019b). Uniform convergence
may be unable to explain generalization in deep learning.
Advances in Neural Information Processing Systems,
11611–11622.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., &
Sutskever, I. (2020). Deep double descent: Where bigger models
and more data hurt. arXiv Preprint arXiv:1912.02292.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., &
Sutskever, I. (2021). Deep double descent: Where bigger models
and more data hurt. Journal of Statistical Mechanics: Theory
and Experiment, 2021(12), 124003.
Nalisnick, E., Hernández-Lobato, J. M., & Smyth, P. (2019).
Dropout as a structured shrinkage prior. International
Conference on Machine Learning, 4712–4722.
Namkoong, H., & Duchi, J. C. (2017). Variance-based
regularization with convex objectives. Advances in Neural
Information Processing Systems, 30.
Neal, R. M. (1993). Probabilistic inference using Markov
chain Monte Carlo methods (CRG–TR–93–1). Department of
Computer Science, University of Toronto.
Neal, R. M. (1996). Bayesian learning for neural
networks (Vol. 118). Springer Science & Business Media.
Neal, R. M. (2012). Bayesian learning for
neural networks (Vol. 118). Springer Science & Business
Media.
Negrea, J., Dziugaite, G. K., & Roy, D. (2020). In defense
of uniform convergence: Generalization via derandomization with
an application to interpolating predictors. International
Conference on Machine Learning, 7263–7272.
Nesterov, Y. (2003). Introductory lectures on convex
optimization: A basic course. Springer.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., &
Ng, A. Y. (2011). Reading digits in natural images with
unsupervised feature learning. NIPS Workshop on Deep
Learning and Unsupervised Feature Learning. http://ufldl.stanford.edu/housenumbers
Neu, G. (2015). Explore no more: Improved high-probability
regret bounds for non-stochastic bandits. Advances in Neural
Information Processing Systems (NeurIPS).
Neyman, J., & Pearson, E. S. (1933). On the problem of the
most efficient tests of statistical hypotheses.
Philosophical Transactions of the Royal Society of London.
Series A, Containing Papers of a Mathematical or Physical
Character, 231.
Neyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro,
N. (2017a). Exploring generalization in deep learning.
Advances in Neural Information Processing Systems,
30.
Neyshabur, B., Bhojanapalli, S., & Srebro, N. (2017b). A
PAC-Bayesian approach to
spectrally-normalized margin bounds for neural networks.
International Conference on Learning Representations.
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., &
Srebro, N. (2019). The role of over-parametrization in
generalization of neural networks. International Conference
on Learning Representations. https://openreview.net/forum?id=BygfghAcYX
Neyshabur, B., Salakhutdinov, R. R., & Srebro, N. (2015a).
Path-SGD: Path-normalized optimization in deep neural networks.
Advances in Neural Information Processing Systems,
28.
Neyshabur, B., Tomioka, R., & Srebro, N. (2015b). In search
of the real inductive bias: On the role of implicit
regularization in deep learning. arXiv Preprint
arXiv:1412.6614.
Neyshabur, B., Tomioka, R., & Srebro, N. (2015c). Norm-based
capacity control in neural networks. Proceedings of the 28th
International Conference on Learning Theory (COLT),
1376–1401.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral
clustering: Analysis and an algorithm. Advances in Neural
Information Processing Systems (NeurIPS).
Nitzan, S., & Paroush, J. (1982). Optimal decision rules in
uncertain dichotomous choice situations. International
Economic Review, 23(2).
Nocedal, J., & Wright, S. J. (2006). Numerical
optimization (2nd ed.). Springer.
Norgeot, B., Glicksberg, B. S., & Butte, A. J. (2019). A
call for deep-learning healthcare. Nature Medicine,
25(1), 14–15.
Novak, R., Sohl-Dickstein, J., & Schoenholz, S. S. (2022).
Fast finite width neural tangent kernel. International
Conference on Machine Learning, 17018–17044.
Novikov, A., & Izmailov, P. (2018). Tensor train kernel
trick. Neural Networks, 104, 1–19.
Olesen, K. G., Lauritzen, S. L., & Jensen, F. V. (1992).
AHUGIN: A system creating adaptive
causal probabilistic networks. Proceedings of the Eighth
International Conference on Uncertainty in Artificial
Intelligence, 223–229.
Opper, M., & Archambeau, C. (2009). The variational
Gaussian approximation revisited. Neural
Computation, 21(3), 786–792. https://doi.org/10.1162/neco.2008.06-08-804
Ortega, L. A., Rodriguez-Santana, S., & Hernández-Lobato, D.
(2024a). Variational linearized Laplace
approximation for Bayesian deep learning.
International Conference on Machine Learning,
38815–38836.
Ortega, L. A., Rodriguez-Santana, S., & Hernández-Lobato, D.
(2023). Deep variational implicit processes. International
Conference on Learning Representations.
Ortega, L. A., Rodríguez-Santana, S., & Hernández-Lobato, D.
(2024b). Fixed-mean Gaussian processes for post-hoc
Bayesian deep learning. arXiv Preprint
arXiv:2412.04177.
Ortner, R. (2013). Adaptive aggregation for reinforcement
learning in average reward Markov decision
processes. Annals of Operations Research.
Osawa, K., Swaroop, S., Khan, M. E. E., Jain, A., Eschenhagen,
R., Turner, R. E., & Yokota, R. (2019). Practical deep
learning with Bayesian principles. Advances in
Neural Information Processing Systems, 4289–4301.
Ottucsák, G., & György, A. (2006). The combination of
the label efficient and the multi-armed bandit problem in
adversarial setting.
http://citeseerx.ist.psu.edu/viewdoc/versions?doi=10.1.1.126.1228.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin,
S., Dillon, J., Lakshminarayanan, B., & Snoek, J. (2019).
Can you trust your model’s uncertainty? Evaluating predictive
uncertainty under dataset shift. Advances in Neural
Information Processing Systems, 13969–13980.
Ozkan, E., Smidl, V., Saha, S., Lundquist, C., & Gustafsson,
F. (2013). Marginalized adaptive particle filtering for
nonlinear models with unknown time-varying noise parameters.
Automatica, 49(6), 1566–1575.
Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O.,
Eisen, M. B., Storz, G., Botstein, D., & Brown, P. O. (2000).
Genomic expression programs in the response of yeast cells to
environmental changes. Molecular Biology of the Cell, 11(12),
4241–4257.
Pang, T., Xu, K., Du, C., Chen, N., & Zhu, J. (2019).
Improving adversarial robustness via promoting ensemble
diversity. International Conference on Machine
Learning, 4970–4979.
Paninski, L. (2003). Estimation of entropy and mutual
information. Neural Computation.
Paninski, L. (2004). Variational minimax estimation of discrete
distributions under KL loss. Advances in Neural Information
Processing Systems (NeurIPS).
Papadimitriou, S., Sun, J., & Faloutsos, C. (2005).
Streaming pattern discovery in multiple time-series.
Proceedings of the 31st International Conference on Very
Large Data Bases, 697–708.
Patel, A. B., Nguyen, M. T., & Baraniuk, R. (2016). A
probabilistic framework for deep learning. Advances in
Neural Information Processing Systems, 29.
Pearl, J. (1988). Probabilistic reasoning in intelligent
systems: Networks of plausible inference. San Mateo, CA:
Morgan Kaufmann Publishers.
Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., &
Szepesvári, C. (2020). Tighter risk certificates for neural
networks. arXiv Preprint arXiv:2007.12911.
Perrone, V., Jenkins, P. A., Spano, D., & Teh, Y. W. (2017).
Poisson random fields for dynamic feature models. Journal of
Machine Learning Research, 18(127), 1–45.
Petersen, P., & Voigtlaender, F. (2020). Equivalence of
approximation by convolutional neural networks and
fully-connected networks. Proceedings of the American
Mathematical Society, 148(4), 1567–1581.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., &
Ganguli, S. (2016). Exponential expressivity in deep neural
networks through transient chaos. Advances in Neural
Information Processing Systems, 29.
Popper, K. (1934). Logik der forschung.
Puurula, A., Read, J., & Bifet, A. (2014). Kaggle LSHTC4
winning solution. arXiv Preprint arXiv:1405.0546.
Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying
view of sparse approximate Gaussian process
regression. The Journal of Machine Learning Research,
6, 1939–1959.
Rabiner, L. R. (1989). Tutorial on hidden Markov
models and selected applications in speech recognition.
Proceedings of the IEEE, 77, 257–286.
Rabiner, L. R., & Juang, B.-H. (1986). An introduction to
hidden Markov models. IEEE ASSP Magazine,
3(1), 4–16.
Rahimi, A., & Recht, B. (2007). Random features for
large-scale kernel machines. Advances in Neural Information
Processing Systems 20, 1177–1184.
Ralaivola, L., Szafranski, M., & Stempfel, G. (2010).
Chromatic PAC-Bayes bounds for non-IID
data: Applications to ranking and stationary β-mixing processes.
Journal of Machine Learning Research.
Kontorovich, L. (A.), & Ramanan, K. (2008). Concentration inequalities for
dependent random variables via the martingale method. Annals
of Probability, 36(6).
Rasmussen, C. E. (2003). Gaussian processes in
machine learning. Summer School on Machine Learning,
63–71.
Rasmussen, C. E., & Williams, C. K. I. (2006).
Gaussian processes for machine learning.
MIT Press.
Reddi, S. J., Kale, S., & Kumar, S. (2019). On the
convergence of Adam and beyond. arXiv Preprint
arXiv:1904.09237.
Reed, M., & Simon, B. (1980). Methods of modern
mathematical physics. Vol. I: Functional analysis. Academic
Press.
Reed, R., Oh, S., Marks, R., et al.
(1992). Regularization using jittered training data.
International Joint Conference on Neural Networks,
3, 147–152.
Rényi, A. (1961). On measures of entropy and information.
Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability, Volume 1: Contributions to the
Theory of Statistics, 4, 547–562.
Rezende, D. J., & Mohamed, S. (2015). Variational inference
with normalizing flows. arXiv Preprint
arXiv:1505.05770.
Rissanen, J. (1978). Modeling by shortest data description.
Automatica, 14, 465–471.
Ritter, H., Botev, A., & Barber, D. (2018). A scalable
Laplace approximation for neural networks.
International Conference on Learning Representations,
6.
Robbins, H. (1952). Some aspects of the sequential design of
experiments. Bulletin of the American Mathematical
Society.
Roca, J. (1995). The mechanisms of DNA
topoisomerases. Trends in Biochemical Sciences, 20,
156–160.
Rockafellar, R. T. (1970). Convex analysis. Princeton
University Press. https://doi.org/10.1515/9781400873173
Rodriguez-Santana, S., & Hernández-Lobato, D. (2022).
Adversarial α-divergence minimization
for Bayesian approximate inference.
Neurocomputing, 513, 410–421. https://doi.org/10.1016/j.neucom.2022.10.052
Rodríguez Santana, S., Zaldivar, B., & Hernández-Lobato, D.
(2021). Sparse implicit processes for approximate inference.
arXiv e-Prints, arXiv–2110.
Rodríguez-Santana, S., & Hernández-Lobato, D. (2022).
Adversarial α-divergence
minimization for Bayesian approximate inference.
Neurocomputing, 471, 260–274.
Rohwer, R., & Freitag, D. (2004). Towards full automation of
lexicon construction. In D. Moldovan & R. Girju (Eds.),
HLT-NAACL 2004: Workshop on computational lexical
semantics.
Roli, F., Giacinto, G., & Vernazza, G. (2001). Methods for
designing multiple classifier systems. International
Workshop on Multiple Classifier Systems, 78–87.
Ron, D., Singer, Y., & Tishby, N. (1995). On the
learnability and usage of acyclic probabilistic finite automata.
Proc. 8th Annu. Conf. On Comput. Learning Theory,
31–40.
Ron, D., Singer, Y., & Tishby, N. (1996). The power of
amnesia: Learning probabilistic automata with variable memory
length. Machine Learning, 25, 117–149.
Rooij, S. de, Erven, T. van, Grünwald, P. D., & Koolen, W.
M. (2014). Follow the leader if you can, hedge if you must.
Journal of Machine Learning Research.
Rose, K. (1998). Deterministic annealing for clustering,
compression, classification, regression and related optimization
problems. Proceedings of the IEEE,
86(11), 2210–2239.
Ross, A., & Doshi-Velez, F. (2018). Improving the
adversarial robustness and interpretability of deep neural
networks by regularizing their input gradients. Proceedings
of the AAAI Conference on Artificial Intelligence,
32.
Rubin, D., & Stein, M. (2016). Spatially adaptive
Bayesian covariance tapering. International
Conference on Artificial Intelligence and Statistics
(AISTATS), 650–658.
Ruddigkeit, L., Van Deursen, R., Blum, L. C., & Reymond,
J.-L. (2012). Enumeration of 166 billion organic small molecules
in the chemical universe database GDB-17. Journal of
Chemical Information and Modeling, 52(11),
2864–2875.
Rudin, W. (1991). Functional analysis (2nd ed.).
McGraw–Hill.
Rui, Y., Huang, T., & Chang, S. (1999). Image retrieval:
Current techniques, promising directions and open issues.
Journal of Visual Communication and Image
Representation, 10(4), 39–62.
Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly
parametrized bandits. Mathematics of Operations
Research, 35.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma,
S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.
C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual
Recognition Challenge. International Journal of
Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
Sabato, S., & Shalev-Shwartz, S. (2007). Prediction by
categorical features: Generalization properties and application
to feature ranking. Proceedings of the Conference on
Learning Theory (COLT).
Sabato, S., & Shalev-Shwartz, S. (2008). Ranking categorical
features using generalization properties. Journal of Machine
Learning Research, 9.
Sajda, P. (2006). Machine learning for detection and diagnosis
of disease. Annu. Rev. Biomed. Eng., 8,
537–565.
Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic
matrix factorization using Markov chain Monte
Carlo. Proceedings of the International Conference on
Machine Learning (ICML).
Salimbeni, H., & Deisenroth, M. (2017a). Doubly stochastic
variational inference for deep Gaussian processes.
Advances in Neural Information Processing Systems,
30.
Salimbeni, H., & Deisenroth, M. P. (2017b). Doubly
stochastic variational inference for deep Gaussian
processes. Advances in Neural Information Processing Systems
30, 4588–4599.
Samson, P.-M. (2000). Concentration of measure inequalities for
Markov chains and Φ-mixing processes. The
Annals of Probability, 28(1).
Sannai, A., Polyanskiy, Y., & Watanabe, Y. (2019). Strong
data processing inequalities and Φ-Sobolev inequalities for
discrete channels. 2019 IEEE International Symposium on
Information Theory (ISIT), 447–451.
Santana, S. R., Zaldivar, B., & Hernández-Lobato, D. (2021).
Sparse implicit processes for approximate inference. arXiv
Preprint arXiv:2110.07618.
Sato, M. A. (2001). Online model selection based on the
variational Bayes. Neural Computation,
13(7), 1649–1681.
Saul, L. K., & Jordan, M. I. (1999). Mixed memory Markov
models: Decomposing complex stochastic processes as mixtures of
simpler ones. Machine Learning, 37(1), 75–87.
Scannell, A., Mereu, R., Chang, P., Tamir, E., Pajarinen, J.,
& Solin, A. (2024). Function-space parameterization of
neural networks for sequential learning. International
Conference on Learning Representations.
Schaeffer, S. E. (2007). Graph clustering. Computer Science
Review.
Schoenholz, S. S., Gilmer, J., Ganguli, S., &
Sohl-Dickstein, J. (2017). Deep information propagation.
International Conference on Learning Representations.
https://openreview.net/forum?id=H1W1UN9gg
Schölkopf, B., & Smola, A. (2002). Learning with
kernels. Support vector machines, regularization, optimization
and beyond. MIT Press.
Seeger, M. (2002). PAC-Bayesian
generalisation error bounds for Gaussian process classification.
Journal of Machine Learning Research, 3,
233–269.
Seeger, M. (2003a). Bayesian Gaussian process
models: PAC-Bayesian generalisation
error bounds and sparse approximations. University of
Edinburgh.
Seeger, M. (2003b). Bayesian Gaussian process
models: PAC-Bayesian generalization error bounds
and sparse approximations [PhD thesis]. University of
Edinburgh.
Seeger, M. (2003c). Fast forward selection to speed up sparse
Gaussian process regression. Proceedings of the
9th International Workshop on Artificial Intelligence and
Statistics.
Segal, E., Pe’er, D., Regev, A., Koller, D., & Friedman, N.
(2005). Learning module networks. Journal of Machine
Learning Research.
Seldin, Y. (2001). On unsupervised learning of mixtures of
Markovian sources [Master’s thesis]. The
Hebrew University of Jerusalem.
Seldin, Y. (2005a). 3D-3R:
3D Content Rating,
Ranking, and Recording (White
Paper IA-R611). NDS Technologies Ltd.
Seldin, Y. (2005b). Personalized navigation in digital
TV world. Content filtering based on
personalized ratings [Unpublished manuscript].
Seldin, Y. (2009). A PAC-Bayesian
approach to structure learning [PhD thesis]. The Hebrew
University of Jerusalem.
Seldin, Y. (2010). A PAC-Bayesian analysis of
graph clustering and pairwise clustering.
http://arxiv.org/abs/1009.0499.
Seldin, Y. (2015). The space of online learning
problems. ECML-PKDD Tutorial. https://sites.google.com/site/spaceofonlinelearningproblems/.
Seldin, Y., Auer, P., Abbasi-Yadkori, Y., & Szepesvári, C.
(2012a). Evaluation and analysis of the performance of the
EXP3 algorithm in stochastic environments.
Proceedings of the European Workshop on Reinforcement
Learning (EWRL).
Seldin, Y., Auer, P., Laviolette, F., Shawe-Taylor, J., &
Ortner, R. (2011a). PAC-Bayesian analysis of
contextual bandits. Advances in Neural Information
Processing Systems (NeurIPS).
Seldin, Y., Bartlett, P. L., & Crammer, K. (2013a).
Advice-efficient prediction with expert advice.
http://arxiv.org/abs/1304.3708.
Seldin, Y., Bartlett, P. L., Crammer, K., & Abbasi-Yadkori,
Y. (2014). Prediction with limited advice and multiarmed bandits
with paid observations. Proceedings of the International
Conference on Machine Learning (ICML).
Seldin, Y., Bejerano, G., & Tishby, N. (2001a). Unsupervised
segmentation and classification of mixtures of
Markovian sources. Proceedings of the 33rd
Symposium on the Interface of Computing Science and
Statistics.
Seldin, Y., Bejerano, G., & Tishby, N. (2001b). Unsupervised
sequence segmentation by a mixture of switching variable memory
Markov sources. Proceedings of the
International Conference on Machine Learning (ICML).
Seldin, Y., Cesa-Bianchi, N., Auer, P., Laviolette, F., &
Shawe-Taylor, J. (2012b). PAC-Bayes-Bernstein
inequality for martingales and its application to multiarmed
bandits. Journal of Machine Learning Research Workshop and
Conference Proceedings, 26.
Seldin, Y., Cesa-Bianchi, N., Laviolette, F., Auer, P.,
Shawe-Taylor, J., & Peters, J. (2011b).
PAC-Bayesian analysis of the
exploration-exploitation trade-off. Online Trading of
Exploration and Exploitation 2, ICML Workshop.
Seldin, Y., Crammer, K., & Bartlett, P. L. (2013b). Open
problem: Adversarial multiarmed bandits with limited advice.
Proceedings of the Conference on Learning Theory
(COLT).
Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J.,
& Auer, P. (2012c). PAC-Bayesian inequalities
for martingales. IEEE Transactions on Information
Theory, 58.
Seldin, Y., Laviolette, F., Shawe-Taylor, J., Peters, J., &
Auer, P. (2011c). PAC-Bayesian analysis of
martingales and multiarmed bandits.
http://arxiv.org/abs/1105.2416.
Seldin, Y., & Lugosi, G. (2016). A lower bound for
multi-armed bandits with expert advice. Proceedings of the
European Workshop on Reinforcement Learning (EWRL).
Seldin, Y., & Lugosi, G. (2017). An improved parametrization
and analysis of the EXP3++ algorithm for stochastic
and adversarial bandits. Proceedings of the Conference on
Learning Theory (COLT).
Seldin, Y., & Schölkopf, B. (2013). On the relations and
differences between Popper dimension, exclusion
dimension and VC-dimension. In B. Schölkopf, Z.
Luo, & V. Vovk (Eds.), Empirical inference: Festschrift
in honor of Vladimir N. Vapnik. Springer.
Seldin, Y., & Slivkins, A. (2014). One practical algorithm
for both stochastic and adversarial bandits. Proceedings of
the International Conference on Machine Learning (ICML).
Seldin, Y., Slonim, N., & Tishby, N. (2007). Information
bottleneck for non co-occurrence data. Advances in Neural
Information Processing Systems (NeurIPS).
Seldin, Y., Starik, S., & Werman, M. (2003). Unsupervised
clustering of images using their joint segmentation. The 3rd
International Workshop on Statistical and Computational Theories
of Vision (SCTV).
Seldin, Y., Szepesvári, C., Auer, P., & Abbasi-Yadkori, Y.
(2013c). Evaluation and analysis of the performance of the
EXP3 algorithm in stochastic environments.
Journal of Machine Learning Research Workshop and
Conference Proceedings, 24 (EWRL).
Seldin, Y., & Tishby, N. (2008). Multi-classification by
categorical features via clustering. Proceedings of the
International Conference on Machine Learning (ICML).
Seldin, Y., & Tishby, N. (2009a). PAC-Bayesian
generalization bound for density estimation with application to
co-clustering. Proceedings on the International Conference
on Artificial Intelligence and Statistics (AISTATS).
Seldin, Y., & Tishby, N. (2009b).
PAC-Bayesian generalization bound for
density estimation with application to co-clustering.
Artificial Intelligence and Statistics, 472–479.
Seldin, Y., & Tishby, N. (2010). PAC-Bayesian
analysis of co-clustering and beyond. Journal of Machine
Learning Research, 11.
Shafiei, M. M., & Milios, E. E. (2006). Model-based
overlapping co-clustering. Proceedings of the SIAM Conference on
Data Mining.
Shalaeva, V., Esfahani, A. F., Germain, P., & Petreczky, M.
(2019). Improved PAC-Bayesian bounds for linear
regression. arXiv Preprint arXiv:1912.03036.
Shalev-Shwartz, S. (2012). Online learning and online convex
optimization. Foundations and Trends in Machine
Learning, 4(2).
Shalev-Shwartz, S., & Ben-David, S. (2014).
Understanding machine learning: From theory to
algorithms. Cambridge University Press.
Shalev-Shwartz, S., & Birnbaum, A. (2012). Learning
halfspaces with the zero-one loss: Time-accuracy tradeoffs.
Advances in Neural Information Processing Systems
(NeurIPS).
Shalev-Shwartz, S., Shamir, O., & Tromer, E. (2012). Using
more data to speed-up training time. Proceedings on the
International Conference on Artificial Intelligence and
Statistics (AISTATS).
Shalev-Shwartz, S., & Srebro, N. (2008). SVM optimization:
Inverse dependence on training set size. Proceedings of the
International Conference on Machine Learning (ICML).
Shamir, O., Sabato, S., & Tishby, N. (2008). Learning and
generalization with the information bottleneck. Proceedings
of the International Symposium on AI and Mathematics
(ISAIM).
Shamir, O., & Tishby, N. (2008a). Cluster stability for
finite samples. Advances in Neural Information Processing
Systems (NeurIPS).
Shamir, O., & Tishby, N. (2008b). Model selection and
stability in k-means
clustering. Proceedings of the Conference on Learning Theory
(COLT).
Shamir, O., & Tishby, N. (2009). On the reliability of
clustering stability in the large sample regime. Advances in
Neural Information Processing Systems (NeurIPS).
Shan, H., & Banerjee, A. (2008). Bayesian co-clustering.
IEEE International Conference on Data Mining (ICDM).
Shannon, C. E. (1948). A mathematical theory of communication.
Bell System Technical Journal, 27(3), 379–423.
Shannon, C. E. (1948). A mathematical theory of communication.
Bell Sys. Tech. Journal, 27, 379–423, 623–656.
Shashua, A., Zass, R., & Hazan, T. (2006). Multi-way
clustering using super-symmetric non-negative tensor
factorization. European Conference on Computer Vision
(ECCV).
Shawe-Taylor, J., Archambeau, C., Higgs, M., & Opper, M.
(2009). PAC-Bayes analysis of
Bayesian inference.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., &
Anthony, M. (1998a). Structural risk minimization over
data-dependent hierarchies. IEEE Transactions on Information
Theory, 44(5).
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., &
Anthony, M. (1998b). Structural risk minimization over
data-dependent hierarchies. IEEE Transactions on Information
Theory, 44(5), 1926–1940.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel
methods for pattern analysis. Cambridge University Press.
Shawe-Taylor, J., Cristianini, N., et al. (2004). Kernel
methods for pattern analysis. Cambridge University Press.
Shawe-Taylor, J., & Dolia, A. (2007). A framework for
probability density estimation. Proceedings on the
International Conference on Artificial Intelligence and
Statistics (AISTATS).
Shawe-Taylor, J., & Hardoon, D. (2009). PAC-Bayes analysis
of maximum entropy classification. Proceedings on the
International Conference on Artificial Intelligence and
Statistics (AISTATS).
Shawe-Taylor, J., & Williamson, R. C. (1997). A
PAC analysis of a Bayesian estimator.
Proceedings of the Conference on Learning Theory
(COLT).
Shen, R., Bubeck, S., & Gunasekar, S. (2022). Data
augmentation as feature manipulation: A story of desert cows and
grass cows. arXiv Preprint arXiv:2203.01572.
Sheth, R., & Khardon, R. (2020).
Pseudo-Bayesian learning via direct loss
minimization with applications to sparse Gaussian
process models. Symposium on Advances in Approximate
Bayesian Inference, 1–18.
Shi, J., & Malik, J. (2000). Normalized cuts and image
segmentation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(8).
Shi, J., Sun, S., & Zhu, J. (2018). A spectral approach to
gradient estimation for implicit distributions.
International Conference on Machine Learning,
4644–4653.
Shi, T., & Zhu, J. (2014). Online Bayesian
passive-aggressive learning. Proceedings of the
International Conference on Machine Learning (ICML),
378–386.
Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image
data augmentation for deep learning. Journal of Big
Data, 6(1), 1–48.
Silva, P. R. da. (2006). An introduction to measure
theory. Springer.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
Driessche, G. van den, Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham,
J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M.,
Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016).
Mastering the game of Go with deep neural networks
and tree search. Nature, 529.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I.,
Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A.,
Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G. van
den, Graepel, T., & Hassabis, D. (2017). Mastering the game
of Go without human knowledge. Nature,
550.
Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best
practices for convolutional neural networks applied to visual
document analysis. ICDAR, 3, 958–962.
Simon, B. (2005). Trace ideals and their applications.
American Mathematical Soc.
Singer, Y. (1997). Adaptive mixtures of probabilistic
transducers. Neural Computation, 9(8), 1711–1733.
Singer, Y., & Tishby, N. (1993). Decoding cursive scripts.
Advances in Neural Information Processing Systems
(NeurIPS).
Singh, P. N. (2021). Better application of Bayesian
deep learning to diagnose disease. 2021 5th International
Conference on Computing Methodologies and Communication
(ICCMC), 928–934.
Krishnamachari, S., & Abdel-Mottaleb, M. (1999). Hierarchical
clustering algorithm for fast image retrieval. IS&T/SPIE
Conference on Storage and Retrieval for Image and Video
Databases VII, 427–435.
Slonim, N. (2002). The information bottleneck: Theory and
applications [PhD thesis]. The Hebrew University of
Jerusalem.
Slonim, N., Atwal, G. S., Tkačik, G., & Bialek, W. (2005).
Information-based clustering. Proceedings of the National
Academy of Sciences, 102(51).
Slonim, N., Fine, S., & Tishby, N. (2001, January).
Discriminative variable memory Markov model for feature
selection. Submitted to ICML 2001.
Slonim, N., Friedman, N., & Tishby, N. (2002). Unsupervised
document classification using sequential information
maximization. Proceedings of the Annual International ACM
SIGIR Conference on Research and Development in Information
Retrieval.
Slonim, N., Friedman, N., & Tishby, N. (2006). Multivariate
information bottleneck. Neural Computation,
18.
Slonim, N., & Tishby, N. (2000). Document clustering using
word clusters via the information bottleneck method.
Proceedings of the Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval.
Slonim, N., & Weiss, Y. (2002). Maximum likelihood and the
information bottleneck. Advances in Neural Information
Processing Systems (NeurIPS).
Smith, S. L., & Le, Q. V. (2017). A
Bayesian perspective on generalization and
stochastic gradient descent. https://arxiv.org/abs/1710.06451
Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix
approximation for machine learning. Proceedings of the 17th
International Conference on Machine Learning, 911–918.
Smolkin, M., & Ghosh, D. (2003). Cluster stability scores
for microarray data in cancer studies. BMC
Bioinformatics, 4, 36.
Snelson, E., & Ghahramani, Z. (2006). Sparse
Gaussian processes using pseudo-inputs.
Advances in Neural Information Processing Systems 18,
1257–1264.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical
Bayesian optimization of machine learning
algorithms. Advances in Neural Information Processing
Systems 25, 2951–2959.
Snoek, J., Ovadia, Y., Fertig, E., Lakshminarayanan, B.,
Nowozin, S., Sculley, D., Dillon, J., Ren, J., & Nado, Z.
(2019). Can you trust your model’s uncertainty?
Evaluating predictive uncertainty under dataset
shift. Advances in Neural Information Processing
Systems, 13969–13980.
Sokolic, J., Giryes, R., Sapiro, G., & Rodrigues, M. (2017).
Generalization error of invariant classifiers. Artificial
Intelligence and Statistics, 1094–1103.
Solomonoff, R. J. (1960). A preliminary report on a general
theory of inductive inference. Zator Company, Cambridge, MA.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., &
Srebro, N. (2018). The implicit bias of gradient descent on
separable data. The Journal of Machine Learning
Research, 19(1), 2822–2878.
Sprinzak, E. (2004). Studying interacting proteins by
computational approaches [PhD thesis]. The Hebrew
University of Jerusalem.
Srebro, N. (2004). Learning with matrix factorizations
[PhD thesis]. MIT.
Srebro, N., Alon, N., & Jaakkola, T. S. (2005a).
Generalization error bounds for collaborative prediction with
low-rank matrices. Advances in Neural Information Processing
Systems (NeurIPS).
Srebro, N., Rennie, J., & Jaakkola, T. (2005b). Maximum
margin matrix factorization. Advances in Neural Information
Processing Systems (NeurIPS).
Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M.
(2009). Gaussian process optimization in the
bandit setting: No regret and experimental design.
http://arxiv.org/abs/0912.3995.
Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M.
(2010). Gaussian process optimization in the bandit
setting: No regret and experimental design. Proceedings of
the International Conference on Machine Learning (ICML).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., &
Salakhutdinov, R. (2014). Dropout: A simple way to prevent
neural networks from overfitting. Journal of Machine
Learning Research.
Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C.
(2012). Man vs. Computer: Benchmarking machine learning
algorithms for traffic sign recognition. Neural
Networks, 32, 323–332.
Stein, M. L. (1999). Interpolation of spatial data: Some
theory for kriging. Springer.
Steinwart, I., & Christmann, A. (2008). Support vector
machines.
Steyvers, M., & Griffiths, T. (2006). Probabilistic topic
models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch
(Eds.), Latent semantic analysis: A road to meaning.
Lawrence Erlbaum.
Stoltz, G. (2005). Incomplete information and internal
regret in prediction of individual sequences [PhD thesis].
Université Paris-Sud.
Strang, G. (2009). Introduction to linear algebra
(4th ed.). Wellesley-Cambridge Press.
Strehl, A. L., Li, L., & Littman, M. L. (2009).
Reinforcement learning in finite MDPs:
PAC analysis. Journal of Machine Learning
Research.
Strehl, A. L., Mesterharm, C., Littman, M. L., & Hirsh, H.
(2006). Experience-efficient learning in associative bandit
problems. Proceedings of the International Conference on
Machine Learning (ICML).
Stroock, D. W. (2010). Probability theory: An analytic
view. Cambridge University Press.
Stuart, E. T., Kioussi, C., & Gruss, P. (1994). Mammalian
Pax genes. Annu. Rev. Genet., 28,
219–236.
Subramanian, V., Arya, R., & Sahai, A. (2022).
Generalization for multiclass classification with
overparameterized linear models. Advances in Neural
Information Processing Systems, 35, 23479–23494.
Sun, S., Zhang, G., Shi, J., & Grosse, R. (2019). Functional
variational Bayesian neural networks.
International Conference on Learning Representations.
Sutskever, I., & Hinton, G. E. (2008). Deep, narrow sigmoid
belief networks are universal approximators. Neural
Computation, 20(11), 2629–2636.
Sutskever, I., Salakhutdinov, R., & Tenenbaum, J. B. (2009).
Modelling relational data using Bayesian clustered
tensor factorization. Advances in Neural Information
Processing Systems (NeurIPS).
Sutton, R. S., & Barto, A. G. (1998). Reinforcement
learning: An introduction. MIT Press.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna,
Z. (2016). Rethinking the inception architecture for computer
vision. Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2818–2826.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D.,
Goodfellow, I., & Fergus, R. (2013). Intriguing properties
of neural networks. arXiv Preprint arXiv:1312.6199.
Takamura, H., & Matsumoto, Y. (2003). Co-clustering for text
categorization. Information Processing Society of Japan
Journal.
Tang, E. K., Suganthan, P. N., & Yao, X. (2006). An analysis
of diversity measures. Machine Learning,
65(1), 247–271.
Tang, L., Hanka, R., Ip, H., Cheung, K., & Lam, R. (2000).
Semantic query processing and annotation generation for
content-based retrieval of histological images. SPIE
Medical Imaging 2000, Document Recognition and Retrieval
IX.
Taskar, B., Abbeel, P., Wong, M.-F., & Koller, D. (2007).
Relational Markov networks. In L. Getoor & B. Taskar (Eds.),
Introduction to statistical relational learning. MIT
Press.
Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin
Markov networks. Advances in Neural Information
Processing Systems (NeurIPS).
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M.
(2004). Hierarchical Dirichlet processes (No. 653).
Department of Statistics, University of California, Berkeley.
Teh, Y. W., Jordan, M., Beal, M., & Blei, D. (2006).
Hierarchical Dirichlet processes. Journal of
the American Statistical Association, 101.
Thiemann, N. (2016). PAC-Bayesian ensemble
learning [Master’s thesis]. University of Copenhagen.
Thiemann, N., Igel, C., & Seldin, Y. (2016).
PAC-Bayesian aggregation without
cross-validation. http://arxiv.org/abs/1608.05610.
Thiemann, N., Igel, C., Wintenberger, O., & Seldin, Y.
(2017a). A strongly quasiconvex PAC-Bayesian bound.
Proceedings of the International Conference on Algorithmic
Learning Theory (ALT).
Thiemann, N., Igel, C., Wintenberger, O., & Seldin, Y.
(2017b). A strongly quasiconvex
PAC-Bayesian bound. International
Conference on Algorithmic Learning Theory, 466–492.
Thompson, W. R. (1933). On the likelihood that one unknown
probability exceeds another in view of the evidence of two
samples. Biometrika, 25.
Thune, T. S., & Seldin, Y. (2018). Adaptation to easy data
in prediction with limited advice. Advances in Neural
Information Processing Systems (NeurIPS).
Tipping, M. E. (2001). Sparse Bayesian learning and
the relevance vector machine. Journal of Machine Learning
Research, 1, 211–244.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic
principal component analysis. Journal of the Royal
Statistical Society: Series B (Statistical Methodology),
61(3), 611–622.
Tishby, N., Pereira, F., & Bialek, W. (1999). The
information bottleneck method. Allerton Conference on
Communication, Control and Computation.
Tishby, N., & Polani, D. (2010). Information theory of
decisions and actions. In V. Cutsuridis, A. Hussain, J. G.
Taylor, & D. Polani (Eds.), Perception-reason-action
cycle: Models, algorithms and systems. Springer.
Tishby, N., & Slonim, N. (2000). Data clustering by
Markovian relaxation and the information bottleneck method.
Advances in Neural Information Processing Systems
(NeurIPS).
Tishby, N., & Zaslavsky, N. (2015). Deep learning and the
information bottleneck principle. 2015 IEEE Information
Theory Workshop (ITW), 1–5.
Titsias, M. (2009). Variational learning of inducing variables
in sparse Gaussian processes. Artificial
Intelligence and Statistics, 567–574.
Titsias, M., & Lawrence, N. D. (2010). Bayesian
Gaussian process latent variable model.
Proceedings of the Thirteenth International Conference on
Artificial Intelligence and Statistics, 844–851.
Tolstikhin, I. O., & Seldin, Y. (2013a).
PAC-Bayes-Empirical-Bernstein
inequality. Advances in Neural Information Processing
Systems, 109–117.
Tolstikhin, I., & Seldin, Y. (2013b).
PAC-Bayes-Empirical-Bernstein inequality.
Advances in Neural Information Processing Systems
(NeurIPS).
Touchette, H. (2009). The large deviation approach to
statistical mechanics. Physics Reports,
478(1-3), 1–69.
Tran-Thanh, L., Stavrogiannis, L., Naroditskiy, V., Robu, V.,
Jennings, N. R., & Key, P. (2014). Efficient regret bounds
for online bid optimisation in budget-limited sponsored search
auctions. Proceedings of the Conference on Uncertainty in
Artificial Intelligence.
Triantafyllopoulos, K. (2009). Inference of dynamic generalized
linear models: On-line computation and appraisal.
International Statistical Review, 77(3),
430–450.
Vakhania, N. N., Tarieladze, V. I., & Chobanyan, S. A.
(1987). Probability distributions on Banach spaces.
Reidel.
Valentini, G., & Dietterich, T. G. (2003). Low bias bagged
support vector machines. Proceedings of the International
Conference on Machine Learning (ICML).
Valiant, L. G. (1984). A theory of the learnable.
Communications of the Association for Computing
Machinery, 27.
Van Neerven, J. et al. (2010). γ-radonifying operators—a
survey. The AMSI-ANU Workshop on Spectral Theory and
Harmonic Analysis, 44, 1–61.
Vapnik, V. (1992). Principles of risk minimization for learning
theory. Advances in Neural Information Processing
Systems, 831–838.
Vapnik, V. N. (1995). The nature of statistical learning
theory. Springer-Verlag New York, Inc.
Vapnik, V. N. (1998a). Statistical learning theory.
John Wiley & Sons.
Vapnik, V. N. (1998b). Statistical learning theory.
Wiley.
Vapnik, V. N. (1998c). Statistical learning theory.
Wiley.
Vapnik, V. N., & Chervonenkis, A. Ya. (2015). On the uniform
convergence of relative frequencies of events to their
probabilities. In Measures of complexity: Festschrift for
Alexey Chervonenkis (pp. 11–30). Springer.
Vapnik, V. N., & Chervonenkis, A. Ya. (1968). On the uniform
convergence of relative frequencies of events to their
probabilities. Soviet Math. Dokl., 9.
Vapnik, V. N., & Chervonenkis, A. Ya. (1971). On the uniform
convergence of relative frequencies of events to their
probabilities. Theory of Probability and Its
Applications, 16(2).
Vapnik, V. N., & Chervonenkis, A. Ya. (1974). Theory of
pattern recognition. Nauka, Moscow (in Russian).
Vapnik, V. N., & Chervonenkis, A. Ya. (1981). Necessary and
sufficient conditions for the uniform convergence of means to
their expectations. Theory of Probability and Its
Applications, 26(3), 532–553.
Varga, D., Csiszárik, A., & Zombori, Z. (2017). Gradient
regularization improves accuracy of discriminative models.
arXiv Preprint arXiv:1712.09936.
Varshney, K. R., & Alemzadeh, H. (2017). On the safety of
machine learning: Cyber-physical systems, decision sciences, and
data products. Big Data, 5(3), 246–255.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,
Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention
is all you need. Advances in Neural Information
Processing Systems, 5998–6008.
Vidal, R., Bruna, J., Giryes, R., & Soatto, S. (2020).
Mathematics of deep learning. Mexican Conference on Pattern
Recognition.
Villacampa-Calvo, C., & Hernández-Lobato, D. (2020). Alpha
divergence minimization in multi-class Gaussian
process classification. Neurocomputing, 378,
210–227.
Virmaux, A., & Scaman, K. (2018). Lipschitz regularity of
deep neural networks: Analysis and efficient estimation.
Advances in Neural Information Processing Systems,
31.
Vovk, V. (1990). Aggregating strategies. Proceedings of the
Conference on Learning Theory (COLT).
Wainwright, M. J., & Jordan, M. I.
(2008). Graphical models, exponential families, and
variational inference. Foundations and Trends®
in Machine Learning, 1(1–2), 1–305.
Walker, S. G. (2013). Bayesian inference with
misspecified models. Journal of Statistical Planning and
Inference, 143(10), 1621–1633.
Wallace, C. S., & Boulton, D. M. (1968). An information
measure for classification. The Computer Journal,
11(2), 185–195.
Wang, C., & Blei, D. M. (2018). A general method for robust
Bayesian modeling. Bayesian Analysis,
13(4), 1163–1191.
Wang, J. Z., Wiederhold, G., Firschein, O., & Wei, S. X.
(1997). Content-based image indexing and searching using
Daubechies’ wavelets. Int. J. on Digital Libraries,
1(4), 311–328. citeseer.nj.nec.com/wang98contentbased.html
Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining
actionlet ensemble for action recognition with depth cameras.
2012 IEEE Conference on Computer Vision and Pattern
Recognition, 1290–1297.
Wang, P., Domeniconi, C., & Laskey, K. B. (2009). Latent
Dirichlet Bayesian co-clustering. Proceedings
of European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases (ECML/PKDD).
Wang, Y., & Blei, D. (2019). Variational Bayes
under model misspecification. Advances in Neural Information
Processing Systems, 13357–13367.
Wang, Y., Kucukelbir, A., & Blei, D. M. (2017). Robust
probabilistic modeling with Bayesian data
reweighting. International Conference on Machine
Learning, 3646–3655.
Wang, Y., Sonthalia, R., & Hu, W. (2024).
Near-interpolators: Rapid norm growth and the trade-off between
interpolation and generalization. International Conference
on Artificial Intelligence and Statistics, 4483–4491.
Wei, C.-Y., & Luo, H. (2018). More adaptive algorithms for
adversarial bandits. Proceedings of the Conference on
Learning Theory (COLT).
Wei, Y., Sheth, R., & Khardon, R. (2020). Direct loss
minimization for sparse Gaussian processes.
arXiv Preprint arXiv:2004.03083.
Wen, Y., Tran, D., & Ba, J. (2019). BatchEnsemble: An
alternative approach to efficient ensemble and lifelong
learning. International Conference on Learning
Representations.
Wen, Y., Vicol, P., Ba, J., Tran, D., & Grosse, R. (2018).
Flipout: Efficient pseudo-independent weight perturbations on
mini-batches. International Conference on Learning
Representations. https://openreview.net/forum?id=rJNpifWAb
Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L.,
Mandt, S., Snoek, J., Salimans, T., Jenatton, R., & Nowozin,
S. (2020a). How good is the Bayes posterior in deep
neural networks really? International Conference on Machine
Learning, 10248–10259.
Wenzel, F., Snoek, J., Tran, D., & Jenatton, R. (2020b).
Hyperparameter ensembles for robustness and uncertainty
quantification. arXiv Preprint arXiv:2006.13570.
Wiatowski, T., & Bölcskei, H. (2017). A mathematical theory
of deep convolutional neural networks for feature extraction.
IEEE Transactions on Information Theory,
64(3), 1845–1866.
Willems, F. M. J. (1998). The context-tree weighting method:
extensions. IEEE Transactions on Information Theory,
792–798.
Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J.
(1994). Context weighting for general finite context sources.
IEEE Transactions on Information Theory.
Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J.
(1995). The context-tree weighting method: Basic properties.
IEEE Transactions on Information Theory,
41(3).
Williamson, S., Orbanz, P., & Ghahramani, Z. (2010a).
Dependent Indian buffet processes. Proceedings
of the Thirteenth International Conference on Artificial
Intelligence and Statistics, 924–931.
Williamson, S., Wang, C., Heller, K., & Blei, D. (2010b).
The IBP compound Dirichlet process and
its application to focused topic modeling. Proceedings of
ICML.
Wilson, A. G., & Nickisch, H. (2015). Kernel interpolation
for scalable structured Gaussian processes
(KISS-GP). International Conference on Machine Learning
(ICML), 1775–1784.
Wilson, A. G. (2020). The case for Bayesian deep
learning. arXiv Preprint arXiv:2001.10995.
Wilson, A. G., & Izmailov, P. (2020). Bayesian
deep learning and a probabilistic perspective of generalization.
arXiv Preprint arXiv:2002.08791.
Winn, J. M., & Bishop, C. M. (2005). Variational message
passing. Journal of Machine Learning Research,
6, 661–694.
Wintenberger, O. (2017). Optimal learning with
Bernstein online aggregation. Machine
Learning, 106.
Witten, D. M., & Tibshirani, R. (2009).
Covariance-regularized regression and classification for high
dimensional problems. Journal of the Royal Statistical
Society: Series B (Statistical Methodology),
71(3), 615–636.
Wood, J., & Shawe-Taylor, J. (1996). Representation theory
and invariant neural networks. Discrete Applied
Mathematics, 69(1-2), 33–60.
Wu, H., & Liu, X. (2016). Double Thompson sampling for
dueling bandits. Advances in Neural Information Processing
Systems (NeurIPS).
Xiao, H., Rasul, K., & Vollgraf, R. (2017).
Fashion-MNIST: A novel image dataset for
benchmarking machine learning algorithms. arXiv Preprint
arXiv:1708.07747.
Xie, C., Ye, H., Chen, F., Liu, Y., Sun, R., & Li, Z.
(2020). Risk variance penalization. arXiv Preprint
arXiv:2006.07544.
Xu, A., & Raginsky, M. (2017). Information-theoretic
analysis of generalization capability of learning algorithms.
Advances in Neural Information Processing Systems,
30.
Xu, R., & Wunsch, D., II. (2005). Survey of clustering algorithms.
IEEE Transactions on Neural Networks, 16(3).
Yakowitz, S. J., & Spragins, J. D. (1968). On the
identifiability of finite mixtures. Annals of Mathematical
Statistics, 39, 209–214.
Yang, J., Sun, S., & Roy, D. M. (2019). Fast-rate
PAC-Bayes generalization bounds via
shifted Rademacher processes. Advances in Neural Information
Processing Systems, 10802–10812.
Yao, L., Mimno, D., & McCallum, A. (2009). Efficient methods
for topic model inference on streaming document collections.
Proceedings of the 15th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 937–946.
Ykhlef, H., & Bouchaffra, D. (2017). An efficient ensemble
pruning approach based on simple coalitional games.
Information Fusion, 34, 28–42.
Yom-Tov, E., & Slonim, N. (2009). Parallel pairwise
clustering. SIAM International Conference on Data Mining
(SDM).
Yona, G. (1999). Methods for global organization of all
known protein sequences [PhD thesis]. The Hebrew University
of Jerusalem.
Yoo, J., & Choi, S. (2009a). Probabilistic matrix
tri-factorization. Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing
(ICASSP).
Yoo, J., & Choi, S. (2009b). Weighted nonnegative matrix
co-tri-factorization for collaborative prediction.
Proceedings of the Asian Conference on Machine Learning
(ACML).
Yu, H., Chen, Y., Low, B. K. H., Jaillet, P., & Dai, Z.
(2019). Implicit posterior variational inference for deep
Gaussian processes. Advances in Neural
Information Processing Systems, 32, 14475–14486.
Yu, Y., Li, Y.-F., & Zhou, Z.-H. (2011). Diversity
regularized machine. Twenty-Second International Joint
Conference on Artificial Intelligence.
Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2012).
The K-armed dueling
bandits problem. Journal of Computer and System
Sciences, 78.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and
understanding convolutional networks. European Conference on
Computer Vision, 818–833.
Zhang, C., Butepage, J., Kjellstrom, H., & Mandt, S. (2018).
Advances in variational inference. IEEE Transactions on
Pattern Analysis and Machine Intelligence.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O.
(2017). Understanding deep learning requires rethinking
generalization. Proceedings of the International Conference
on Learning Representations (ICLR).
Zhang, R., Li, C., Zhang, J., Chen, C., & Wilson, A. G.
(2019). Cyclical stochastic gradient MCMC for
Bayesian deep learning. International
Conference on Learning Representations.
Zhang, T. et al. (2006). From
epsilon-entropy to KL-entropy: Analysis of minimum information
complexity density estimation. The Annals of
Statistics, 34(5), 2180–2210.
Zhang, T. (2006). Information-theoretic upper and lower bounds
for statistical estimation. IEEE Transactions on Information
Theory, 52(4), 1307–1321.
Zheng, Y., Li, Q., Chen, Y., Xie, X., & Ma, W.-Y. (2008).
Understanding mobility based on GPS data.
Proceedings of the 10th International Conference on
Ubiquitous Computing, UbiComp ’08, 312–321. https://doi.org/10.1145/1409635.1409677
Zheng, Y., Xie, X., & Ma, W.-Y. (2010). GeoLife:
A collaborative social networking service among user,
location and trajectory. IEEE Data Eng. Bull.,
33(2), 32–39.
Zheng, Y., Zhang, L., Xie, X., & Ma, W.-Y. (2009). Mining
interesting locations and travel sequences from GPS
trajectories. Proceedings of the 18th International
Conference on World Wide Web, WWW ’09, 791–800. https://doi.org/10.1145/1526709.1526816
Zhou, W., Veitch, V., Austern, M., Adams, R. P., & Orbanz,
P. (2019). Non-vacuous generalization bounds at the ImageNet
scale: A PAC-Bayesian compression
approach. International Conference on Learning
Representations. https://openreview.net/forum?id=BJgqqsAct7
Zhou, X., Xie, L., Zhang, P., & Zhang, Y. (2014). An
ensemble of deep neural networks for object tracking. 2014
IEEE International Conference on Image Processing (ICIP),
843–847.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and
algorithms. CRC Press.
Zhou, Z.-H., & Li, N. (2010). Multi-information ensemble
diversity. International Workshop on Multiple Classifier
Systems, 134–144.
Zhu, H., & Rohwer, R. (1995a). Information geometric
measurements of generalisation.
Zhu, M. (2015). Use of majority votes in statistical learning.
WIREs Computational Statistics, 7.
Zhu, S., & Rohwer, R. (1995b). Information geometry and
prior construction. Entropy, 1, 3–22.
Zhu, S., An, B., & Huang, F. (2021a). Understanding the
generalization benefit of model invariance from a data
perspective. Advances in Neural Information Processing
Systems, 34, 4328–4341.
Zhu, Z. A., Liu, Y., Li, Y., Li, M., Lin, W., Hong, M., &
Jordan, M. I. (2021b). A geometric perspective on the
transferability of adversarial directions. Advances in
Neural Information Processing Systems, 34.
Zimmert, J., Luo, H., & Wei, C.-Y. (2019). Beating
stochastic and adversarial semi-bandits optimally and
simultaneously. Proceedings of the International Conference
on Machine Learning (ICML).
Zimmert, J., & Seldin, Y. (2019). An optimal algorithm for
stochastic and adversarial bandits. Proceedings on the
International Conference on Artificial Intelligence and
Statistics (AISTATS).
Zoghi, M., Karnin, Z., Whiteson, S., & Rijke, M. de. (2015).
Copeland dueling bandits. Advances in Neural Information
Processing Systems (NeurIPS).
Zoghi, M., Whiteson, S., Munos, R., & Rijke, M. de. (2014).
Relative upper confidence bound for the K-armed dueling bandit
problem. Proceedings of the International Conference on
Machine Learning (ICML).
Zolghadr, N., Bartók, G., Greiner, R., György, A., &
Szepesvári, C. (2013). Online learning with costly features and
labels. Advances in Neural Information Processing Systems
(NeurIPS).
Zou, D., Wu, J., Braverman, V., Gu, Q., Foster, D. P., &
Kakade, S. (2021). The benefits of implicit regularization from
SGD in least squares problems. Advances in Neural
Information Processing Systems.