【Abstract】 High-quality estimates of uncertainty and robustness are crucial for many real-world applications, especially for deep learning, which underlies many deployed ML systems. The ability to compare techniques that improve these estimates is therefore very important for both research and practice. Yet competitive comparisons of methods are often lacking for a range of reasons, including: the compute needed for extensive tuning, incorporating sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimenting with new methods or applications. We additionally provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. https://github.com/google/uncertainty-baselines

【Source】 Nado, Z. et al. (2021) Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning, arXiv e-prints. Available at: https://ui.adsabs.harvard.edu/abs/2021arXiv210604015N (Accessed: 8 April 2022).

1 Introduction

Baselines on standardized benchmarks are crucial to machine learning research, as a means of measuring whether new ideas yield meaningful progress. However, reproducing results from previous works can be extremely challenging, especially from the paper text alone (Sinha et al., 2020; D'Amour et al., 2020). Access to the experiment code is more useful, assuming it is well documented and maintained. But even this is not enough. In fact, in retrospective analyses across collections of works, authors frequently find that simpler baselines perform best in practice, due to flawed experiment protocols or insufficient tuning (Melis et al., 2017; Kurach et al., 2019; Bello et al., 2021; Nado et al., 2021).

Papers provide experimental artifacts to widely varying degrees. A popular approach is a GitHub dump of the code used to run the experiments, albeit lacking documentation and tests. At best, a paper may offer an actively maintained repository with examples, model checkpoints, and ample documentation for extending the work. However, an individual paper can only go so far: without community standards, each paper's codebase differs in its experiment protocols and code organization, making it difficult to compare across papers within a common benchmark, let alone build jointly on top of multiple papers. To address these challenges, we created the Uncertainty Baselines library. It provides high-quality implementations of baselines on many uncertainty and out-of-distribution robustness tasks. Each baseline is designed to be self-contained (i.e., minimal dependencies) and easily extensible. Beyond the raw code, we provide numerous artifacts so that others can adapt any baseline to their workflow.

Related work. OpenAI Baselines (Dhariwal et al., 2017) is an effort in a similar spirit for reinforcement learning. Prior work on uncertainty and robustness benchmarks includes Riquelme et al. (2018); Filos et al. (2019); Hendrycks and Dietterich (2019); Ovadia et al. (2019); Dusenberry et al. (2020b). Each of these introduced a new task and evaluated a variety of baselines on that task. In practice, they are unmaintained and focus on the experimental insights rather than the codebase as the contribution. Our work provides a broad set of benchmarks (in some cases unifying the ones above) with a larger set of baselines on those benchmarks, and focuses on designing extensible, forkable, and well-tested code.

2 Uncertainty Baselines

Uncertainty Baselines sets up each benchmark as a choice of base model, training dataset, and a suite of evaluation metrics.

(1) Base models (architectures) include Wide ResNet 28-10 (Zagoruyko and Komodakis, 2016), ResNet-50 (He et al., 2016), BERT (Devlin et al., 2018), and simple MLPs.

(2) Training datasets include standard machine learning datasets — CIFAR (Krizhevsky et al., a, b), ImageNet (Russakovsky et al., 2015), and UCI (Dua and Graff, 2017) — as well as more real-world problems — CLINC Intent Detection (Larson et al., 2019), Kaggle's Diabetic Retinopathy Detection (Filos et al., 2019), and Wikipedia Toxicity (Wulczyn et al., 2017). These span modalities such as tabular, text, and images.

(3) Evaluation includes predictive metrics such as accuracy, uncertainty metrics such as selective prediction and calibration error, compute metrics such as inference latency, and performance on both in-distribution and out-of-distribution datasets (a sketch of one common calibration metric follows this list).
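As a concrete illustration of one of these uncertainty metrics, below is a minimal NumPy sketch of the standard binned expected calibration error (ECE). It is the generic textbook formulation, not the implementation used by the library or by Robustness Metrics.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """Binned ECE: bin-size-weighted average of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not np.any(in_bin):
            continue
        bin_accuracy = correct[in_bin].mean()
        bin_confidence = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(bin_accuracy - bin_confidence)
    return ece

# Example: confidences are max predicted probabilities, correct is 0/1 per example.
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1]))
```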

Figure01

Figure 1: The structure of an experiment under either a TensorFlow or PyTorch backend. The dataset (Cifar10Dataset or DiabeticRetinopathyDataset) and model (wide_resnet or resnet50_torch) are instantiated within an end-to-end training script. After training, saved model checkpoints are fed into Robustness Metrics for evaluation.

At the time of writing, we provide a total of 83 baselines, comprising 19 methods that encompass standard and more recent strategies, across 9 benchmarks.

Modularity. To make it as easy as possible for researchers to experiment on top of the baselines (specifically, to fork them), we designed the baselines to be as modular as possible and to have minimal non-standard dependencies. API-wise, Uncertainty Baselines provides almost no abstractions: datasets are light wrappers around TensorFlow Datasets (TFDS team), models are Keras models, and the training/test logic is written in raw TensorFlow (Abadi et al., 2015). This makes it easier for new users to run individual examples, or to incorporate our datasets and/or models into their own libraries. For out-of-distribution evaluation, we plug trained models into Robustness Metrics (Djolonga et al., 2020). Figure 1 illustrates how the modules fit together.
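To make the composition concrete, here is a minimal sketch of the same flow using plain TFDS and Keras. The repository's own Cifar10Dataset and wide_resnet builders are light wrappers around equivalents of these calls; the tiny model below is only a stand-in, not an actual baseline architecture.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Dataset: a thin tf.data pipeline over TFDS.
train_ds = tfds.load('cifar10', split='train', as_supervised=True)
train_ds = (train_ds
            .map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
            .shuffle(10_000)
            .batch(128)
            .prefetch(tf.data.AUTOTUNE))

# Model: an ordinary Keras model (placeholder architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(0.1, momentum=0.9, nesterov=True),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# Train, then save a checkpoint that can be handed to Robustness Metrics.
model.fit(train_ds, epochs=1)
model.save_weights('/tmp/cifar10_checkpoint')
```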

Frameworks. Uncertainty Baselines is framework-agnostic. The dataset and metric modules are NumPy-compatible and interoperate, in a performant manner, with modern deep learning frameworks including TensorFlow, JAX, and PyTorch. For example, our baselines on the JFT-300M dataset use raw JAX, and we include a PyTorch Monte Carlo Dropout baseline on the Diabetic Retinopathy dataset. In practice, to facilitate code and performance comparisons, we pick a single backend for each benchmark and develop all of its baselines under that backend (most commonly TensorFlow). Our JAX and PyTorch baselines demonstrate that implementations in other frameworks are supported and straightforward.
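As a sketch of this framework-agnostic idea, the snippet below iterates a TFDS pipeline as NumPy batches and feeds them to a PyTorch training step. The toy linear model and hyperparameters are illustrative only, not the library's PyTorch baseline.

```python
import tensorflow_datasets as tfds
import torch

ds = tfds.load('cifar10', split='train', as_supervised=True).batch(128)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

# tfds.as_numpy yields plain NumPy batches, which PyTorch consumes directly.
for images, labels in tfds.as_numpy(ds):
    x = torch.from_numpy(images).float() / 255.0
    y = torch.from_numpy(labels).long()
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```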

Hardware. All baselines run on CPUs, GPUs, or Google Cloud TPUs. Baselines are optimized for a default hardware configuration and typically assume a given memory budget and number of chips (e.g., 1 GPU, or a TPUv2-32) in order to reproduce results. We apply up-to-date coding practices to fully utilize the accelerator chips (Figure 2), so that researchers can build on highly performant baselines.
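For reference, a typical Cloud TPU setup in TensorFlow looks like the sketch below. It assumes a TPU is actually attached to the job, and the one-layer Keras model is a placeholder rather than any of the baselines.

```python
import tensorflow as tf

# Connect to the attached Cloud TPU and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Any model/optimizer built inside the strategy scope is replicated across cores.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```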

Hyperparameters. For a given baseline, the hyperparameters and other experiment configuration values can easily number in the dozens. Uncertainty Baselines uses standard Python flags to specify hyperparameters, with defaults set to reproduce the best performance. Flags are simple, require no additional framework, and are easy to plug into other pipelines or extend. We also document the protocol for properly tuning and evaluating the baselines — a common source of discrepancies across papers.
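The flag pattern is the standard absl one; the sketch below shows the style, with flag names and default values that are illustrative rather than those of any specific baseline script.

```python
from absl import app, flags

# Illustrative flag definitions in the style used by the training scripts.
flags.DEFINE_float('base_learning_rate', 0.1, 'Initial learning rate.')
flags.DEFINE_integer('train_epochs', 200, 'Number of training epochs.')
flags.DEFINE_integer('seed', 0, 'Random seed for the run.')
FLAGS = flags.FLAGS

def main(_):
    # A real script would build the dataset, model, and training loop here.
    print('lr =', FLAGS.base_learning_rate,
          'epochs =', FLAGS.train_epochs,
          'seed =', FLAGS.seed)

if __name__ == '__main__':
    app.run(main)
```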

Reproducibility. All modules include tests, and all results are reported over multiple seeds. Computing metrics on trained models can be quite expensive, let alone training from scratch. We therefore also provide TensorBoard dashboards that include all training, tuning, and evaluation metrics. An example can be found here.

Figure02

Figure 2: Performance profiling of the MIMO baseline on a TPUv3-32, using the TensorFlow Profiler. The runtime is optimized to be bound only by model operations, the irreducible bottleneck for a given baseline. Our implementations achieve 100% utilization of the TPU devices.

3 Results

To give an example of what Uncertainty Baselines has to offer, we showcase the baselines available for one of the 9 tasks: ImageNet. Figure 3 displays the accuracy and calibration error of 8 baselines, evaluated both in-distribution and out-of-distribution. Figure 4 provides an example of applying such baselines to a downstream task. Overall, these results demonstrate only a fraction of the repository's capabilities. We are excited to see new research already building on top of the baselines.

Figure03

Figure 3: Accuracy and calibration error of 8 ImageNet baselines, evaluated both in-distribution and out-of-distribution.

Figure04

Figure 4: ImageNet baselines applied to deferred prediction. In this task, predictions are deferred according to the model's confidence (left) or a desired data retention rate (right).
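The deferred-prediction evaluation in Figure 4 can be summarized by a small sketch: sort predictions by confidence and measure accuracy on the retained fraction. This is a generic reconstruction of the idea, not the repository's exact implementation.

```python
import numpy as np

def deferred_prediction_curve(confidences, correct, retain_fractions):
    """Accuracy on the most-confident fraction of predictions, per retention rate."""
    order = np.argsort(-np.asarray(confidences))  # most confident first
    correct = np.asarray(correct, dtype=np.float64)[order]
    accuracies = []
    for frac in retain_fractions:
        k = max(1, int(frac * len(correct)))
        accuracies.append(correct[:k].mean())
    return accuracies

# Example: accuracy when retaining 25%, 50%, and 100% of predictions.
curve = deferred_prediction_curve(
    confidences=[0.9, 0.4, 0.7, 0.95],
    correct=[1, 0, 1, 1],
    retain_fractions=[0.25, 0.5, 1.0])
print(curve)
```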

References

  • [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
  • [2] Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.
  • [3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015.
  • [4] Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, and Jean-Philippe Vert. On mixup regularization. arXiv preprint arXiv:2006.06049, 2020.
  • [5] Mark Collier, Basil Mustafa, Efi Kokiopoulou, Rodolphe Jenatton, and Jesse Berent. Correlated input-dependent label noise in large-scale image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [6] Jigsaw Conversation AI. Toxic comment classification challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, 2017.
  • [7] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [9] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
  • [10] Josip Djolonga, Frances Hubis, Matthias Minderer, Zack Nado, Jeremy Nixon, Rob Romijnders, Dustin Tran, and Mario Lucic. Robustness Metrics, 2020. URL https://github.com/google-research/robustness_metrics.
  • [11] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • [12] Michael Dusenberry, Ghassen Jerfel, Yeming Wen, Yian Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, and Dustin Tran. Efficient and scalable bayesian neural nets with rank-1 factors. In International conference on machine learning, pages 2782–2792. PMLR, 2020a.
  • [13] Michael Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M Dai. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 204–213, 2020b.
  • [14] Sebastian Farquhar, Michael A Osborne, and Yarin Gal. Radial bayesian neural networks: Beyond discrete support in large-scale bayesian deep learning. In International Conference on Artificial Intelligence and Statistics, pages 1352–1362. PMLR, 2020.
  • [15] Angelos Filos, Sebastian Farquhar, Aidan N Gomez, Tim GJ Rudner, Zachary Kenton, Lewis Smith, Milad Alizadeh, Arnoud de Kroon, and Yarin Gal. A systematic comparison of bayesian deep learning robustness in diabetic retinopathy tasks. arXiv preprint arXiv:1912.10481, 2019.
  • [16] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
  • [17] Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M Dai, and Dustin Tran. Training independent subnetworks for robust prediction. arXiv preprint arXiv:2010.06610, 2020.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [19] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
  • [20] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [21] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). a. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • [22] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-100 (canadian institute for advanced research). b. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • [23] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. A large-scale study on regularization and normalization in GANs. In International Conference on Machine Learning, pages 3581–3590. PMLR, 2019.
  • [24] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.
  • [25] Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027, 2019.
  • [26] Yann LeCun and Corinna Cortes. MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.
  • [27] Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. arXiv preprint arXiv:2006.10108, 2020.
  • [28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [29] Gabor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
  • [30] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip Torr, and Puneet Dokania. The intriguing effects of focal loss on the calibration of deep neural networks. 2019.
  • [31] Zachary Nado, Justin M Gilmer, Christopher J Shallue, Rohan Anil, and George E Dahl. A large batch optimizer reality check: Traditional, generic optimizers suffice across batch sizes. arXiv preprint arXiv:2102.06356, 2021.
  • [32] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua V Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.
  • [33] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
  • [34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [35] Koustuv Sinha, Joelle Pineau, Jessica Forde, Rosemary Nan Ke, and Hugo Larochelle. Neurips 2019 reproducibility challenge. ReScience C, 6(2):11, 2020.
  • [36] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
  • [37] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [38] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  • [39] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715, 2020.
  • [40] Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. arXiv preprint arXiv:2006.13570, 2020.
  • [41] Ellery Wulczyn, Nithum Thain, and Lucas Dixon. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 1391–1399, Republic and Canton of Geneva, CHE, 2017. International World Wide Web Conferences Steering Committee. ISBN 9781450349130. doi: 10.1145/3038912.3052591. URL https://doi.org/10.1145/3038912.3052591.
  • [42] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Appendix A. Dataset Details

For CIFAR10 and CIFAR100, we padded the images with 4 pixels of 0's before taking a random 32x32 crop, followed by a left-right flip with 50% probability. For ImageNet, we used ResNet preprocessing as described in He et al. (2016), but also support the common Inception preprocessing from Szegedy et al. (2015). All preprocessing is deterministic given a random seed, using tf.random.experimental.stateless_split and tf.random.experimental.stateless_fold_in. For the Diabetic Retinopathy benchmarks we used the Kaggle competition dataset as in Filos et al. (2019).
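The sketch below expresses the CIFAR augmentation above with deterministic, stateless TensorFlow ops; the per-example seed derivation mirrors the stateless_split / stateless_fold_in pattern rather than reproducing the exact repository code.

```python
import tensorflow as tf

def augment(image, base_seed, example_index):
    # Derive a deterministic per-example seed, then split it for the two random ops.
    seed = tf.random.experimental.stateless_fold_in(base_seed, example_index)
    crop_seed, flip_seed = tf.unstack(
        tf.random.experimental.stateless_split(seed, num=2))
    image = tf.pad(image, [[4, 4], [4, 4], [0, 0]])  # pad with 4 pixels of zeros
    image = tf.image.stateless_random_crop(image, size=(32, 32, 3), seed=crop_seed)
    image = tf.image.stateless_random_flip_left_right(image, seed=flip_seed)
    return image

base_seed = tf.constant([0, 42], dtype=tf.int64)
image = tf.zeros((32, 32, 3), dtype=tf.float32)
augmented = augment(image, base_seed, example_index=7)
```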

Appendix B. Model Details

For CIFAR10 and CIFAR100 we provide methods based on the Wide ResNet models, typically the Wide ResNet-28 size (Zagoruyko and Komodakis, 2016). For ImageNet and the Diabetic Retinopathy benchmarks, we provide methods based on the ResNet-50 model (He et al., 2016). For ImageNet we additionally use methods based on the EfficientNet models (Tan and Le, 2019). For the Toxic Comments and CLINC Intent Detection benchmarks, our methods are based on the BERT-Base model (Devlin et al., 2018).

Appendix C. Hyperparameter Tuning

All image benchmarks were trained with Nesterov momentum (Sutskever et al., 2013), except for the EfficientNet models, which use RMSProp with ρ = 0.9 and ε = 10−3. The text benchmarks were trained with the AdamW optimizer (Loshchilov and Hutter, 2017) with β2 = 0.999 and ε = 10−6. Unless otherwise noted, the image benchmarks used a linear warmup followed by a stepwise decay schedule, except for the EfficientNet models, which used a linear warmup followed by an exponential decay. The text benchmarks used a linear warmup followed by a linear decay.
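A sketch of these optimizer and schedule choices in Keras is below. The learning rates, warmup lengths, and decay boundaries are placeholders, not the tuned defaults; AdamW for the text benchmarks is available in recent TensorFlow versions or via TensorFlow Addons and is omitted here.

```python
import tensorflow as tf

class WarmupThenStepDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup followed by stepwise decay at fixed step boundaries."""

    def __init__(self, base_lr, warmup_steps, boundaries, decay_factor=0.1):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.boundaries = boundaries
        self.decay_factor = decay_factor

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.base_lr * step / float(self.warmup_steps)
        decayed_lr = self.base_lr
        for boundary in self.boundaries:
            decayed_lr = tf.where(step >= boundary,
                                  decayed_lr * self.decay_factor, decayed_lr)
        return tf.where(step < self.warmup_steps, warmup_lr, decayed_lr)

schedule = WarmupThenStepDecay(base_lr=0.1, warmup_steps=500,
                               boundaries=[20_000, 30_000])
# Nesterov momentum for most image benchmarks.
sgd = tf.keras.optimizers.SGD(schedule, momentum=0.9, nesterov=True)
# RMSProp settings for the EfficientNet models.
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=1e-2, rho=0.9, epsilon=1e-3)
```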

For the CIFAR10, CIFAR100, ImageNet, Toxic Comments, and CLINC Intent Detection benchmarks, the papers for each method contain their tuning details.

Diabetic Retinopathy benchmark tuning details. For the Diabetic Retinopathy benchmark, we also provide our tuning results so that others can more easily retune their own methods. We conducted two rounds of quasi-random search over several hyperparameters (learning rate, momentum, dropout, variational posteriors, L2 regularization), where the first round used a heuristically chosen, larger search space and the second round used a hand-tuned, smaller range around the better-performing values. Each round consisted of 50 trials, and the final hyperparameters were selected using the final validation AUC from the second tuning round. We then retrained this best hyperparameter setting on the combined train and validation sets.
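One round of quasi-random search can be sketched with SciPy's QMC samplers as below; the search space shown is purely illustrative, not the ranges used for this benchmark, and the training call is a stand-in for the actual pipeline.

```python
import numpy as np
from scipy.stats import qmc  # quasi-Monte Carlo samplers, SciPy >= 1.7

sampler = qmc.Halton(d=3, seed=0)
unit_samples = sampler.random(n=50)  # one round of 50 quasi-random trials

# Columns: log10(learning rate), momentum, dropout rate -- illustrative ranges.
lower = np.array([-3.0, 0.5, 0.0])
upper = np.array([-0.5, 0.99, 0.5])
trials = qmc.scale(unit_samples, lower, upper)

for log_lr, momentum, dropout in trials:
    config = {'learning_rate': 10.0 ** log_lr,
              'momentum': momentum,
              'dropout': dropout}
    print(config)  # replace with a call into the training/evaluation pipeline
```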

Appendix D. Supported Baselines

Table 1: The methods currently implemented for each dataset, in addition to the deterministic baseline. See the repository for a more up-to-date list.

Table01

Appendix E. Open-Source Data

The tuning and final metrics data for the Diabetic Retinopathy benchmarks can be found at the following URLs: