Empirical Study on Optimizer Selection for Out-of-Distribution
Generalization
- URL: http://arxiv.org/abs/2211.08583v3
- Date: Mon, 5 Jun 2023 22:23:52 GMT
- Title: Empirical Study on Optimizer Selection for Out-of-Distribution
Generalization
- Authors: Hiroki Naganuma, Kartik Ahuja, Shiro Takagi, Tetsuya Motokawa, Rio
Yokota, Kohta Ishikawa, Ikuro Sato, Ioannis Mitliagkas
- Abstract summary: Modern deep learning systems do not generalize well when the test data distribution is slightly different to the training data distribution.
In this study, we examine the performance of popular first-order optimizers under different classes of distributional shift.
- Score: 16.386766049451317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern deep learning systems do not generalize well when the test data
distribution is slightly different to the training data distribution. While
much promising work has been accomplished to address this fragility, a
systematic study of the role of optimizers and their out-of-distribution
generalization performance has not been undertaken. In this study, we examine
the performance of popular first-order optimizers for different classes of
distributional shift under empirical risk minimization and invariant risk
minimization. We address this question for image and text classification using
DomainBed, WILDS, and Backgrounds Challenge as testbeds for studying different
types of shifts -- namely correlation and diversity shift. We search over a
wide range of hyperparameters and examine classification accuracy
(in-distribution and out-of-distribution) for over 20,000 models. We arrive at
the following findings, which we expect to be helpful for practitioners: i)
adaptive optimizers (e.g., Adam) achieve worse out-of-distribution performance
than non-adaptive optimizers (e.g., SGD, momentum SGD). In particular, even
though there is no significant difference in in-distribution performance, we
observe a measurable gap in out-of-distribution performance. ii)
in-distribution performance and out-of-distribution performance exhibit three
types of behavior depending on the dataset -- linear returns, increasing
returns, and diminishing returns. For example, when training on natural
language data with Adam, further improving in-distribution performance does
not significantly improve out-of-distribution generalization performance.
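
To make the protocol concrete, below is a minimal sketch, on synthetic data, of the kind of comparison the study performs: the same model is trained under plain ERM with several first-order optimizers (SGD, momentum SGD, Adam) and evaluated on an in-distribution (ID) and an out-of-distribution (OOD) split. Everything here (the synthetic loaders, the tiny MLP, the learning rates) is a placeholder rather than the paper's setup; the actual experiments sweep a wide hyperparameter range over DomainBed, WILDS, and the Backgrounds Challenge, and the IRM variant adds an invariance penalty that is not shown.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_loader(n, shift=0.0):
    # Synthetic stand-in for a benchmark split: the label depends on the first
    # feature; `shift` perturbs the remaining features to mimic covariate shift.
    x = torch.randn(n, 32)
    y = (x[:, 0] > 0).long()
    x[:, 1:] += shift
    return DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

train_loader = make_loader(2048)
id_loader = make_loader(512)
ood_loader = make_loader(512, shift=1.5)

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

def train_erm(model, loader, optimizer, epochs=5):
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = sum((model(x).argmax(dim=1) == y).sum().item() for x, y in loader)
    return correct / len(loader.dataset)

# Placeholder learning rates; the paper searches over a wide hyperparameter range.
optimizers = {
    "sgd":          lambda p: torch.optim.SGD(p, lr=0.05),
    "momentum_sgd": lambda p: torch.optim.SGD(p, lr=0.05, momentum=0.9),
    "adam":         lambda p: torch.optim.Adam(p, lr=1e-3),
}
for name, make_opt in optimizers.items():
    model = make_model()
    train_erm(model, train_loader, make_opt(model.parameters()))
    print(f"{name}: ID acc={accuracy(model, id_loader):.3f}, "
          f"OOD acc={accuracy(model, ood_loader):.3f}")
```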
Related papers
- On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training [72.8087629914444] (2023-05-20)
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance is obtained by balancing intra- and inter-class diversity.
- Modeling Uncertain Feature Representation for Domain Generalization [49.129544670700525] (2023-01-16)
We show that our method consistently improves the network generalization ability on multiple vision tasks.
Our method is simple yet effective and can be readily integrated into networks without additional trainable parameters or loss constraints.
- An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation [91.62129090006745] (2022-05-25)
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation.
We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
- Predicting with Confidence on Unseen Distributions [90.68414180153897] (2021-07-07)
We connect domain adaptation and predictive uncertainty literature to predict model accuracy on challenging unseen distributions.
We find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts.
We specifically investigate the distinction between synthetic and natural distribution shifts and observe that, despite its simplicity, DoC consistently outperforms other quantifications of distributional difference (a minimal sketch of DoC appears after this list).
- Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance [70.31427277842239] (2020-05-01)
We introduce a novel debiasing method called confidence regularization.
It discourages models from exploiting biases while enabling them to receive enough incentive to learn from all the training examples.
We evaluate our method on three NLU tasks and show that, in contrast to its predecessors, it improves the performance on out-of-distribution datasets.
- On the Benefits of Invariance in Neural Networks [56.362579457990094] (2020-05-01)
We show that training with data augmentation leads to better estimates of risk and thereof gradients, and we provide a PAC-Bayes generalization bound for models trained with data augmentation.
We also show that compared to data augmentation, feature averaging reduces generalization error when used with convex losses, and tightens PAC-Bayes bounds (a small feature-averaging sketch appears after this list).
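
The "Predicting with Confidence on Unseen Distributions" entry above relies on the difference of confidences (DoC). A minimal sketch of the idea, assuming a trained classifier and two data loaders (an ID validation split and a shifted split; the names are hypothetical): the drop in average top-1 softmax confidence from the ID set to the shifted set serves as an estimate of the accuracy drop under the shift. The exact estimator and its variants are described in the cited paper; this is only an illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_confidence(model, loader):
    # Average top-1 softmax confidence of the classifier over a data loader.
    model.eval()
    confs = []
    for x, _ in loader:
        probs = F.softmax(model(x), dim=1)
        confs.append(probs.max(dim=1).values)
    return torch.cat(confs).mean().item()

# Usage, given a trained `model` and two loaders (names assumed, e.g. from a
# setup like the sketch after the abstract):
#   doc = mean_confidence(model, id_val_loader) - mean_confidence(model, shifted_loader)
#   estimated accuracy on the shifted set ≈ accuracy on the ID validation set - doc
```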
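
The last entry's feature-averaging claim can also be illustrated. For a loss that is convex in the network output (cross-entropy in the logits, for instance), Jensen's inequality gives loss(average of per-view outputs) <= average of per-view losses, which is the intuition behind the reduced generalization error and tighter PAC-Bayes bounds. The sketch below averages the network's outputs over augmented views, one simple instantiation of feature averaging; the augmentations and model are placeholders, not the cited paper's setup.

```python
import torch
import torch.nn as nn

# Placeholder augmentations on image tensors of shape (B, C, H, W).
augmentations = [
    lambda x: torch.flip(x, dims=[-1]),         # horizontal flip
    lambda x: x + 0.01 * torch.randn_like(x),   # small pixel noise
]

def averaged_logits(model, x):
    # Average the network's outputs over the original input and its augmented views.
    views = [x] + [aug(x) for aug in augmentations]
    return torch.stack([model(v) for v in views]).mean(dim=0)

def jensen_comparison(model, x, y):
    # Compare the loss of the averaged output with the average per-view loss.
    loss_fn = nn.CrossEntropyLoss()
    views = [x] + [aug(x) for aug in augmentations]
    per_view_losses = torch.stack([loss_fn(model(v), y) for v in views])
    loss_of_average = loss_fn(averaged_logits(model, x), y)
    # Cross-entropy is convex in the logits, so by Jensen's inequality
    # loss_of_average <= per_view_losses.mean().
    return loss_of_average.item(), per_view_losses.mean().item()
```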