The Pitfalls of Simplicity Bias in Neural Networks
- URL: http://arxiv.org/abs/2006.07710v2
- Date: Wed, 28 Oct 2020 09:33:46 GMT
- Title: The Pitfalls of Simplicity Bias in Neural Networks
- Authors: Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain,
Praneeth Netrapalli
- Abstract summary: Simplicity Bias (SB) is the tendency of standard training procedures to find simple models.
We propose datasets that incorporate a precise notion of simplicity, comprise multiple predictive features, and capture the non-robustness of neural networks trained on real data.
- Score: 39.29115335858375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several works have proposed Simplicity Bias (SB)---the tendency of standard
training procedures such as Stochastic Gradient Descent (SGD) to find simple
models---to justify why neural networks generalize well [Arpit et al. 2017,
Nakkiran et al. 2019, Soudry et al. 2018]. However, the precise notion of
simplicity remains vague. Furthermore, previous settings that use SB to
theoretically justify why neural networks generalize well do not simultaneously
capture the non-robustness of neural networks---a widely observed phenomenon in
practice [Goodfellow et al. 2014, Jo and Bengio 2017]. We attempt to reconcile
SB and the superior standard generalization of neural networks with the
non-robustness observed in practice by designing datasets that (a) incorporate
a precise notion of simplicity, (b) comprise multiple predictive features with
varying levels of simplicity, and (c) capture the non-robustness of neural
networks trained on real data. Through theory and empirics on these datasets,
we make four observations: (i) SB of SGD and variants can be extreme: neural
networks can exclusively rely on the simplest feature and remain invariant to
all predictive complex features. (ii) The extreme aspect of SB could explain
why seemingly benign distribution shifts and small adversarial perturbations
significantly degrade model performance. (iii) Contrary to conventional wisdom,
SB can also hurt generalization on the same data distribution, as SB persists
even when the simplest feature has less predictive power than the more complex
features. (iv) Common approaches to improve generalization and
robustness---ensembles and adversarial training---can fail in mitigating SB and
its pitfalls. Given the role of SB in training neural networks, we hope that
the proposed datasets and methods serve as an effective testbed to evaluate
novel algorithmic approaches aimed at avoiding the pitfalls of SB.
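As a rough, hypothetical illustration of the kind of testbed the abstract describes, the sketch below builds a synthetic dataset with one linearly separable "simple" coordinate and an XOR-style "complex" feature that is equally predictive, trains a small MLP with SGD, and then randomizes each feature block at test time to see which one the network actually relies on. This is a minimal sketch in the spirit of the paper's datasets, not their exact LMS or MNIST-CIFAR constructions; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n):
    y = torch.randint(0, 2, (n,))
    s = (2 * y - 1).float()
    x_simple = s.unsqueeze(1) + 0.1 * torch.randn(n, 1)          # linearly separable coordinate
    a = torch.randn(n, 1)
    b = a.sign() * s.unsqueeze(1) * torch.randn(n, 1).abs()      # sign(a*b) also encodes the label
    return torch.cat([x_simple, a, b], dim=1), y

x_tr, y_tr = make_data(20000)
x_te, y_te = make_data(5000)

model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(20):
    order = torch.randperm(len(x_tr))
    for i in range(0, len(x_tr), 256):
        idx = order[i:i + 256]
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_tr[idx]), y_tr[idx]).backward()
        opt.step()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

# Randomize one feature block at a time by permuting it across test examples;
# accuracy should collapse only for the feature the network actually uses.
perm = torch.randperm(len(x_te))
x_no_simple = x_te.clone()
x_no_simple[:, 0] = x_te[perm, 0]
x_no_complex = x_te.clone()
x_no_complex[:, 1:] = x_te[perm, 1:]

print("clean accuracy:            ", accuracy(x_te, y_te))
print("simple feature randomized: ", accuracy(x_no_simple, y_te))
print("complex feature randomized:", accuracy(x_no_complex, y_te))
```

Under extreme simplicity bias, accuracy would collapse to chance when the simple coordinate is randomized and stay essentially unchanged when the complex block is.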
Related papers
- Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize [5.642322814965062]
Learning representations that generalize under distribution shifts is critical for building robust machine learning models.
We show that even allowing a student network to explicitly fit the representations obtained from a teacher network that generalizes out-of-distribution is insufficient for the student itself to generalize.
arXiv Detail & Related papers (2024-06-05T15:04:27Z)
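A hypothetical sketch of the distillation protocol this summary refers to: a student network is trained to regress the penultimate-layer representations of a frozen teacher, after which the teacher's classifier head is reused on the student's features. The random stand-in data, layer sizes, and names (`teacher_body`, `student_body`) are assumptions made for illustration; whether such fitting transfers out-of-distribution generalization is precisely what the paper examines.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_rep = 20, 16

teacher_body = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_rep))
teacher_head = nn.Linear(d_rep, 2)
for p in list(teacher_body.parameters()) + list(teacher_head.parameters()):
    p.requires_grad_(False)          # assume the teacher is already trained and frozen

student_body = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_rep))
opt = torch.optim.Adam(student_body.parameters(), lr=1e-3)

x = torch.randn(512, d_in)           # stand-in for training inputs
for _ in range(200):
    opt.zero_grad()
    # the student explicitly fits the teacher's representation
    loss = nn.functional.mse_loss(student_body(x), teacher_body(x))
    loss.backward()
    opt.step()

# evaluate the student features with the teacher's frozen classifier head
with torch.no_grad():
    logits = teacher_head(student_body(x))
print(logits.shape, float(loss))
```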
- Deep Neural Networks Tend To Extrapolate Predictably [51.303814412294514]
It is often assumed that neural network predictions are unpredictable and overconfident when faced with out-of-distribution (OOD) inputs.
In contrast, we observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD.
We show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.
arXiv Detail & Related papers (2023-10-02T03:25:32Z)
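A hypothetical probe of the claim above: if predictions drift toward the optimal constant solution (for cross-entropy, the training-label marginal) as inputs move off-distribution, the average per-input distance between the softmax output and that marginal should shrink with increasing OOD-ness. The toy model and noise-scaling scheme below are illustrative assumptions and need not reproduce the paper's effect; the sketch only shows how one might measure it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n = 4000
y = torch.randint(0, 2, (n,))
x = torch.randn(n, 2) + 3.0 * (2 * y - 1).float().unsqueeze(1)   # two Gaussian blobs

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()

ocs = torch.bincount(y, minlength=2).float() / n   # optimal constant prediction (label marginal)
for scale in [1, 3, 10, 30]:
    ood = scale * torch.randn(2000, 2)             # increasingly far from the training blobs
    with torch.no_grad():
        p = torch.softmax(model(ood), dim=1)
    dist = (p - ocs).abs().sum(dim=1).mean()       # mean L1 distance of predictions to the OCS
    print(f"noise scale {scale:3d}: mean distance to OCS = {dist:.3f}")
```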
- Utility-Probability Duality of Neural Networks [4.871730595406078]
We propose an alternative utility-based explanation of the standard supervised learning procedure in deep learning.
The basic idea is to interpret the learned neural network not as a probability model but as an ordinal utility function.
We show that, for all neural networks with softmax outputs, the SGD learning dynamics of maximum likelihood estimation can be seen as an iterative process.
arXiv Detail & Related papers (2023-05-24T08:09:07Z)
- Generalization and Estimation Error Bounds for Model-based Neural Networks [78.88759757988761]
We show that model-based networks for sparse recovery generalize better than regular ReLU networks.
We derive practical design rules that make it possible to construct model-based networks with guaranteed high generalization.
arXiv Detail & Related papers (2023-04-19T16:39:44Z)
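For context, "model-based networks for sparse recovery" in this line of work usually means unrolled iterative algorithms; the sketch below is a generic LISTA-style network whose layers mimic ISTA steps for y = A x + noise. Treat it as an assumed stand-in for the model class, not the exact architecture analyzed in the paper.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Unrolled ISTA: each layer is a linear step followed by soft-thresholding."""
    def __init__(self, m, n, n_layers=8):
        super().__init__()
        self.W_y = nn.Linear(m, n, bias=False)                    # acts on the measurements y
        self.W_x = nn.Linear(n, n, bias=False)                    # acts on the previous estimate
        self.theta = nn.Parameter(torch.full((n_layers,), 0.1))   # learnable thresholds
        self.n_layers = n_layers

    def forward(self, y):
        x = torch.zeros(y.shape[0], self.W_x.in_features, device=y.device)
        for t in range(self.n_layers):
            z = self.W_y(y) + self.W_x(x)
            # soft-thresholding (shrinkage) with a per-layer threshold
            x = torch.sign(z) * torch.clamp(z.abs() - self.theta[t], min=0.0)
        return x

# Usage: recover 50-dimensional, roughly 10%-sparse vectors from 20 noisy measurements.
m, n = 20, 50
A = torch.randn(m, n) / m ** 0.5
x_true = torch.randn(64, n) * (torch.rand(64, n) < 0.1).float()
y = x_true @ A.T + 0.01 * torch.randn(64, m)
x_hat = LISTA(m, n)(y)
print(x_hat.shape)  # torch.Size([64, 50])
```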
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, and only exploit higher-order statistics later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
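A hypothetical way to probe the claim above: compare a partially trained classifier's accuracy on real data against "Gaussian clone" data that matches only each class's empirical mean and covariance. If early training exploits mainly first- and second-order statistics, the two accuracies should track each other early on. The synthetic two-lobe classes below are an illustrative assumption, not the paper's image-classification setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 20, 4000

def sample_class(sign, n):
    # deliberately non-Gaussian class: a two-lobe mixture along the first coordinate
    lobe = 1.0 + 2.0 * (torch.rand(n) < 0.5).float()        # 1.0 or 3.0
    x = torch.randn(n, d)
    x[:, 0] = sign * lobe + 0.3 * x[:, 0]
    return x

x0, x1 = sample_class(-1.0, n), sample_class(+1.0, n)
x = torch.cat([x0, x1])
y = torch.cat([torch.zeros(n), torch.ones(n)]).long()

def gaussian_clone(xc, m):
    # samples matching only the class's empirical mean and covariance
    mean, cov = xc.mean(0), torch.cov(xc.T) + 1e-4 * torch.eye(d)
    return torch.distributions.MultivariateNormal(mean, cov).sample((m,))

x_cl = torch.cat([gaussian_clone(x0, 1000), gaussian_clone(x1, 1000)])
y_cl = torch.cat([torch.zeros(1000), torch.ones(1000)]).long()

model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
for step in range(1, 401):
    idx = torch.randint(0, len(x), (128,))
    opt.zero_grad()
    nn.functional.cross_entropy(model(x[idx]), y[idx]).backward()
    opt.step()
    if step % 100 == 0:
        with torch.no_grad():
            acc_real = (model(x).argmax(1) == y).float().mean()
            acc_cl = (model(x_cl).argmax(1) == y_cl).float().mean()
        print(f"step {step}: accuracy on real data {acc_real:.2f}, on Gaussian clones {acc_cl:.2f}")
```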
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- Gated Information Bottleneck for Generalization in Sequential Environments [13.795129636387623]
Deep neural networks suffer from poor generalization to unseen environments when the underlying data distribution is different from that in the training set.
We propose a new neural network-based information bottleneck (IB) approach, termed the gated information bottleneck (GIB).
We empirically demonstrate the superiority of GIB over other popular neural network-based IB approaches in adversarial robustness and out-of-distribution detection.
arXiv Detail & Related papers (2021-10-12T14:58:38Z)
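Since the summary does not spell out GIB's mechanism, the sketch below shows the generic neural-network IB template such methods build on, in the style of a variational information bottleneck: a stochastic encoder, a classifier on the bottleneck variable, and a KL compression penalty weighted by beta. This is explicitly not the gated IB of the paper, just assumed background for what a "neural network-based IB approach" looks like.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    def __init__(self, d_in, d_z, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * d_z))    # mean and log-var of q(z|x)
        self.classifier = nn.Linear(d_z, n_classes)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.classifier(z), mu, logvar

def vib_loss(logits, y, mu, logvar, beta=1e-3):
    ce = F.cross_entropy(logits, y)                              # prediction term
    # KL( q(z|x) || N(0, I) ): the compression term of the bottleneck
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
    return ce + beta * kl

# Usage on dummy data
model = VIBClassifier(d_in=20, d_z=8, n_classes=2)
x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
logits, mu, logvar = model(x)
loss = vib_loss(logits, y, mu, logvar)
loss.backward()
print(float(loss))
```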
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by characterizing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
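For background on the setting, the sketch below runs the generic magnitude-pruning pipeline: compute per-layer masks that keep the largest-magnitude weights, zero the rest, and retrain while re-applying the masks. The sparsity level, architecture, and random data are illustrative assumptions; the paper's contribution is the theoretical analysis of such pruned networks, not this recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 100), nn.ReLU(), nn.Linear(100, 2))

def magnitude_masks(model, sparsity=0.8):
    # keep the largest-magnitude (1 - sparsity) fraction of weights in each weight matrix
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                                  # skip biases
            continue
        k = int(sparsity * p.numel())
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()
    return masks

masks = magnitude_masks(model, sparsity=0.8)

def apply_masks():
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

apply_masks()                                            # prune once
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()
    apply_masks()                                        # keep pruned weights at zero while retraining

kept = sum(int(m.sum()) for m in masks.values())
total = sum(m.numel() for m in masks.values())
print(f"surviving weights: {kept}/{total}")
```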
- Path classification by stochastic linear recurrent neural networks [2.5499055723658097]
We show that these RNNs retain a partial signature of the paths they are fed as the sole information exploited for training and classification.
We argue that these RNNs are easy to train and robust, and we back these observations with numerical experiments on both synthetic and real data.
arXiv Detail & Related papers (2021-08-06T12:59:12Z)
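For readers unfamiliar with path signatures, the sketch below computes the depth-2 signature of a piecewise-linear path from scratch: level 1 is the total increment, and level 2 collects the iterated integrals S^{ij} = ∫ (X^i_t - X^i_0) dX^j_t, whose antisymmetric part is the signed (Lévy) area. This is the standard definition, included as background for the "partial signature" the RNNs are said to retain; the implementation here is an illustrative assumption, not code from the paper.

```python
import torch

def signature_depth2(path):
    """path: tensor of shape (T, d) giving the points of a piecewise-linear path."""
    x0 = path[0]
    deltas = path[1:] - path[:-1]                 # segment increments, shape (T-1, d)
    level1 = path[-1] - x0                        # total increment, shape (d,)
    prefix = path[:-1] - x0                       # X_{t_{k-1}} - X_0 for each segment
    # level-2 iterated integrals, accumulated segment by segment
    level2 = (prefix.unsqueeze(2) * deltas.unsqueeze(1)).sum(0) \
             + 0.5 * (deltas.unsqueeze(2) * deltas.unsqueeze(1)).sum(0)
    return level1, level2

# Usage: signature of a short 2-D staircase path
path = torch.tensor([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
s1, s2 = signature_depth2(path)
print("level 1:", s1)           # total displacement
print("level 2:\n", s2)         # off-diagonal entries encode the signed area terms
```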
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.