Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in
Neural Networks
- URL: http://arxiv.org/abs/2210.01360v1
- Date: Tue, 4 Oct 2022 04:01:15 GMT
- Title: Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in
Neural Networks
- Authors: Sravanti Addepalli, Anshul Nasery, R. Venkatesh Babu, Praneeth
Netrapalli, Prateek Jain
- Abstract summary: Recent work has found that diverse/complex features are indeed learned by the backbone, and that brittleness arises because the linear classification head relies primarily on the simplest features.
We propose Feature Reconstruction Regularizer (FRR) to ensure that the learned features can be reconstructed back from the logits.
We demonstrate up to 15% gains in OOD accuracy on the recently introduced semi-synthetic datasets with extreme distribution shifts.
- Score: 66.76034024335833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Neural Networks are known to be brittle to even minor distribution
shifts compared to the training distribution. While one line of work has
demonstrated that Simplicity Bias (SB) of DNNs - bias towards learning only the
simplest features - is a key reason for this brittleness, another recent line
of work has surprisingly found that diverse/complex features are indeed
learned by the backbone, and their brittleness is due to the linear
classification head relying primarily on the simplest features. To bridge the
gap between these two lines of work, we first hypothesize and verify that while
SB may not altogether preclude learning complex features, it amplifies simpler
features over complex ones. Namely, simple features are replicated several
times in the learned representations while complex features might not be
replicated. This phenomenon, we term Feature Replication Hypothesis, coupled
with the Implicit Bias of SGD to converge to maximum margin solutions in the
feature space, leads the models to rely mostly on the simple features for
classification. To mitigate this bias, we propose Feature Reconstruction
Regularizer (FRR) to ensure that the learned features can be reconstructed back
from the logits. The use of FRR in linear layer training (FRR-L)
encourages the use of more diverse features for classification. We further
propose to finetune the full network by freezing the weights of the linear
layer trained using FRR-L, to refine the learned features, making them more
suitable for classification. Using this simple solution, we demonstrate up to
15% gains in OOD accuracy on the recently introduced semi-synthetic datasets
with extreme distribution shifts. Moreover, we demonstrate noteworthy gains
over existing SOTA methods on the standard OOD benchmark DomainBed as well.
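The paper's code is not reproduced here, but a minimal PyTorch sketch of what such a logit-to-feature reconstruction regularizer could look like follows (the module names and the weighting `lam` are illustrative assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRRHead(nn.Module):
    """Linear classifier plus a map reconstructing features from logits."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)   # logits from backbone features
        self.reconstruct = nn.Linear(num_classes, feat_dim)  # features back from logits

    def forward(self, feats: torch.Tensor):
        logits = self.classifier(feats)
        recon = self.reconstruct(logits)
        return logits, recon

def frr_loss(logits, recon, feats, labels, lam=1.0):
    # Cross-entropy plus a feature-reconstruction penalty: the head is
    # discouraged from collapsing feature directions that cannot be
    # recovered from the logits, so more diverse features stay in use.
    return F.cross_entropy(logits, labels) + lam * F.mse_loss(recon, feats)
```

For the second stage described in the abstract, one would freeze `classifier` after FRR-L training (e.g. `for p in head.classifier.parameters(): p.requires_grad = False`) and fine-tune only the backbone against this frozen head.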
Related papers
- Simplicity Bias via Global Convergence of Sharpness Minimization [43.658859631741024]
We show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks.
We also find a novel property of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss.
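Stated as an implicit-regularization objective, the claimed result amounts roughly to the following, written here as a hedged paraphrase rather than the paper's exact theorem:

```latex
% Among interpolating (zero-loss) solutions, label-noise SGD is claimed to
% select the one minimizing sharpness as measured by the trace of the Hessian.
\min_{\theta} \; \operatorname{tr}\!\left(\nabla^2 L(\theta)\right)
\quad \text{subject to} \quad L(\theta) = 0 .
```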
arXiv Detail & Related papers (2024-10-21T18:10:37Z)
- Simplicity Bias of Two-Layer Networks beyond Linearly Separable Data [4.14360329494344]
We characterize simplicity bias for general datasets in the context of two-layer neural networks with small weights and trained with gradient flow.
For datasets with an XOR-like pattern, we precisely identify the learned features and demonstrate that simplicity bias intensifies during later training stages.
These results indicate that features learned in the middle stages of training may be more useful for OOD transfer.
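A toy illustration of this setting (an assumed setup, not the paper's analysis): train a small two-layer network on XOR-like data from small initialization and snapshot the hidden-layer features mid-training versus at convergence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy XOR-like dataset: the label is the product of the signs of the coordinates.
X = torch.randn(512, 2)
y = ((X[:, 0] * X[:, 1]) > 0).long()

# Two-layer network with small initial weights; small step size to
# approximate gradient flow (both illustrative choices).
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
with torch.no_grad():
    for p in net.parameters():
        p.mul_(0.1)  # small-initialization regime

opt = torch.optim.SGD(net.parameters(), lr=0.01)
snapshots = {}
for step in range(20001):
    loss = nn.functional.cross_entropy(net(X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step in (2000, 20000):  # mid-training vs late-training features
        snapshots[step] = net[0].weight.detach().clone()

# Comparing the hidden-layer weights at the two checkpoints is one way to
# observe how the learned features change (and simplify) late in training.
```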
arXiv Detail & Related papers (2024-05-27T16:00:45Z) - Neural Redshift: Random Networks are not Random Functions [28.357640341268745]
We show that NNs do not have an inherent "simplicity bias".
Alternative architectures can be built with a bias for any level of complexity.
It points to promising avenues for controlling the solutions implemented by trained models.
arXiv Detail & Related papers (2024-03-04T17:33:20Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later in training.
We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of universality in learning.
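One way to probe this distributional simplicity bias (a sketch with assumed details, not the paper's exact protocol) is to test a partially trained network on a Gaussian clone of the data that matches only its mean and covariance:

```python
import torch

def gaussian_clone(X: torch.Tensor, n: int) -> torch.Tensor:
    # Sample from a Gaussian matching only the first two moments of X,
    # i.e. a surrogate preserving just the lower-order statistics.
    mu = X.mean(dim=0)
    Xc = X - mu
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    L = torch.linalg.cholesky(cov + 1e-4 * torch.eye(X.shape[1]))
    return torch.randn(n, X.shape[1]) @ L.T + mu

# If a network early in training gives similar predictions on X and on
# gaussian_clone(X, n), it is (so far) relying only on low-order statistics.
```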
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Federated Latent Class Regression for Hierarchical Data [5.110894308882439]
Federated Learning (FL) allows a number of agents to participate in training a global machine learning model without disclosing locally stored data.
We propose a novel probabilistic model, Hierarchical Latent Class Regression (HLCR), and its extension to Federated Learning, FEDHLCR.
Our inference algorithm, being derived from Bayesian theory, provides strong convergence guarantees and good robustness to overfitting.
Experimental results show that FEDHLCR offers fast convergence even in non-IID datasets.
arXiv Detail & Related papers (2022-06-22T00:33:04Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
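One concrete way to instantiate "a diverse set of models" (the penalty form below is an illustrative assumption) is to penalize alignment between the models' input gradients, pushing each model toward different predictive features:

```python
import torch
import torch.nn.functional as F

def input_gradient(model, x, y):
    # Gradient of the per-example loss w.r.t. the input: a proxy for
    # which input features the model relies on.
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (g,) = torch.autograd.grad(loss, x, create_graph=True)
    return g.flatten(1)

def diversity_penalty(models, x, y):
    # Penalize pairwise cosine alignment of input gradients across models,
    # encouraging each model to use different features.
    grads = [input_gradient(m, x, y) for m in models]
    pen = 0.0
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            pen = pen + F.cosine_similarity(grads[i], grads[j], dim=1).pow(2).mean()
    return pen
```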
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
- Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z)
- Embedding Propagation: Smoother Manifold for Few-Shot Classification [131.81692677836202]
We propose to use embedding propagation as an unsupervised non-parametric regularizer for manifold smoothing in few-shot classification.
We empirically show that embedding propagation yields a smoother embedding manifold.
We show that embedding propagation consistently improves the accuracy of the models in multiple semi-supervised learning scenarios by up to 16 percentage points.
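The propagation step itself is a single linear smoothing operation; a minimal sketch follows, assuming an RBF similarity graph and a mixing coefficient alpha (both illustrative choices):

```python
import torch

def embedding_propagation(z: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # Label-propagation-style smoothing of a batch of embeddings z (n, d):
    # build a similarity graph, normalize it, and propagate each embedding
    # toward its neighbors, yielding a smoother embedding manifold.
    d2 = torch.cdist(z, z).pow(2)
    A = torch.exp(-d2 / d2.mean())          # assumed RBF similarity
    A.fill_diagonal_(0)
    deg = A.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.clamp_min(1e-8).rsqrt())
    S = D_inv_sqrt @ A @ D_inv_sqrt         # symmetric normalization
    n = z.shape[0]
    P = torch.linalg.inv(torch.eye(n) - alpha * S)
    return P @ z
```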
arXiv Detail & Related papers (2020-03-09T13:51:09Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
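The kernel-vs-rich distinction is commonly phrased through an initialization scale; a rough paraphrase of that standard setup:

```latex
% The regime is controlled by the scale \alpha of the initialization:
w(0) = \alpha\, w_0, \qquad
\begin{cases}
\alpha \to \infty, & \text{``kernel'' regime: implicit bias is an RKHS (NTK) norm,}\\
\alpha \to 0, & \text{``rich'' regime: implicit bias (e.g. } \ell_1\text{-like) is not an RKHS norm.}
\end{cases}
```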
arXiv Detail & Related papers (2020-02-20T15:43:02Z)