Whitening and second order optimization both make information in the
dataset unusable during training, and can reduce or prevent generalization
- URL: http://arxiv.org/abs/2008.07545v4
- Date: Mon, 19 Jul 2021 07:00:41 GMT
- Authors: Neha S. Wadia, Daniel Duckworth, Samuel S. Schoenholz, Ethan Dyer and
Jascha Sohl-Dickstein
- Abstract summary: We show that both data whitening and second order optimization can harm or entirely prevent generalization.
For a general class of models, namely models with a fully connected first layer, we prove that the information contained in the sample-sample second moment matrix is the only information that can be used to generalize.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning is predicated on the concept of generalization: a model
achieving low error on a sufficiently large training set should also perform
well on novel samples from the same distribution. We show that both data
whitening and second order optimization can harm or entirely prevent
generalization. In general, model training harnesses information contained in
the sample-sample second moment matrix of a dataset. For a general class of
models, namely models with a fully connected first layer, we prove that the
information contained in this matrix is the only information which can be used
to generalize. Models trained using whitened data, or with certain second order
optimization schemes, have less access to this information, resulting in
reduced or nonexistent generalization ability. We experimentally verify these
predictions for several architectures, and further demonstrate that
generalization continues to be harmed even when theoretical requirements are
relaxed. However, we also show experimentally that regularized second order
optimization can provide a practical tradeoff, where training is accelerated
but less information is lost, and generalization can in some circumstances even
improve.
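The whitening claim is concrete enough to check in a few lines. The sketch below is ours, not the authors' code: the toy dataset, the shapes, and the linear least-squares model are assumptions made for illustration. It ZCA-whitens an overparameterized dataset (fewer samples than features) and verifies that the sample-sample second moment matrix collapses to a multiple of the identity; it then shows the second-order analogue, where a single fully preconditioned gradient step interpolates arbitrary training targets, leaving no per-sample structure for generalization to exploit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                   # n samples, d features, n < d
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated toy features

# ZCA whitening: multiply by C^(-1/2), where C is the feature-feature second
# moment. C has rank <= n < d, so invert it on its range only.
C = X.T @ X / n
evals, evecs = np.linalg.eigh(C)
keep = evals > 1e-10 * evals.max()
W = evecs[:, keep] @ np.diag(evals[keep] ** -0.5) @ evecs[:, keep].T
X_white = X @ W

# The sample-sample second moment (Gram) matrix collapses to a multiple of
# the identity: whitening erases exactly the structure used to generalize.
print(np.allclose(X_white @ X_white.T, n * np.eye(n)))   # True

# Second-order connection: from w = 0, one gradient step on linear least
# squares preconditioned by pinv(C) (a Newton-style step, learning rate 1)
# fits the training targets exactly, whatever the targets are.
y = rng.normal(size=n)
Cpinv = evecs[:, keep] @ np.diag(1.0 / evals[keep]) @ evecs[:, keep].T
w = Cpinv @ (X.T @ y) / n                        # step along -pinv(C) @ grad
print(np.allclose(X @ w, y))                     # True: zero training loss
```

Regularizing the preconditioner (e.g., inverting C + lambda * I rather than C) interpolates between these extremes, which is the practical tradeoff the abstract describes.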
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- HG-Adapter: Improving Pre-Trained Heterogeneous Graph Neural Networks with Dual Adapters [53.97380482341493]
"pre-train, prompt-tuning" has demonstrated impressive performance for tuning pre-trained heterogeneous graph neural networks (HGNNs)
We propose a unified framework that combines two new adapters with potential labeled data extension to improve the generalization of pre-trained HGNN models.
arXiv Detail & Related papers (2024-11-02T06:43:54Z)
- LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views [28.081794908107604]
Fine-tuning is used to leverage the power of pre-trained foundation models in new downstream tasks.
Recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions.
We propose a novel generalizable fine-tuning method LEVI, where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model.
arXiv Detail & Related papers (2024-02-07T08:16:40Z)
- Toward Theoretical Guidance for Two Common Questions in Practical Cross-Validation based Hyperparameter Selection [72.76113104079678]
We present the first theoretical treatments of two common questions in cross-validation based hyperparameter selection.
We show that the proposed generalizations can always perform at least as well as the respective extremes of always retraining or never retraining; a code sketch of the retraining choice follows this list.
arXiv Detail & Related papers (2023-01-12T16:37:12Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch achieves final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Bi-tuning of Pre-trained Representations [79.58542780707441]
Bi-tuning is a general learning framework to fine-tune both supervised and unsupervised pre-trained representations to downstream tasks.
Bi-tuning generalizes the vanilla fine-tuning by integrating two heads upon the backbone of pre-trained representations.
Bi-tuning achieves state-of-the-art results for fine-tuning tasks of both supervised and unsupervised pre-trained models by large margins.
arXiv Detail & Related papers (2020-11-12T03:32:25Z)
- The Curious Case of Adversarially Robust Models: More Data Can Help, Double Descend, or Hurt Generalization [36.87923859576768]
Adversarial training has shown its ability to produce models that are robust to perturbations of the input data, but usually at the expense of a decrease in standard accuracy.
In this paper, we show that more training data can hurt the generalization of adversarially robust models in classification problems.
arXiv Detail & Related papers (2020-02-25T18:25:28Z)
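The cross-validation entry above formalizes, among other things, the choice between always retraining on the full dataset after hyperparameter selection and never doing so. A minimal, hedged sketch of that choice in scikit-learn follows; the dataset, estimator, and search grid are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of the "retrain or not after CV" choice; the data, estimator,
# and grid below are assumptions made for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
grid = {"C": [0.01, 0.1, 1.0, 10.0]}
base = LogisticRegression(max_iter=1000)

# "Always retrain": select hyperparameters by CV, then refit on all the data.
search = GridSearchCV(base, grid, cv=5, refit=True).fit(X, y)
model_retrained = search.best_estimator_         # trained on the full dataset

# "Never retrain": select by CV on a training split, keep a model fit on
# that split only, with no final fit on the full dataset.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
selection = GridSearchCV(base, grid, cv=5, refit=False).fit(X_tr, y_tr)
model_no_retrain = LogisticRegression(max_iter=1000, **selection.best_params_).fit(X_tr, y_tr)
```

Per the summary above, the paper's generalizations of these two extremes can perform at least as well as either one.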