Linear Connectivity Reveals Generalization Strategies
- URL: http://arxiv.org/abs/2205.12411v1
- Date: Tue, 24 May 2022 23:43:02 GMT
- Title: Linear Connectivity Reveals Generalization Strategies
- Authors: Jeevesh Juneja and Rachit Bansal and Kyunghyun Cho and João Sedoc and Naomi Saphra
- Abstract summary: Some pairs of finetuned models have large barriers of increasing loss on the linear paths between them.
We find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster.
Our work demonstrates how the geometry of the loss surface can guide models towards different functions.
- Score: 54.947772002394736
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: It is widely accepted in the mode connectivity literature that when two
neural networks are trained similarly on the same data, they are connected by a
path through parameter space over which test set accuracy is maintained. Under
some circumstances, including transfer learning from pretrained models, these
paths are presumed to be linear. In contrast to existing results, we find that
among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of
finetuned models have large barriers of increasing loss on the linear paths
between them. On each task, we find distinct clusters of models which are
linearly connected on the test loss surface, but are disconnected from models
outside the cluster -- models that occupy separate basins on the surface. By
measuring performance on specially-crafted diagnostic datasets, we find that
these clusters correspond to different generalization strategies: one cluster
behaves like a bag of words model under domain shift, while another cluster
uses syntactic heuristics. Our work demonstrates how the geometry of the loss
surface can guide models towards different heuristic functions.
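Below is a minimal sketch (not the paper's code) of the underlying measurement: evaluate test loss at points along the linear path between two finetuned models and report the barrier relative to a straight line joining the endpoint losses. The models, `eval_loss(model, loader) -> float`, and `test_loader` are hypothetical stand-ins.

```python
# Sketch: probe test loss along the linear path theta(alpha) between two
# finetuned models and compute the loss barrier on that path.
import copy

def interpolate_state_dicts(sd_a, sd_b, alpha):
    """(1 - alpha) * theta_a + alpha * theta_b, parameter by parameter."""
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

def loss_barrier(model_a, model_b, eval_loss, test_loader, steps=11):
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)          # reused container for interpolated weights
    alphas = [i / (steps - 1) for i in range(steps)]
    losses = []
    for alpha in alphas:
        probe.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
        losses.append(eval_loss(probe, test_loader))
    # Barrier: largest rise of the path loss above linear interpolation of the
    # two endpoint losses.
    baseline = [(1 - a) * losses[0] + a * losses[-1] for a in alphas]
    return max(l - b for l, b in zip(losses, baseline)), losses
```

Under this measurement, a pair of models from the same cluster would show a barrier near zero, while a pair drawn from different basins would show a large positive barrier.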
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network)
After being trained on a small base model using demonstrations, the value network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
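A hedged sketch of the general idea (not the paper's implementation): a small value network predicts a logit-level correction that is added to a frozen pre-trained model's logits at inference. The `.logits` attribute assumes HF-style model outputs; all names here are illustrative.

```python
# Sketch: combine a frozen pre-trained model with a separately trained
# value network by summing their outputs at the logits level.
import torch

@torch.no_grad()
def corrected_logits(base_model, value_network, input_ids):
    base_logits = base_model(input_ids).logits   # frozen pre-trained model
    delta = value_network(input_ids).logits      # learned post-training shift
    return base_logits + delta                   # combined at the logits level
```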
arXiv Detail & Related papers (2024-10-28T13:48:43Z) - Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
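As a much simpler illustration than LieSD's Lie-algebra-based discovery, the sketch below merely checks whether a trained network's input-output mapping is invariant under one candidate transformation (a 2-D rotation); `model` is a hypothetical trained module with scalar output.

```python
# Sketch: probe a candidate symmetry through a trained network's
# input-output mapping by comparing outputs on original and rotated inputs.
import math
import torch

def rotation_invariance_gap(model, points, angle):
    """points: (batch, 2). Returns the largest output change under rotation."""
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, -s], [s, c]], dtype=points.dtype)
    rotated = points @ rot.T
    with torch.no_grad():
        return (model(points) - model(rotated)).abs().max()
```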
arXiv Detail & Related papers (2024-10-13T13:39:39Z) - Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers or groups of layers.
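A minimal sketch of the layer-wise variant: average only the parameters whose names fall in a chosen group, keeping the rest from the first model. The state dicts and prefix names are hypothetical stand-ins for two models with matching architectures.

```python
# Sketch: layer-wise parameter averaging restricted to selected layer groups.
def average_layers(sd_a, sd_b, layer_prefixes):
    merged = dict(sd_a)                     # default: keep model A's parameters
    for name in sd_a:
        if any(name.startswith(p) for p in layer_prefixes):
            merged[name] = 0.5 * (sd_a[name] + sd_b[name])
    return merged

# e.g. merged = average_layers(model_a.state_dict(), model_b.state_dict(),
#                              layer_prefixes=["encoder.layer.0.", "encoder.layer.1."])
```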
arXiv Detail & Related papers (2023-07-13T09:39:10Z) - Phantom Embeddings: Using Embedding Space for Model Regularization in
Deep Neural Networks [12.293294756969477]
The strength of machine learning models stems from their ability to learn complex function approximations from data.
Complex models tend to memorize the training data, which results in poor generalization performance on test data.
We present a novel approach to regularize the models by leveraging the information-rich latent embeddings and their high intra-class correlation.
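As a hedged illustration of embedding-space regularization (not necessarily the paper's phantom-embedding construction), the sketch below exploits intra-class similarity by pulling each latent embedding toward its class mean within a batch.

```python
# Sketch: a simple embedding-space penalty that encourages high intra-class
# correlation by shrinking embeddings toward their per-class batch centroid.
import torch

def intra_class_embedding_penalty(embeddings, labels):
    """embeddings: (batch, dim); labels: (batch,). Returns a scalar penalty."""
    penalty = embeddings.new_zeros(())
    for c in labels.unique():
        members = embeddings[labels == c]
        if members.shape[0] > 1:
            centroid = members.mean(dim=0, keepdim=True)
            penalty = penalty + ((members - centroid) ** 2).sum(dim=1).mean()
    return penalty
```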
arXiv Detail & Related papers (2023-04-14T17:15:54Z) - The Contextual Lasso: Sparse Linear Models via Deep Neural Networks [5.607237982617641]
We develop a new statistical estimator that fits a sparse linear model to the explanatory features such that the sparsity pattern and coefficients vary as a function of the contextual features.
An extensive suite of experiments on real and synthetic data suggests that the learned models, which remain highly transparent, can be sparser than the regular lasso.
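A sketch of the core idea, assuming a simple setup rather than the paper's estimator: a small network maps contextual features z to per-example linear coefficients beta(z), which are applied to explanatory features x, with a lasso-style penalty on beta(z) encouraging sparsity.

```python
# Sketch: a linear model whose coefficients are a function of context features.
import torch.nn as nn

class ContextualLinear(nn.Module):
    def __init__(self, n_explanatory, n_context, hidden=32):
        super().__init__()
        self.coef_net = nn.Sequential(
            nn.Linear(n_context, hidden), nn.ReLU(),
            nn.Linear(hidden, n_explanatory + 1),   # coefficients + intercept
        )

    def forward(self, x, z):
        out = self.coef_net(z)                      # coefficients depend on context z
        beta, intercept = out[:, :-1], out[:, -1]
        y_hat = (beta * x).sum(dim=1) + intercept   # sparse-able linear prediction
        return y_hat, beta

# Training objective (sketch): squared error on y_hat plus lam * beta.abs().mean().
```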
arXiv Detail & Related papers (2023-02-02T05:00:29Z) - Neural Representations Reveal Distinct Modes of Class Fitting in
Residual Convolutional Networks [5.1271832547387115]
We leverage probabilistic models of neural representations to investigate how residual networks fit classes.
We find that classes in the investigated models are not fitted in a uniform way.
We show that the uncovered structure in neural representations correlates with memorization of training examples and adversarial robustness.
arXiv Detail & Related papers (2022-12-01T18:55:58Z) - Git Re-Basin: Merging Models modulo Permutation Symmetries [3.5450828190071655]
We show how simple algorithms for permuting one model's units into alignment with another can be applied to large networks in practice.
We present the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained models.
We also discuss shortcomings in the linear mode connectivity hypothesis.
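A hedged sketch of unit permutation matching for a single weight matrix: find the row permutation of one layer's weights that best matches the corresponding layer of a reference model. Git Re-Basin aligns whole networks and composes permutations across layers; this shows only the per-layer matching step, with illustrative names.

```python
# Sketch: align the units (rows) of W_b to W_a via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_units(w_a, w_b):
    """w_a, w_b: (out_features, in_features) arrays for the same layer."""
    similarity = w_a @ w_b.T                       # score for pairing unit i of A with j of B
    row, col = linear_sum_assignment(-similarity)  # maximize total similarity
    return w_b[col]                                # W_b with rows permuted to match W_a
```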
arXiv Detail & Related papers (2022-09-11T10:44:27Z) - Dynamically-Scaled Deep Canonical Correlation Analysis [77.34726150561087]
Canonical Correlation Analysis (CCA) is a method for feature extraction of two views by finding maximally correlated linear projections of them.
We introduce a novel dynamic scaling method for training an input-dependent canonical correlation model.
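For contrast with the paper's dynamically-scaled deep variant, here is a minimal sketch of standard linear CCA with scikit-learn on two synthetic views of the same samples.

```python
# Sketch: fit linear CCA and inspect the correlation of the learned projections.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = X[:, :5] + 0.1 * rng.normal(size=(200, 5))    # correlated second view

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)                # maximally correlated linear projections
corr = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)]
```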
arXiv Detail & Related papers (2022-03-23T12:52:49Z) - T-LoHo: A Bayesian Regularization Model for Structured Sparsity and
Smoothness on Graphs [0.0]
In models of graph-structured data, parameters often exhibit structured sparsity and smoothness, with zero and non-zero values tending to cluster together.
We propose a new prior for high dimensional parameters with graphical relations.
We use it to detect structured sparsity and smoothness simultaneously.
arXiv Detail & Related papers (2021-07-06T10:10:03Z) - GELATO: Geometrically Enriched Latent Model for Offline Reinforcement
Learning [54.291331971813364]
Offline reinforcement learning approaches can be divided into proximal and uncertainty-aware methods.
In this work, we demonstrate the benefit of combining the two in a latent variational model.
Our proposed metrics measure both the quality of out of distribution samples as well as the discrepancy of examples in the data.
arXiv Detail & Related papers (2021-02-22T19:42:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.