Knowledge is a Region in Weight Space for Fine-tuned Language Models
- URL: http://arxiv.org/abs/2302.04863v3
- Date: Thu, 12 Oct 2023 18:42:34 GMT
- Title: Knowledge is a Region in Weight Space for Fine-tuned Language Models
- Authors: Almog Gueta, Elad Venezian, Colin Raffel, Noam Slonim, Yoav Katz, Leshem Choshen
- Abstract summary: We study how the weight space and the underlying loss landscape of different models are interconnected.
We show that language models that have been finetuned on the same dataset form a tight cluster in the weight space, while models finetuned on different datasets from the same underlying task form a looser cluster.
- Score: 48.589822853418404
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Research on neural networks has focused on understanding a single model
trained on a single dataset. However, relatively little is known about the
relationships between different models, particularly those trained or tested on
different datasets. We address this by studying how the weight space and the
underlying loss landscape of different models are interconnected.
Specifically, we demonstrate that finetuned models that were optimized for
high performance reside in well-defined regions in weight space, and vice
versa -- that any model that resides anywhere in those regions also exhibits
high performance. Notably, we show that language models that have been
finetuned on the same dataset form a tight cluster in the weight space, while
models finetuned on different datasets from the same underlying task form a
looser cluster. Moreover, traversing around the region between the models leads
to new models that perform comparably or even better than models obtained via
finetuning, even on tasks that the original models were not finetuned on.
Our findings provide insight into the relationships between models,
demonstrating that a model positioned between two similar models can acquire
the knowledge of both. We leverage this and design a method for selecting a
better model for efficient finetuning. Specifically, we show that starting from
the center of the region is as effective as, if not more effective than, using the
pretrained model in 11 out of 12 datasets, resulting in an average accuracy
improvement of 3.06.
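The two weight-space operations the abstract mentions, taking the center of a region and moving between two models, can be sketched on toy data. This is a minimal illustration, not the authors' implementation: it assumes all models share one architecture and the same parameter names, and the `Weights` type, the helper names, and the example values are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical flat "state dict": parameter name -> list of values.
Weights = Dict[str, List[float]]


def centroid(models: Sequence[Weights]) -> Weights:
    """Uniform average of the models' parameters (the 'center' of their region)."""
    if not models:
        raise ValueError("need at least one model")
    n = len(models)
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(values))]
        for name, values in models[0].items()
    }


def interpolate(a: Weights, b: Weights, alpha: float) -> Weights:
    """Point on the segment between a and b: (1 - alpha) * a + alpha * b."""
    return {
        name: [(1.0 - alpha) * x + alpha * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def l2_distance(a: Weights, b: Weights) -> float:
    """Euclidean distance between two models viewed as flattened weight vectors."""
    return math.sqrt(
        sum((x - y) ** 2 for name in a for x, y in zip(a[name], b[name]))
    )


# Toy stand-ins for two finetuned models (made-up values, not real checkpoints).
model_a: Weights = {"w": [1.0, 2.0], "b": [0.0]}
model_b: Weights = {"w": [3.0, 4.0], "b": [2.0]}

center = centroid([model_a, model_b])          # candidate starting point for finetuning
midpoint = interpolate(model_a, model_b, 0.5)  # a model "between" the two
print(center, midpoint, l2_distance(model_a, model_b))
```

With only two models the centroid and the midpoint coincide. With more models, `centroid` gives the center of the cluster, while `interpolate` traces a path between any chosen pair. Whether such points perform well has to be checked by evaluating the resulting model.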
Related papers
- A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
- What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
- Enabling Small Models for Zero-Shot Classification through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities.
Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z)
- Model Selection with Model Zoo via Graph Learning [45.30615308692713]
We introduce TransferGraph, a novel framework that reformulates model selection as a graph learning problem.
We demonstrate TransferGraph's effectiveness in capturing essential model-dataset relationships, yielding up to a 32% improvement in correlation between predicted performance and the actual fine-tuning results compared to the state-of-the-art methods.
arXiv Detail & Related papers (2024-04-05T09:50:00Z)
- Transfer Learning with Point Transformers [3.678615604632945]
Point Transformers are state-of-the-art models for classification, segmentation, and detection on Point Cloud data.
We explore two things: the classification performance of these attention-based networks on the ModelNet10 dataset, and the use of the trained model to classify the 3D MNIST dataset after finetuning.
arXiv Detail & Related papers (2024-04-01T01:23:58Z)
- A Two-Phase Recall-and-Select Framework for Fast Model Selection [13.385915962994806]
We propose a two-phase (coarse-recall and fine-selection) model selection framework.
It aims to enhance the efficiency of selecting a robust model by leveraging the models' training performances on benchmark datasets.
The proposed method selects a high-performing model about 3x faster than conventional baseline methods.
arXiv Detail & Related papers (2024-03-28T14:44:44Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Revealing Secrets From Pre-trained Models [2.0249686991196123]
Transfer-learning has been widely adopted in many emerging deep learning algorithms.
We show that pre-trained models and fine-tuned models have significantly high similarities in weight values.
We propose a new model extraction attack that reveals the model architecture and the pre-trained model used by the black-box victim model.
arXiv Detail & Related papers (2022-07-19T20:19:03Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)