Exploring Mode Connectivity for Pre-trained Language Models
- URL: http://arxiv.org/abs/2210.14102v1
- Date: Tue, 25 Oct 2022 15:40:11 GMT
- Title: Exploring Mode Connectivity for Pre-trained Language Models
- Authors: Yujia Qin, Cheng Qian, Jing Yi, Weize Chen, Yankai Lin, Xu Han,
Zhiyuan Liu, Maosong Sun and Jie Zhou
- Abstract summary: While many works study how to effectively adapt pre-trained language models (PLMs) to high-performance minima, little is known about how the resulting minima are connected.
In this paper, we investigate the geometric connections of different minima through the lens of mode connectivity.
- Score: 91.33378704580295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed the prevalent application of pre-trained language
models (PLMs) in NLP. From the perspective of parameter space, PLMs provide
generic initialization, starting from which high-performance minima could be
found. Although plenty of works have studied how to effectively and efficiently
adapt PLMs to high-performance minima, little is known about the connection of
various minima reached under different adaptation configurations. In this
paper, we investigate the geometric connections of different minima through the
lens of mode connectivity, which measures whether two minima can be connected
with a low-loss path. We conduct empirical analyses to investigate three
questions: (1) how could hyperparameters, specific tuning methods, and training
data affect PLM's mode connectivity? (2) How does mode connectivity change
during pre-training? (3) How does the PLM's task knowledge change along the
path connecting two minima? In general, exploring the mode connectivity of PLMs
conduces to understanding the geometric connection of different minima, which
may help us fathom the inner workings of PLM downstream adaptation.
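
To make the "low-loss path" criterion concrete, a common linear-interpolation probe evaluates the loss at every convex combination of two minima's parameters. The sketch below is a generic PyTorch illustration under assumed placeholder names (`model`, `data_loader`, `loss_fn`), not the paper's exact evaluation protocol.

```python
import copy
import torch

@torch.no_grad()
def loss_along_linear_path(model, state_a, state_b, data_loader, loss_fn, num_points=11):
    """Probe (linear) mode connectivity between two fine-tuned checkpoints.

    state_a / state_b are state_dicts of two minima adapted from the same PLM.
    Returns (alpha, mean loss) pairs; a flat, low curve suggests the minima lie
    in a connected low-loss region, while a pronounced bump suggests a barrier.
    """
    probe = copy.deepcopy(model)
    results = []
    for step in range(num_points):
        alpha = step / (num_points - 1)
        # theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b for float tensors;
        # integer buffers (e.g., position ids) are copied from state_a unchanged.
        mixed = {
            k: (1 - alpha) * v + alpha * state_b[k] if v.is_floating_point() else v
            for k, v in state_a.items()
        }
        probe.load_state_dict(mixed)
        probe.eval()
        total, count = 0.0, 0
        for inputs, labels in data_loader:
            total += loss_fn(probe(inputs), labels).item() * labels.size(0)
            count += labels.size(0)
        results.append((alpha, total / count))
    return results
```

One practical note: for transformer PLMs this direct evaluation is usually sufficient, since LayerNorm keeps no running statistics that would need recalibrating along the path (unlike BatchNorm in many vision models).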
Related papers
- In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning [0.6650227510403052]
Multi-objective reinforcement learning (MORL) is essential for addressing the intricacies of real-world RL problems.
MORL is challenging due to unstable learning dynamics with deep learning-based function approximators.
Our work empirically explores model-free policy learning loss functions and the impact of different architectural choices.
arXiv Detail & Related papers (2024-07-23T19:17:47Z) - Pareto Low-Rank Adapters: Efficient Multi-Task Learning with Preferences [49.14535254003683]
PaLoRA is a novel parameter-efficient method that augments the original model with task-specific low-rank adapters.
Experimental results show that PaLoRA outperforms multi-task learning (MTL) and Pareto front learning (PFL) baselines across various datasets.
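
For orientation, a generic low-rank adapter of the kind PaLoRA builds on can be wrapped around a frozen linear layer as below; this is an illustrative LoRA-style module with made-up names, not PaLoRA's task-specific adapters or its preference-weighted combination.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # the pre-trained weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as the base model
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```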
arXiv Detail & Related papers (2024-07-10T21:25:51Z) - MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation [80.47072100963017]
Model merging is an effective approach to combine multiple single-task models, fine-tuned from the same pre-trained model, into a multitask model.
Existing model-merging methods focus on enhancing average task accuracy.
We introduce a novel low-compute algorithm, Model Merging with Amortized Pareto Front (MAP).
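
As background on the merging setting, a minimal baseline combines the parameter deltas of several single-task models fine-tuned from one pre-trained checkpoint, as sketched below; the uniform coefficients are only placeholders, whereas MAP's contribution is choosing such coefficients via an amortized quadratic approximation of the Pareto front, which this sketch does not reproduce.

```python
import torch

def merge_finetuned_models(pretrained_state, finetuned_states, coeffs=None):
    """Merge single-task models fine-tuned from one pre-trained checkpoint.

    Each merged parameter is theta_pre + sum_i c_i * (theta_i - theta_pre).
    The coefficients would normally be tuned per task (e.g., along a Pareto
    front); here they default to a uniform average as a placeholder.
    """
    if coeffs is None:
        coeffs = [1.0 / len(finetuned_states)] * len(finetuned_states)
    merged = {}
    for name, base in pretrained_state.items():
        if base.is_floating_point():
            delta = sum(c * (s[name] - base) for c, s in zip(coeffs, finetuned_states))
            merged[name] = base + delta
        else:
            merged[name] = base  # integer buffers are copied unchanged
    return merged
```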
arXiv Detail & Related papers (2024-06-11T17:55:25Z) - Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient
Tuning [9.38259062204602]
Large language models (LLMs) exhibit remarkable performance in language understanding and generation.
LLMs are continuously fine-tuned on complex and diverse domain-specific downstream tasks.
A trade-off therefore needs to be maintained between learning plasticity and memory stability.
arXiv Detail & Related papers (2024-02-29T05:27:45Z) - Learning to Learn with Indispensable Connections [6.040904021861969]
We propose a novel meta-learning method called Meta-LTH that includes indispensable (necessary) connections.
Our method improves classification accuracy by approximately 2% on the Omniglot dataset in the 20-way 1-shot setting.
arXiv Detail & Related papers (2023-04-06T04:53:13Z) - LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of
Large Language Models [75.25782573728677]
This paper presents a framework for adapter-based parameter-efficient fine-tuning (PEFT) of large language models (LLMs).
The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as well as widely used adapters such as Series adapters, Parallel adapters, Prompt-based learning, and Reparametrization-based methods.
We evaluate the effectiveness of the adapters on fourteen datasets from two different reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning.
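
To unpack the adapter vocabulary, the snippet below sketches a generic bottleneck ("series") adapter added after a frozen sub-layer with a residual connection; it is a simplified illustration rather than the exact modules implemented in LLM-Adapters.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Series adapter: down-project, nonlinearity, up-project, residual connection."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as an identity mapping of the frozen backbone
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```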
arXiv Detail & Related papers (2023-04-04T16:31:37Z) - PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose PSNet, a video salient object detection (VSOD) network with up-down parallel symmetry.
Two parallel branches, each dominated by a different modality, are set up to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z) - Contrastive and Non-Contrastive Self-Supervised Learning Recover Global
and Local Spectral Embedding Methods [19.587273175563745]
Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations.
This paper proposes a unifying framework under the helm of spectral manifold learning to address the limitations of that premise.
arXiv Detail & Related papers (2022-05-23T17:59:32Z) - Hybrid Relation Guided Set Matching for Few-shot Action Recognition [51.3308583226322]
We propose a novel Hybrid Relation guided Set Matching (HyRSM) approach that incorporates two key components.
The purpose of the hybrid relation module is to learn task-specific embeddings by fully exploiting associated relations within and across videos in an episode.
We evaluate HyRSM on six challenging benchmarks, and the experimental results show its superiority over the state-of-the-art methods by a convincing margin.
arXiv Detail & Related papers (2022-04-28T11:43:41Z) - Multi-level Distance Regularization for Deep Metric Learning [20.178765779788492]
We propose a novel distance-based regularization method for deep metric learning called Multi-level Distance Regularization (MDR).
MDR explicitly disturbs a learning procedure by regularizing pairwise distances between embedding vectors into multiple levels.
By simply adopting MDR, previous approaches can be improved in performance and generalization ability.
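
The underlying idea, pulling pairwise embedding distances toward a small set of target levels, can be sketched as follows; the specific level values and penalty are illustrative assumptions rather than MDR's exact formulation.

```python
import torch

def multi_level_distance_regularizer(embeddings, levels=(0.5, 1.0, 1.5)):
    """Penalize pairwise embedding distances for drifting from their nearest target level.

    embeddings: (batch, dim) tensor of embedding vectors.
    levels: illustrative set of target distance levels (an assumption for this sketch).
    """
    dists = torch.cdist(embeddings, embeddings, p=2)        # (batch, batch) pairwise distances
    i, j = torch.triu_indices(dists.size(0), dists.size(0), offset=1)
    pair_dists = dists[i, j]                                 # unique pairs only
    level_tensor = torch.tensor(levels, device=embeddings.device)
    gaps = (pair_dists.unsqueeze(1) - level_tensor).abs()    # gap of each pair to every level
    return gaps.min(dim=1).values.mean()                     # pull each pair toward its nearest level
```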
arXiv Detail & Related papers (2021-02-08T14:16:07Z)