Improving generalization in large language models by learning prefix subspaces
- URL: http://arxiv.org/abs/2310.15793v1
- Date: Tue, 24 Oct 2023 12:44:09 GMT
- Title: Improving generalization in large language models by learning prefix subspaces
- Authors: Louis Falissard, Vincent Guigue, Laure Soulier
- Abstract summary: This article focuses on fine-tuning large language models (LLMs) in the scarce-data regime (also known as the "few-shot" learning setting).
We propose a method to increase the generalization capabilities of LLMs based on neural network subspaces.
- Score: 5.911540700785975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article focuses on fine-tuning large language models (LLMs) in the
scarce-data regime (also known as the "few-shot" learning setting). We propose
a method to increase the generalization capabilities of LLMs based on neural
network subspaces. This optimization method, recently introduced in computer
vision, aims to improve model generalization by identifying wider local optima
through the joint optimization of an entire simplex of models in parameter
space. Its adaptation to massive, pretrained transformers, however, poses some
challenges. First, their considerable number of parameters makes it difficult
to train several models jointly, and second, their deterministic parameter
initialization schemes make them unfit for the subspace method as originally
proposed. We show in this paper that "Parameter-Efficient Fine-Tuning" (PEFT)
methods are perfectly compatible with this original approach, and propose to
learn an entire simplex of continuous prefixes. We test our method on a
variant of the GLUE benchmark adapted to the few-shot learning setting, and
show that both our contributions jointly lead to a gain in average performance
compared to state-of-the-art methods. The implementation can be found at the following
link: https://github.com/Liloulou/prefix_subspace
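To make the subspace idea concrete, here is a minimal, hypothetical sketch (not the authors' released code; names such as PrefixSimplex, num_vertices, prefix_len and hidden_dim are illustrative assumptions): each vertex of the simplex holds its own continuous prefix, and at every training step a random convex combination of the vertices is prepended to the frozen model, so the whole simplex is optimized jointly.

```python
# Hedged sketch of prefix-subspace training, assuming a PyTorch setup.
# The authors' actual implementation lives at https://github.com/Liloulou/prefix_subspace;
# all names and hyperparameters below are illustrative.
import torch
import torch.nn as nn

class PrefixSimplex(nn.Module):
    """Holds one continuous prefix per simplex vertex and samples convex combinations."""

    def __init__(self, num_vertices: int, prefix_len: int, hidden_dim: int):
        super().__init__()
        # One independently initialized prefix per vertex; all vertices are trained jointly.
        self.vertices = nn.Parameter(0.02 * torch.randn(num_vertices, prefix_len, hidden_dim))

    def sample(self) -> torch.Tensor:
        # Draw a random point of the simplex (flat Dirichlet) and mix the vertex prefixes.
        alpha = torch.distributions.Dirichlet(torch.ones(self.vertices.size(0))).sample()
        alpha = alpha.to(self.vertices.device)
        return torch.einsum("v,vld->ld", alpha, self.vertices)

# At each training step, the sampled prefix would be prepended to the frozen LLM's
# inputs (as in prefix tuning) and gradients flow back into every vertex at once.
prefix_simplex = PrefixSimplex(num_vertices=3, prefix_len=20, hidden_dim=768)
prefix = prefix_simplex.sample()  # shape: (20, 768)
```

A natural inference-time choice, following the original subspace work in computer vision, is to use a fixed point of the simplex such as its centroid; the paper and the linked repository are the authoritative reference for the exact procedure.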
Related papers
- Dynamic Subset Tuning: Expanding the Operational Range of Parameter-Efficient Training for Large Language Models [14.762222323897978]
We propose a novel parameter-efficient training (PET) method for large language models.
Unlike prior methods, this subset is not fixed in location; rather, the set of modified parameters evolves over the course of training.
Our method enables a seamless scaling of the subset size across an arbitrary proportion of the total model size.
arXiv Detail & Related papers (2024-11-13T13:53:10Z) - Sparse Orthogonal Parameters Tuning for Continual Learning [34.462967722928724]
Continual learning methods based on pre-trained models (PTMs), which adapt to successive downstream tasks without catastrophic forgetting, have recently gained attention.
We propose a novel yet effective method called SoTU (Sparse Orthogonal Parameters TUning).
arXiv Detail & Related papers (2024-11-05T05:19:09Z) - Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks [24.935016443423233]
This study introduces a novel optimization approach, termed the functional homotopy method.
By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods.
We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a 20%-30% improvement in success rate over existing methods.
arXiv Detail & Related papers (2024-10-05T17:22:39Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently the two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Improving Hyperparameter Optimization with Checkpointed Model Weights [16.509585437768063]
In this work, we propose an HPO method for neural networks using logged checkpoints of the trained weights.
Our method, Forecasting Model Search (FMS), embeds weights into a Gaussian process deep kernel surrogate model.
arXiv Detail & Related papers (2024-06-26T17:59:54Z) - Rethinking Few-shot 3D Point Cloud Semantic Segmentation [62.80639841429669]
This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS).
We focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution.
To address these issues, we introduce a standardized FS-PCS setting, upon which a new benchmark is built.
arXiv Detail & Related papers (2024-03-01T15:14:47Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Re-parameterizing Your Optimizers rather than Architectures [119.08740698936633]
We propose a novel paradigm of incorporating model-specific prior knowledge into optimizers and using them to train generic (simple) models.
As an implementation, we propose a novel methodology to add prior knowledge by modifying the gradients according to a set of model-specific hyper-parameters.
We focus on a VGG-style plain model and showcase that such a simple model trained with a re-parameterized optimizer, referred to as RepOpt-VGG, performs on par with recent well-designed models.
arXiv Detail & Related papers (2022-05-30T16:55:59Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
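For context on this last entry, non-parametric language models of the kNN-LM family interpolate the base model's next-token distribution with a distribution built from nearest neighbours retrieved from the external datastore. The sketch below shows only that generic interpolation, not the paper's speed-up techniques; the function name, mixture weight and retrieval details are illustrative assumptions.

```python
# Hedged sketch of the kNN-LM-style interpolation that non-parametric NLMs rely on.
# The datastore layout, lambda value and brute-force search are illustrative assumptions;
# the cited paper's contribution concerns making this retrieval step much cheaper.
import numpy as np

def knn_lm_next_token_probs(p_lm: np.ndarray,
                            query: np.ndarray,
                            keys: np.ndarray,
                            values: np.ndarray,
                            vocab_size: int,
                            k: int = 8,
                            lam: float = 0.25,
                            temperature: float = 1.0) -> np.ndarray:
    """Interpolate the parametric LM distribution with a kNN distribution
    built from (context-vector, next-token) pairs stored in a datastore."""
    # Squared L2 distances from the query context vector to every stored key.
    dists = ((keys - query) ** 2).sum(axis=1)
    nn_idx = np.argsort(dists)[:k]
    # Softmax over negative distances of the k nearest neighbours.
    logits = -dists[nn_idx] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Scatter neighbour weights onto their stored next tokens.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn_idx], weights)
    # Final distribution is a fixed-lambda mixture of the two distributions.
    return lam * p_knn + (1.0 - lam) * p_lm
```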