Improving generalization in large language models by learning prefix subspaces
- URL: http://arxiv.org/abs/2310.15793v1
- Date: Tue, 24 Oct 2023 12:44:09 GMT
- Title: Improving generalization in large language models by learning prefix subspaces
- Authors: Louis Falissard, Vincent Guigue, Laure Soulier
- Abstract summary: This article focuses on fine-tuning large language models (LLMs) in the scarce data regime (also known as the "few-shot" learning setting).
We propose a method to increase the generalization capabilities of LLMs based on neural network subspaces.
- Score: 5.911540700785975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article focuses on fine-tuning large language models (LLMs) in
the scarce data regime (also known as the "few-shot" learning setting). We propose
a method to increase the generalization capabilities of LLMs based on neural
network subspaces. This optimization method, recently introduced in computer
vision, aims to improve model generalization by identifying wider local optima
through the joint optimization of an entire simplex of models in parameter
space. Its adaptation to massive, pretrained transformers, however, poses some
challenges. First, their considerable number of parameters makes it difficult
to train several models jointly, and second, their deterministic parameter
initialization schemes make them unfit for the subspace method as originally
proposed. We show in this paper that "Parameter Efficient Fine-Tuning" (PEFT)
methods, however, are perfectly compatible with this original approach, and
propose to learn an entire simplex of continuous prefixes. We test our method on a
variant of the GLUE benchmark adapted to the few-shot learning setting, and
show that both our contributions jointly lead to a gain in average performance
compared to state-of-the-art methods. The implementation can be found at the following
link: https://github.com/Liloulou/prefix_subspace
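Below is a minimal sketch of the training idea described in the abstract, assuming a PyTorch prefix-tuning setup: K "corner" prefixes are kept as learnable tensors, and every training step samples a random point of the simplex they span before computing the task loss on the frozen model. The class PrefixSimplex and the helper forward_with_prefix are illustrative names, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch only; see https://github.com/Liloulou/prefix_subspace for the
# authors' implementation. Assumption: the frozen backbone is reached through a
# hypothetical helper `forward_with_prefix(model, batch, prefix)` returning the loss.

import torch
import torch.nn as nn


class PrefixSimplex(nn.Module):
    """Holds K corner prefixes; each step mixes them with random simplex weights."""

    def __init__(self, num_corners: int, prefix_len: int, hidden_dim: int):
        super().__init__()
        self.corners = nn.Parameter(
            torch.randn(num_corners, prefix_len, hidden_dim) * 0.02
        )

    def sample_prefix(self) -> torch.Tensor:
        # Convex-combination weights drawn uniformly from the simplex (flat Dirichlet).
        alpha = torch.distributions.Dirichlet(
            torch.ones(self.corners.shape[0])
        ).sample().to(self.corners.device)
        # Weighted sum of the corners -> one prefix inside the simplex.
        return torch.einsum("k,kld->ld", alpha, self.corners)


def training_step(model, batch, prefix_simplex, optimizer, forward_with_prefix):
    """One step: sample a prefix from the simplex, compute the task loss, update corners.

    The optimizer is built over prefix_simplex.parameters() only; the backbone stays frozen.
    """
    optimizer.zero_grad()
    prefix = prefix_simplex.sample_prefix()
    loss = forward_with_prefix(model, batch, prefix)
    loss.backward()   # gradients reach only the simplex corners
    optimizer.step()
    return loss.item()
```

At evaluation time, subspace methods typically use a single point of the learned region, for example the centroid of the corners, or ensemble several sampled points.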
Related papers
- Sparse Orthogonal Parameters Tuning for Continual Learning [34.462967722928724]
Continual learning methods based on pre-trained models (PTM), which adapt to successive downstream tasks without catastrophic forgetting, have recently gained attention.
We propose a novel yet effective method called SoTU (Sparse Orthogonal Parameters TUning).
arXiv Detail & Related papers (2024-11-05T05:19:09Z)
- Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks [24.935016443423233]
This study introduces a novel optimization approach, termed the functional homotopy method.
By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods.
We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a 20%-30% improvement in success rate over existing methods.
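The snippet below shows the generic continuation ("easy-to-hard") skeleton this summary describes, on a toy objective and assuming plain gradient descent at each level; the objective family and the schedule are placeholders, not the paper's jailbreak formulation.

```python
# Generic continuation ("easy-to-hard") skeleton in the spirit of homotopy methods.
# The objective family f(., t) and its schedule are placeholders, not the paper's
# actual discrete jailbreak objective.

import numpy as np


def homotopy_minimize(f_grad, x0, t_schedule, lr=0.1, steps_per_level=200):
    """Solve a sequence of problems f(., t) for t going from easy to hard,
    warm-starting each level from the previous solution."""
    x = np.asarray(x0, dtype=float)
    for t in t_schedule:                   # e.g. np.linspace(0.0, 1.0, 11)
        for _ in range(steps_per_level):   # plain gradient descent at this level
            x = x - lr * f_grad(x, t)
    return x


# Toy usage: interpolate from a smooth surrogate (x^2) toward a harder target (x^4).
grad = lambda x, t: (1 - t) * 2 * x + t * 4 * x ** 3
x_star = homotopy_minimize(grad, x0=[3.0], t_schedule=np.linspace(0.0, 1.0, 11))
```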
arXiv Detail & Related papers (2024-10-05T17:22:39Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
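As an illustration of one building block suggested by this summary, the sketch below extracts a rank-r "expert" from the difference between a fine-tuned and a base weight matrix via truncated SVD. This is a generic low-rank expert construction under assumed shapes, not SMILE's actual upscaling or routing procedure.

```python
# Illustrative building block: turn the delta between a fine-tuned weight matrix and
# its pre-trained counterpart into a rank-r "expert" (b @ a). Generic low-rank
# extraction only; SMILE's actual construction and routing are not reproduced here.

import numpy as np


def low_rank_expert(w_base, w_finetuned, rank):
    """Return (a, b) such that b @ a approximates (w_finetuned - w_base) at the given rank."""
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # (out_dim, rank), singular values folded in
    a = vt[:rank, :]             # (rank, in_dim)
    return a, b


# Toy usage: the expert is applied on top of the frozen base weight.
w_base = np.random.randn(64, 32)
w_ft = w_base + 0.01 * np.random.randn(64, 32)
a, b = low_rank_expert(w_base, w_ft, rank=4)
x = np.random.randn(32)
y = w_base @ x + b @ (a @ x)     # base path + low-rank expert path
```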
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Improving Hyperparameter Optimization with Checkpointed Model Weights [16.509585437768063]
In this work, we propose an HPO method for neural networks using logged checkpoints of the trained weights.
Our method, Forecasting Model Search (FMS), embeds weights into a Gaussian process deep kernel surrogate model.
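A minimal sketch of the surrogate idea, assuming simple per-tensor statistics as the weight embedding and a plain RBF kernel; the paper's learned deep kernel and its search procedure are not reproduced.

```python
# Minimal sketch: a Gaussian-process surrogate scored on features derived from logged
# checkpoints. Simple weight statistics and an RBF kernel stand in for the paper's
# learned deep kernel; the search loop around the surrogate is omitted.

import numpy as np


def checkpoint_features(weights):
    # Cheap stand-in for a learned weight embedding: per-tensor norm statistics.
    return np.array([v for w in weights for v in (w.mean(), w.std(), np.abs(w).max())])


def rbf_kernel(x, y, lengthscale=1.0):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)


def gp_predict(x_train, y_train, x_test, noise=1e-3):
    """Standard GP regression mean prediction."""
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train)
    return k_star @ np.linalg.solve(k, y_train)


# Toy usage: predict the validation loss of a new run from its checkpoint features.
x_train = np.stack([checkpoint_features([np.random.randn(8, 8)]) for _ in range(5)])
y_train = np.random.rand(5)        # logged validation losses
x_new = checkpoint_features([np.random.randn(8, 8)])[None, :]
predicted_loss = gp_predict(x_train, y_train, x_new)
```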
arXiv Detail & Related papers (2024-06-26T17:59:54Z)
- CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs [44.03692512352445]
Column-Level Adaptive weight Quantization (CLAQ) is a novel and effective framework for Large Language Model (LLM) quantization.
In this paper, we present the CLAQ framework by introducing three different types of adaptive strategies for LLM quantization.
Experiments on various mainstream open-source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve state-of-the-art results across different bit settings.
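A sketch of the basic column-level idea behind such methods, assuming symmetric uniform quantization with one scale per column; CLAQ's adaptive strategies (the paper's actual contribution) are not reproduced.

```python
# Sketch of column-wise weight quantization: each column gets its own scale so that
# columns with very different magnitudes do not share one grid. CLAQ's adaptive
# per-column strategies are not shown here.

import numpy as np


def quantize_per_column(w, bits=4):
    """Symmetric uniform quantization with one scale per column of w."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for signed 4-bit
    scale = np.abs(w).max(axis=0) / qmax         # one scale per column
    scale = np.where(scale == 0, 1.0, scale)     # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    return q.astype(np.float32) * scale


# Toy usage: the reconstruction error of each column is bounded by its own scale.
w = np.random.randn(128, 64).astype(np.float32)
q, scale = quantize_per_column(w, bits=4)
max_error = np.abs(dequantize(q, scale) - w).max()
```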
arXiv Detail & Related papers (2024-05-27T14:49:39Z)
- Rethinking Few-shot 3D Point Cloud Semantic Segmentation [62.80639841429669]
This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS).
We focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution.
To address these issues, we introduce a standardized FS-PCS setting, upon which a new benchmark is built.
arXiv Detail & Related papers (2024-03-01T15:14:47Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
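A sketch of the underlying factorization, assuming a generic tensor-train / MPO-style split of a weight matrix into a chain of small cores via sequential truncated SVDs; the paper's central-tensor sharing across layers is not reproduced.

```python
# Generic MPO / tensor-train style factorization of a weight matrix into a chain of
# small cores via sequential truncated SVDs. The central-tensor sharing scheme used
# in the paper is not reproduced here.

import numpy as np


def mpo_cores(matrix, out_shape, in_shape, max_rank):
    """Split a (prod(out_shape) x prod(in_shape)) matrix into len(out_shape) cores."""
    n = len(out_shape)
    assert len(in_shape) == n
    # Reshape and interleave axes so each (out_i, in_i) pair is adjacent.
    t = matrix.reshape(*out_shape, *in_shape)
    t = t.transpose(*[ax for i in range(n) for ax in (i, n + i)])
    cores, rank = [], 1
    for i in range(n - 1):
        t = t.reshape(rank * out_shape[i] * in_shape[i], -1)
        u, s, vt = np.linalg.svd(t, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(rank, out_shape[i], in_shape[i], r))
        t = s[:r, None] * vt[:r]          # carry the remainder to the next site
        rank = r
    cores.append(t.reshape(rank, out_shape[-1], in_shape[-1], 1))
    return cores


# Toy usage: a 64x64 matrix as a 2-core MPO with bond dimension 8.
w = np.random.randn(64, 64)
cores = mpo_cores(w, out_shape=(8, 8), in_shape=(8, 8), max_rank=8)
n_params = sum(c.size for c in cores)     # 1024 here, versus 4096 for the dense matrix
```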
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Towards Universal Sequence Representation Learning for Recommender Systems [98.02154164251846]
We present a novel universal sequence representation learning approach, named UniSRec.
The proposed approach utilizes the associated description text of items to learn transferable representations across different recommendation scenarios.
Our approach can be effectively transferred to new recommendation domains or platforms in a parameter-efficient way.
arXiv Detail & Related papers (2022-06-13T07:21:56Z)
- Re-parameterizing Your Optimizers rather than Architectures [119.08740698936633]
We propose a novel paradigm of incorporating model-specific prior knowledge into optimizers and using them to train generic (simple) models.
As an implementation, we propose a novel methodology to add prior knowledge by modifying the gradients according to a set of model-specific hyper-parameters.
We focus on a VGG-style plain model and showcase that such a simple model, trained with a re-parameterizing optimizer and referred to as RepOpt-VGG, performs on par with recent well-designed models.
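A sketch of the general mechanism described here, assuming a fixed per-parameter gradient scale applied inside an SGD-style step; how the paper derives those scales from structural priors is not reproduced.

```python
# Sketch: an SGD-style optimizer that rescales each parameter's gradient with a fixed,
# model-specific factor before the update. The way such factors are derived from
# structural priors in the paper is not reproduced here.

import torch


class GradScaledSGD(torch.optim.Optimizer):
    def __init__(self, params_and_scales, lr=0.01):
        # params_and_scales: iterable of (parameter, scale) pairs.
        groups = [{"params": [p], "scale": s} for p, s in params_and_scales]
        super().__init__(groups, defaults={"lr": lr})

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    # The prior lives in the gradient scale, not in extra branches.
                    p.add_(p.grad * group["scale"], alpha=-group["lr"])


# Toy usage with hand-picked scales (illustrative only).
w1 = torch.nn.Parameter(torch.randn(4, 4))
w2 = torch.nn.Parameter(torch.randn(4))
opt = GradScaledSGD([(w1, 2.0), (w2, 0.5)], lr=0.05)
loss = (w1.sum() - w2.sum()) ** 2
loss.backward()
opt.step()
```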
arXiv Detail & Related papers (2022-05-30T16:55:59Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
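A sketch of the standard kNN-LM interpolation that such efficiency work builds on, assuming a brute-force datastore search; the specific speed-up techniques studied in the paper are not shown.

```python
# Standard kNN-LM style interpolation: retrieve nearest stored context vectors, turn
# their recorded next tokens into a distribution, and mix it with the base LM's.
# Brute-force search stands in for an approximate index; the paper's efficiency
# techniques are not shown.

import numpy as np


def knn_lm_probs(query, keys, next_tokens, p_lm, k=8, temperature=1.0, lam=0.25):
    """query: (d,) context vector; keys: (N, d); next_tokens: (N,); p_lm: (V,)."""
    dists = ((keys - query) ** 2).sum(axis=1)      # squared L2 distances
    nn = np.argsort(dists)[:k]                     # brute force; an ANN index in practice
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, next_tokens[nn], weights)     # scatter weights onto token ids
    return lam * p_knn + (1.0 - lam) * p_lm


# Toy usage with a random datastore and a vocabulary of 100 tokens.
rng = np.random.default_rng(0)
keys = rng.normal(size=(1000, 16))
next_tokens = rng.integers(0, 100, size=1000)
p_lm = np.full(100, 1.0 / 100)
p = knn_lm_probs(rng.normal(size=16), keys, next_tokens, p_lm)
```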
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.