Composing Parameter-Efficient Modules with Arithmetic Operations
- URL: http://arxiv.org/abs/2306.14870v2
- Date: Sat, 9 Dec 2023 02:46:08 GMT
- Title: Composing Parameter-Efficient Modules with Arithmetic Operations
- Authors: Jinghan Zhang, Shiqi Chen, Junteng Liu, Junxian He
- Abstract summary: We propose to compose parameter-efficient modules through linear arithmetic operations in the weight space.
Our approach requires no additional training and enables highly flexible module composition.
We extend our approach to detoxify Alpaca-LoRA, the latest instruction-tuned large language model based on LLaMA.
- Score: 20.119291936493788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an efficient alternative to conventional full finetuning,
parameter-efficient finetuning (PEFT) is becoming the prevailing method to
adapt pretrained language models. In PEFT, a lightweight module is learned on
each dataset while the underlying pretrained language model remains unchanged,
resulting in multiple compact modules representing diverse skills when applied
to various domains and tasks. In this paper, we propose to compose these
parameter-efficient modules through linear arithmetic operations in the weight
space, thereby integrating different module capabilities. Specifically, we
first define addition and negation operators for the module, and then further
compose these two basic operators to perform flexible arithmetic. Our approach
requires \emph{no additional training} and enables highly flexible module
composition. We apply different arithmetic operations to compose the
parameter-efficient modules for (1) distribution generalization, (2)
multi-tasking, (3) unlearning, and (4) domain transfer. Additionally, we extend
our approach to detoxify Alpaca-LoRA, the latest instruction-tuned large
language model based on LLaMA. Empirical results demonstrate that our approach
produces new and effective parameter-efficient modules that significantly
outperform existing ones across all settings.
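The addition and negation operators described in the abstract can be sketched in a few lines. This is a minimal illustration over toy weight vectors, not the paper's implementation: the function names, the dict-based module representation, and the merge coefficient `alpha` are all assumptions made for the example.

```python
# Minimal sketch of weight-space arithmetic over PEFT modules.
# A "module" here is a dict mapping parameter names to flat weight
# vectors (plain Python lists); real modules would hold tensors.

def add(module_a, module_b, alpha=0.5):
    """Addition operator: linearly merge two modules in weight space."""
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(module_a[name], module_b[name])]
        for name in module_a
    }

def negate(module):
    """Negation operator: flip the sign of every delta weight,
    intended to remove the skill the module encodes (e.g. unlearning)."""
    return {name: [-w for w in weights] for name, weights in module.items()}

# Composing the two basic operators: keep the task skill while
# subtracting a module trained on toxic data (detoxification).
task_module = {"lora_A": [0.2, -0.4], "lora_B": [1.0, 0.5]}
toxic_module = {"lora_A": [0.1, 0.1], "lora_B": [0.2, -0.3]}

detoxified = add(task_module, negate(toxic_module), alpha=0.5)
```

Because both operators are linear, they compose freely: subtraction is just addition of a negated module, which is what enables training-free recombination of skills.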
Related papers
- Mixture of Experts Using Tensor Products [44.816454454687]
In multi-task learning, the conventional approach involves training a model on multiple tasks simultaneously.
We investigate if modular language models can facilitate positive transfer and systematic generalization.
Specifically, we propose a novel modular language model (TensorPoly) that balances parameter efficiency with nuanced routing methods.
arXiv Detail & Related papers (2024-05-26T19:25:08Z)
- Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models [31.960749305728488]
We introduce a novel concept dubbed modular neural tangent kernel (mNTK)
We show that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$.
We propose a novel training strategy termed Modular Adaptive Training (MAT), which updates only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold.
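The selection rule above can be sketched as follows. Computing a real mNTK is expensive, so this toy version stands in small hand-made kernel matrices, estimates each module's principal eigenvalue by power iteration, and uses the mean eigenvalue as the dynamic threshold; all names and the threshold rule are illustrative assumptions, not the paper's method.

```python
# Toy sketch of Modular Adaptive Training's selection rule: estimate each
# module's principal eigenvalue (of a small stand-in kernel matrix) via
# power iteration, then mark for update only the modules whose eigenvalue
# exceeds a dynamic threshold (here, the mean across modules).

def principal_eigenvalue(matrix, iters=100):
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue estimate.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(mv, v))

kernels = {
    "attention": [[4.0, 1.0], [1.0, 3.0]],   # fast-converging module
    "mlp":       [[1.0, 0.2], [0.2, 0.5]],   # slow-converging module
}
lambdas = {name: principal_eigenvalue(k) for name, k in kernels.items()}
threshold = sum(lambdas.values()) / len(lambdas)  # dynamic threshold
to_update = [name for name, lam in lambdas.items() if lam > threshold]
```

Only the module with the dominant eigenvalue clears the threshold, so the slower-learning module is skipped on this step, which is the source of the training speedup.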
arXiv Detail & Related papers (2024-05-13T07:46:48Z)
- Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z)
- CoLLiE: Collaborative Training of Large Language Models in an Efficient Way [59.09824823710863]
CoLLiE is an efficient library that facilitates collaborative training of large language models.
With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization.
arXiv Detail & Related papers (2023-12-01T08:02:16Z)
- ModuleFormer: Modularity Emerges from Mixture-of-Experts [60.6148988099284]
This paper proposes a new neural network architecture, ModuleFormer, to improve the efficiency and flexibility of large language models.
Unlike the previous SMoE-based modular language model, ModuleFormer can induce modularity from uncurated data.
arXiv Detail & Related papers (2023-06-07T17:59:57Z)
- Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation under such non-stationary environments.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv Detail & Related papers (2022-11-30T16:26:38Z)
- HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks [38.43269863509866]
Parameter-efficient fine-tuning has become increasingly important for rapid transfer learning and deployment.
We design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks.
Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performances and transfer ability compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-03-08T06:51:33Z) - UniPELT: A Unified Framework for Parameter-Efficient Language Model
Tuning [64.638804236566]
We propose a unified framework, UniPELT, which incorporates different PELT methods as submodules and learns to activate the ones that best suit the current data or task setup.
Remarkably, on the GLUE benchmark, UniPELT consistently achieves 1~4% gains compared to the best individual PELT method that it incorporates and even outperforms fine-tuning under different setups.
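The "learns to activate" idea in the UniPELT summary amounts to gating each PEFT submodule's contribution. The sketch below is an illustrative assumption of how such gating could look (scalar sigmoid gates over toy output vectors), not UniPELT's actual implementation.

```python
# Illustrative sketch of UniPELT-style gating: each PEFT submodule's
# output is scaled by a gate in (0, 1) derived from a learned logit,
# so the model can softly activate whichever method suits the task.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_combine(submodule_outputs, gate_logits):
    """Sum submodule outputs, each weighted by its sigmoid gate."""
    dim = len(next(iter(submodule_outputs.values())))
    combined = [0.0] * dim
    for name, out in submodule_outputs.items():
        g = sigmoid(gate_logits[name])
        combined = [c + g * o for c, o in zip(combined, out)]
    return combined

outputs = {"adapter": [1.0, 2.0], "prefix": [0.5, -1.0], "lora": [2.0, 0.0]}
# A large negative logit effectively switches a submodule off.
logits = {"adapter": 5.0, "prefix": -5.0, "lora": 0.0}
h = gated_combine(outputs, logits)
```

Because the gates are differentiable, which submodules stay active can be learned jointly with the task, rather than chosen by hand per setup.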
arXiv Detail & Related papers (2021-10-14T17:40:08Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
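The GroupBERT entry above describes pairing a convolutional module (local interactions) with self-attention (global interactions) inside one layer. The toy sketch below illustrates that decoupling on scalar per-token features; the module shapes and the residual combination are assumptions for the example, not GroupBERT's actual architecture.

```python
# Toy sketch of the conv-plus-attention layer idea: a small 1-D
# convolution mixes each token with its neighbours (local), while a
# softmax attention mixes every token with every other token (global).
import math

def local_conv(seq, kernel=(0.25, 0.5, 0.25)):
    """Depthwise-style 1-D convolution with zero padding."""
    pad = [0.0] + list(seq) + [0.0]
    return [sum(k * pad[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]

def self_attention(seq):
    """Single-head attention over scalar tokens: global mixing."""
    out = []
    for q in seq:
        scores = [q * k for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum(e / z * v for e, v in zip(exps, seq)))
    return out

def layer(seq):
    # Residual sums keep the two modules' contributions decoupled.
    attn = self_attention(seq)
    conv = local_conv(seq)
    return [x + a + c for x, a, c in zip(seq, attn, conv)]

y = layer([1.0, 0.0, -1.0, 2.0])
```

Keeping the two paths separate lets the convolution specialise on short-range patterns while attention handles long-range dependencies, which is the efficiency argument the summary makes.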
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.