Related papers: Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models

Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models

URL: http://arxiv.org/abs/2310.16240v1
Date: Tue, 24 Oct 2023 23:29:06 GMT
Title: Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models
Authors: Raymond Li, Gabriel Murray and Giuseppe Carenini
Abstract summary: We propose a method that combines two popular research areas by injecting linguistic structures into pre-trained language models. In our approach, parallel adapter modules encoding different linguistic structures are combined using a novel Mixture-of-Linguistic-Experts architecture. Our experiment results show that our approach can outperform state-of-the-art PEFT methods with a comparable number of parameters.
Score: 22.977852629450346
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we propose a method that combines two popular research areas by injecting linguistic structures into pre-trained language models in the parameter-efficient fine-tuning (PEFT) setting. In our approach, parallel adapter modules encoding different linguistic structures are combined using a novel Mixture-of-Linguistic-Experts architecture, where Gumbel-Softmax gates are used to determine the importance of these modules at each layer of the model. To reduce the number of parameters, we first train the model for a fixed small number of steps before pruning the experts based on their importance scores. Our experiment results with three different pre-trained models show that our approach can outperform state-of-the-art PEFT methods with a comparable number of parameters. In addition, we provide additional analysis to examine the experts selected by each model at each layer to provide insights for future studies.

Related papers

A Post-Training Enhanced Optimization Approach for Small Language Models [0.0]
This paper proposes a continuous post-training alignment data construction method for small language models. The core of this method is based on the data guidance of large models, optimizing the diversity and accuracy of alignment data.
arXiv Detail & Related papers (2024-11-05T09:32:26Z)
Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures. We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO) MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts. Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks. adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. In our ablations we see that this approach leads to efficient models, that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
Exploring the State-of-the-Art Language Modeling Methods and Data Augmentation Techniques for Multilingual Clause-Level Morphology [3.8498574327875947]
We present our work on all three parts of the shared task: inflection, reinflection, and analysis. We mainly explore two approaches: Transformer models in combination with data augmentation, and exploiting the state-of-the-art language modeling techniques for morphological analysis. Our methods achieved first place in each of the three tasks and outperforms mT5-baseline with 89% for inflection, 80% for reinflection and 12% for analysis.
arXiv Detail & Related papers (2022-11-03T11:53:39Z)
Analyzing Bagging Methods for Language Models [0.5161531917413708]
We perform an analysis of bagging language models and compare single language models to bagged ensembles that are roughly equivalent in terms of final model size. Our ensembling methods are at best roughly equivalent to single LM baselines.
arXiv Detail & Related papers (2022-07-19T06:30:37Z)
Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency [62.0887259003594]
This work investigates three aspects of structured pruning on multilingual pre-trained language models: settings, algorithms, and efficiency. Experiments on nine downstream tasks show several counter-intuitive phenomena. We present Dynamic Sparsification, a simple approach that allows training the model once and adapting to different model sizes at inference.
arXiv Detail & Related papers (2022-04-06T06:29:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.