Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers
- URL: http://arxiv.org/abs/2210.05709v1
- Date: Tue, 11 Oct 2022 18:11:37 GMT
- Title: Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers
- Authors: William Held and Diyi Yang
- Abstract summary: We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
- Score: 54.4919139401528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual transformer-based models demonstrate remarkable zero and
few-shot transfer across languages by learning and reusing language-agnostic
features. However, as a fixed-size model acquires more languages, its
performance across all languages degrades, a phenomenon termed interference.
Often attributed to limited model capacity, interference is commonly addressed
by adding additional parameters despite evidence that transformer-based models
are overparameterized. In this work, we show that it is possible to reduce
interference by instead identifying and pruning language-specific parameters.
First, we use Shapley Values, a credit allocation metric from coalitional game
theory, to identify attention heads that introduce interference. Then, we show
that removing identified attention heads from a fixed model improves
performance for a target language on both sentence classification and
structural prediction, seeing gains as large as 24.7%. Finally, we provide
insights on language-agnostic and language-specific attention heads using
attention visualization.
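The abstract describes the method only at a high level; the sketch below illustrates how per-head Shapley values could be estimated by Monte Carlo permutation sampling over attention-head masks, and how negatively scoring heads could then be pruned. The model name, the synthetic dev batch, the sample counts, and the use of Hugging Face's `head_mask` argument and `prune_heads` utility are illustrative assumptions; the authors' exact estimator and evaluation setup may differ.

```python
import random
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative setup: in practice this would be a classifier already fine-tuned
# on the multilingual task; here the classification head is freshly initialized.
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base")
num_layers = model.config.num_hidden_layers
num_heads = model.config.num_attention_heads
heads = [(l, h) for l in range(num_layers) for h in range(num_heads)]

# Stand-in for a target-language dev set (random tokens and labels, illustrative only).
dev_batches = [{
    "input_ids": torch.randint(5, 1000, (4, 16)),
    "attention_mask": torch.ones(4, 16, dtype=torch.long),
    "labels": torch.randint(0, 2, (4,)),
}]

def evaluate(head_mask):
    """Target-language accuracy under the given head mask
    (shape [num_layers, num_heads]; 1 = keep head, 0 = mask it out)."""
    correct = total = 0
    with torch.no_grad():
        for batch in dev_batches:
            logits = model(batch["input_ids"],
                           attention_mask=batch["attention_mask"],
                           head_mask=head_mask).logits
            preds = logits.argmax(dim=-1)
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].numel()
    return correct / total

def shapley_estimates(num_permutations=2):
    """Monte Carlo Shapley estimate: each head's average marginal contribution
    to dev accuracy over random orderings in which heads are switched on."""
    values = {head: 0.0 for head in heads}
    for _ in range(num_permutations):
        order = random.sample(heads, len(heads))
        mask = torch.zeros(num_layers, num_heads)
        prev_score = evaluate(mask)          # empty coalition: all heads masked
        for (layer, head) in order:
            mask[layer, head] = 1.0
            score = evaluate(mask)
            values[(layer, head)] += (score - prev_score) / num_permutations
            prev_score = score
    return values

# Heads with negative estimated value hurt the target language; prune them
# permanently from the fixed model.
to_prune = {}
for (layer, head), value in shapley_estimates().items():
    if value < 0:
        to_prune.setdefault(layer, []).append(head)
model.prune_heads(to_prune)
```

Each permutation costs one forward pass over the dev data per head insertion, so a full run over all heads of a base-size model is expensive; the sketch is meant only to make the credit-allocation logic concrete.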
Related papers
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
- Language-Independent Representations Improve Zero-Shot Summarization [18.46817967804773]
Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions.
In this work, we focus on summarization and tackle the problem through the lens of language-independent representations.
We first show naively finetuned models are highly language-specific in both output behavior and internal representations, resulting in poor zero-shot performance.
arXiv Detail & Related papers (2024-04-08T17:56:43Z)
- Understanding the effects of language-specific class imbalance in multilingual fine-tuning [0.0]
We show that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with a language-specific class imbalance leads to worse performance.
We modify the traditional class-weighting approach to imbalance by calculating class weights separately for each language.
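The summary gives the idea but no recipe; the following self-contained sketch shows one way to compute inverse-frequency class weights separately per language. The toy data and the "balanced" weighting heuristic are illustrative choices, not the paper's exact procedure.

```python
from collections import Counter, defaultdict

# Toy labelled data: (language, class_label) pairs; illustrative only.
examples = [
    ("en", "pos"), ("en", "pos"), ("en", "neg"),
    ("de", "pos"), ("de", "neg"), ("de", "neg"), ("de", "neg"),
]

def per_language_class_weights(examples):
    """Inverse-frequency class weights computed separately for each language."""
    by_lang = defaultdict(list)
    for lang, label in examples:
        by_lang[lang].append(label)

    weights = {}
    for lang, labels in by_lang.items():
        counts = Counter(labels)
        n, k = len(labels), len(counts)
        # weight(c) = n / (k * count(c)), the usual "balanced" heuristic
        weights[lang] = {c: n / (k * cnt) for c, cnt in counts.items()}
    return weights

weights = per_language_class_weights(examples)
# e.g. weights["de"]["pos"] > weights["de"]["neg"] because "pos" is rarer in German.
print(weights)

# During fine-tuning, each example's loss would be scaled by
# weights[example_language][example_label] instead of a single global class weight.
```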
arXiv Detail & Related papers (2024-02-20T13:59:12Z)
- Roles of Scaling and Instruction Tuning in Language Perception: Model vs. Human Attention [58.817405319722596]
This work compares the self-attention of several large language models (LLMs) of different sizes to assess the effect of scaling and instruction tuning on language perception.
Results show that scaling enhances the human resemblance and improves effective attention by reducing reliance on trivial patterns, while instruction tuning does not.
We also find that current LLMs are consistently closer to non-native than native speakers in attention, suggesting sub-optimal language perception across all models.
arXiv Detail & Related papers (2023-10-29T17:16:40Z)
- Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks [12.7259425362286]
We investigate how multilingual models might leverage key-value memories.
For autoregressive models trained on two or more languages, do all neurons (across layers) respond equally to all languages?
Our findings reveal that the layers closest to the network's input or output tend to exhibit more language-specific behaviour compared to the layers in the middle.
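The summary reports where language-specific behaviour concentrates, not how it is measured. The sketch below shows one simple probe: hook the feed-forward (intermediate) activations of every layer of a multilingual encoder and compare their mean magnitude across languages. The model (`xlm-roberta-base`), the probe sentences, and the activation statistic are assumptions for illustration; the paper itself studies autoregressive models and may use a different measure, though the same hook-based probe applies to a decoder's MLP blocks.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model; any encoder exposing a BERT-style
# `encoder.layer[i].intermediate` module works the same way.
name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

activations = {}  # layer index -> mean |activation| for the current batch

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output: (batch, seq_len, intermediate_size) post-activation tensor
        activations[layer_idx] = output.abs().mean().item()
    return hook

handles = [
    layer.intermediate.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.encoder.layer)
]

def layer_profile(sentences):
    """Per-layer mean FFN activation magnitude for a batch of sentences."""
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        model(**batch)
    return dict(activations)

en = layer_profile(["The cat sat on the mat.", "It is raining today."])
de = layer_profile(["Die Katze sitzt auf der Matte.", "Heute regnet es."])

# Layers where the two profiles diverge most are candidates for
# language-specific behaviour; the summary above reports that such behaviour
# concentrates in the layers nearest the input and output.
gaps = {i: abs(en[i] - de[i]) for i in en}
for i, g in sorted(gaps.items(), key=lambda kv: -kv[1])[:3]:
    print(f"layer {i}: |EN-DE| activation gap = {g:.4f}")

for h in handles:
    h.remove()
```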
arXiv Detail & Related papers (2023-10-24T06:45:00Z)
- Lifting the Curse of Multilinguality by Pre-training Modular Transformers [72.46919537293068]
Multilingual pre-trained models suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages.
We introduce language-specific modules, which allow us to grow the total capacity of the model while keeping the total number of trainable parameters per language constant.
Our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.
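As a rough illustration of language-specific modules on top of shared parameters (not necessarily the paper's architecture), the sketch below adds a small per-language bottleneck after a shared feed-forward block, so per-language trainable parameters stay constant and new languages can be registered post-hoc. Dimensions and the residual wiring are illustrative choices.

```python
import torch
import torch.nn as nn

class LanguageModularFFN(nn.Module):
    """A shared feed-forward block followed by a per-language bottleneck module.

    Shared parameters are reused by every language; each language adds only a
    small module, and new languages can be added by registering a new module.
    """

    def __init__(self, d_model=768, d_ff=3072, d_adapter=64, languages=("en", "de")):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.lang_modules = nn.ModuleDict({
            lang: nn.Sequential(
                nn.Linear(d_model, d_adapter), nn.GELU(), nn.Linear(d_adapter, d_model)
            )
            for lang in languages
        })

    def add_language(self, lang, d_model=768, d_adapter=64):
        # Post-hoc language addition: only the new module would be trained.
        self.lang_modules[lang] = nn.Sequential(
            nn.Linear(d_model, d_adapter), nn.GELU(), nn.Linear(d_adapter, d_model)
        )

    def forward(self, hidden, lang):
        shared_out = self.shared(hidden)
        # Residual connection around the language-specific bottleneck.
        return shared_out + self.lang_modules[lang](shared_out)

block = LanguageModularFFN()
x = torch.randn(2, 10, 768)          # (batch, seq_len, d_model)
out_en = block(x, "en")
block.add_language("sw")             # add Swahili post-hoc
out_sw = block(x, "sw")
print(out_en.shape, out_sw.shape)    # torch.Size([2, 10, 768]) twice
```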
arXiv Detail & Related papers (2022-05-12T17:59:56Z)
- Language Model Priming for Cross-Lingual Event Extraction [1.8734449181723827]
We present a novel, language-agnostic approach to "priming" language models for the task of event extraction.
We show that by enabling the language model to better compensate for the deficits of sparse and noisy training data, our approach improves both trigger and argument detection and classification significantly over the state of the art in a zero-shot cross-lingual setting.
arXiv Detail & Related papers (2021-09-25T15:19:32Z)
- On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages.
We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
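The summary names a meta-learning algorithm without detail; purely as an illustration of the general idea, the sketch below applies a first-order (Reptile-style) meta-update over per-language episodes. This is a generic stand-in, not the paper's algorithm; it assumes a Hugging Face-style model whose forward pass returns an object with a `.loss` attribute and a dict mapping language codes to lists of training batches.

```python
import copy
import random
import torch

def reptile_multilingual(model, lang_batches, meta_steps=100,
                         inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    """First-order meta-learning over per-language episodes (generic sketch).

    Each meta-step adapts a copy of the model to one sampled language, then
    moves the shared initialization a small step toward the adapted weights,
    favouring parameters that adapt quickly to every language.
    """
    for _ in range(meta_steps):
        lang = random.choice(list(lang_batches))
        fast = copy.deepcopy(model)
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)

        # Inner loop: a few gradient steps on the sampled language.
        for batch in lang_batches[lang][:inner_steps]:
            loss = fast(**batch).loss
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Outer (meta) update: interpolate toward the adapted parameters.
        with torch.no_grad():
            for p, q in zip(model.parameters(), fast.parameters()):
                p += meta_lr * (q - p)
    return model
```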
arXiv Detail & Related papers (2020-10-06T20:48:58Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Large-scale language models that can generate long and coherent pieces of text are seen by some as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
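To make the hypothesis-testing framing concrete, the sketch below uses the crudest possible test statistic: mean per-token negative log-likelihood under a language model, thresholded to label text as generated or genuine. The scoring model (`gpt2`), the decision direction, and the threshold are illustrative assumptions, not the paper's detector; the threshold would need calibration on held-out genuine and generated samples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative detector: score text by its mean per-token negative log-likelihood
# under a language model; unusually predictable text is flagged as generated.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name).eval()

def mean_nll(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)     # loss = mean token NLL in nats
    return out.loss.item()

def looks_generated(text, threshold=3.0):
    """Crude likelihood test: low NLL (high probability under the LM) is
    treated as evidence of machine generation; threshold is illustrative."""
    return mean_nll(text) < threshold

print(looks_generated("The quick brown fox jumps over the lazy dog."))
```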
arXiv Detail & Related papers (2020-02-09T19:53:23Z)