Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers
- URL: http://arxiv.org/abs/2210.05709v1
- Date: Tue, 11 Oct 2022 18:11:37 GMT
- Title: Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers
- Authors: William Held and Diyi Yang
- Abstract summary: We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
- Score: 54.4919139401528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual transformer-based models demonstrate remarkable zero and
few-shot transfer across languages by learning and reusing language-agnostic
features. However, as a fixed-size model acquires more languages, its
performance across all languages degrades, a phenomenon termed interference.
Often attributed to limited model capacity, interference is commonly addressed
by adding additional parameters despite evidence that transformer-based models
are overparameterized. In this work, we show that it is possible to reduce
interference by instead identifying and pruning language-specific parameters.
First, we use Shapley Values, a credit allocation metric from coalitional game
theory, to identify attention heads that introduce interference. Then, we show
that removing identified attention heads from a fixed model improves
performance for a target language on both sentence classification and
structural prediction, seeing gains as large as 24.7%. Finally, we provide
insights on language-agnostic and language-specific attention heads using
attention visualization.
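The abstract describes the method only at a high level; the sketch below illustrates how per-head Shapley values could be estimated by Monte Carlo permutation sampling over attention-head masks, and how negatively scoring heads could then be pruned. The model name, the synthetic dev batch, the sample counts, and the use of Hugging Face's `head_mask` argument and `prune_heads` utility are illustrative assumptions; the authors' exact estimator and evaluation setup may differ.

```python
import random
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative setup: in practice this would be a classifier already fine-tuned
# on the multilingual task; here the classification head is freshly initialized.
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base")
num_layers = model.config.num_hidden_layers
num_heads = model.config.num_attention_heads
heads = [(l, h) for l in range(num_layers) for h in range(num_heads)]

# Stand-in for a target-language dev set (random tokens and labels, illustrative only).
dev_batches = [{
    "input_ids": torch.randint(5, 1000, (4, 16)),
    "attention_mask": torch.ones(4, 16, dtype=torch.long),
    "labels": torch.randint(0, 2, (4,)),
}]

def evaluate(head_mask):
    """Target-language accuracy under the given head mask
    (shape [num_layers, num_heads]; 1 = keep head, 0 = mask it out)."""
    correct = total = 0
    with torch.no_grad():
        for batch in dev_batches:
            logits = model(batch["input_ids"],
                           attention_mask=batch["attention_mask"],
                           head_mask=head_mask).logits
            preds = logits.argmax(dim=-1)
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].numel()
    return correct / total

def shapley_estimates(num_permutations=2):
    """Monte Carlo Shapley estimate: each head's average marginal contribution
    to dev accuracy over random orderings in which heads are switched on."""
    values = {head: 0.0 for head in heads}
    for _ in range(num_permutations):
        order = random.sample(heads, len(heads))
        mask = torch.zeros(num_layers, num_heads)
        prev_score = evaluate(mask)          # empty coalition: all heads masked
        for (layer, head) in order:
            mask[layer, head] = 1.0
            score = evaluate(mask)
            values[(layer, head)] += (score - prev_score) / num_permutations
            prev_score = score
    return values

# Heads with negative estimated value hurt the target language; prune them
# permanently from the fixed model.
to_prune = {}
for (layer, head), value in shapley_estimates().items():
    if value < 0:
        to_prune.setdefault(layer, []).append(head)
model.prune_heads(to_prune)
```

Each permutation costs one forward pass over the dev data per head insertion, so a full run over all heads of a base-size model is expensive; the sketch is meant only to make the credit-allocation logic concrete.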
Related papers
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
- Language-Independent Representations Improve Zero-Shot Summarization [18.46817967804773]
Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions.
In this work, we focus on summarization and tackle the problem through the lens of language-independent representations.
We first show naively finetuned models are highly language-specific in both output behavior and internal representations, resulting in poor zero-shot performance.
arXiv Detail & Related papers (2024-04-08T17:56:43Z)
- Understanding the effects of language-specific class imbalance in multilingual fine-tuning [0.0]
We show that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with a language-specific class imbalance leads to worse performance.
We modify the traditional class-weighting approach to imbalance by calculating class weights separately for each language.
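The summary gives the idea but no recipe; the following self-contained sketch shows one way to compute inverse-frequency class weights separately per language. The toy data and the "balanced" weighting heuristic are illustrative choices, not the paper's exact procedure.

```python
from collections import Counter, defaultdict

# Toy labelled data: (language, class_label) pairs; illustrative only.
examples = [
    ("en", "pos"), ("en", "pos"), ("en", "neg"),
    ("de", "pos"), ("de", "neg"), ("de", "neg"), ("de", "neg"),
]

def per_language_class_weights(examples):
    """Inverse-frequency class weights computed separately for each language."""
    by_lang = defaultdict(list)
    for lang, label in examples:
        by_lang[lang].append(label)

    weights = {}
    for lang, labels in by_lang.items():
        counts = Counter(labels)
        n, k = len(labels), len(counts)
        # weight(c) = n / (k * count(c)), the usual "balanced" heuristic
        weights[lang] = {c: n / (k * cnt) for c, cnt in counts.items()}
    return weights

weights = per_language_class_weights(examples)
# e.g. weights["de"]["pos"] > weights["de"]["neg"] because "pos" is rarer in German.
print(weights)

# During fine-tuning, each example's loss would be scaled by
# weights[example_language][example_label] instead of a single global class weight.
```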
arXiv Detail & Related papers (2024-02-20T13:59:12Z)
- Roles of Scaling and Instruction Tuning in Language Perception: Model vs. Human Attention [58.817405319722596]
This work compares the self-attention of several large language models (LLMs) of different sizes to assess the effect of scaling and instruction tuning on language perception.
Results show that scaling enhances the human resemblance and improves effective attention by reducing reliance on trivial patterns, while instruction tuning does not.
We also find that current LLMs are consistently closer to non-native than native speakers in attention, suggesting sub-optimal language perception across all models.
arXiv Detail & Related papers (2023-10-29T17:16:40Z)
- Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks [12.7259425362286]
We investigate how multilingual models might leverage key-value memories.
For autoregressive models trained on two or more languages, do all neurons (across layers) respond equally to all languages?
Our findings reveal that the layers closest to the network's input or output tend to exhibit more language-specific behaviour compared to the layers in the middle.
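The summary reports where language-specific behaviour concentrates, not how it is measured. The sketch below shows one simple probe: hook the feed-forward (intermediate) activations of every layer of a multilingual encoder and compare their mean magnitude across languages. The model (`xlm-roberta-base`), the probe sentences, and the activation statistic are assumptions for illustration; the paper itself studies autoregressive models and may use a different measure, though the same hook-based probe applies to a decoder's MLP blocks.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model; any encoder exposing a BERT-style
# `encoder.layer[i].intermediate` module works the same way.
name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

activations = {}  # layer index -> mean |activation| for the current batch

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output: (batch, seq_len, intermediate_size) post-activation tensor
        activations[layer_idx] = output.abs().mean().item()
    return hook

handles = [
    layer.intermediate.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.encoder.layer)
]

def layer_profile(sentences):
    """Per-layer mean FFN activation magnitude for a batch of sentences."""
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        model(**batch)
    return dict(activations)

en = layer_profile(["The cat sat on the mat.", "It is raining today."])
de = layer_profile(["Die Katze sitzt auf der Matte.", "Heute regnet es."])

# Layers where the two profiles diverge most are candidates for
# language-specific behaviour; the summary above reports that such behaviour
# concentrates in the layers nearest the input and output.
gaps = {i: abs(en[i] - de[i]) for i in en}
for i, g in sorted(gaps.items(), key=lambda kv: -kv[1])[:3]:
    print(f"layer {i}: |EN-DE| activation gap = {g:.4f}")

for h in handles:
    h.remove()
```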
arXiv Detail & Related papers (2023-10-24T06:45:00Z)
- Lifting the Curse of Multilinguality by Pre-training Modular Transformers [72.46919537293068]
Multilingual pre-trained models suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages.
We introduce language-specific modules, which allow us to grow the total capacity of the model while keeping the total number of trainable parameters per language constant.
Our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.
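As a rough illustration of language-specific modules on top of shared parameters (not necessarily the paper's architecture), the sketch below adds a small per-language bottleneck after a shared feed-forward block, so per-language trainable parameters stay constant and new languages can be registered post-hoc. Dimensions and the residual wiring are illustrative choices.

```python
import torch
import torch.nn as nn

class LanguageModularFFN(nn.Module):
    """A shared feed-forward block followed by a per-language bottleneck module.

    Shared parameters are reused by every language; each language adds only a
    small module, and new languages can be added by registering a new module.
    """

    def __init__(self, d_model=768, d_ff=3072, d_adapter=64, languages=("en", "de")):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.lang_modules = nn.ModuleDict({
            lang: nn.Sequential(
                nn.Linear(d_model, d_adapter), nn.GELU(), nn.Linear(d_adapter, d_model)
            )
            for lang in languages
        })

    def add_language(self, lang, d_model=768, d_adapter=64):
        # Post-hoc language addition: only the new module would be trained.
        self.lang_modules[lang] = nn.Sequential(
            nn.Linear(d_model, d_adapter), nn.GELU(), nn.Linear(d_adapter, d_model)
        )

    def forward(self, hidden, lang):
        shared_out = self.shared(hidden)
        # Residual connection around the language-specific bottleneck.
        return shared_out + self.lang_modules[lang](shared_out)

block = LanguageModularFFN()
x = torch.randn(2, 10, 768)          # (batch, seq_len, d_model)
out_en = block(x, "en")
block.add_language("sw")             # add Swahili post-hoc
out_sw = block(x, "sw")
print(out_en.shape, out_sw.shape)    # torch.Size([2, 10, 768]) twice
```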
arXiv Detail & Related papers (2022-05-12T17:59:56Z)
- Language Model Priming for Cross-Lingual Event Extraction [1.8734449181723827]
We present a novel, language-agnostic approach to "priming" language models for the task of event extraction.
We show that by enabling the language model to better compensate for the deficits of sparse and noisy training data, our approach improves both trigger and argument detection and classification significantly over the state of the art in a zero-shot cross-lingual setting.
arXiv Detail & Related papers (2021-09-25T15:19:32Z)
- On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages.
We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
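The summary names a meta-learning algorithm without detail; purely as an illustration of the general idea, the sketch below applies a first-order (Reptile-style) meta-update over per-language episodes. This is a generic stand-in, not the paper's algorithm; it assumes a Hugging Face-style model whose forward pass returns an object with a `.loss` attribute and a dict mapping language codes to lists of training batches.

```python
import copy
import random
import torch

def reptile_multilingual(model, lang_batches, meta_steps=100,
                         inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    """First-order meta-learning over per-language episodes (generic sketch).

    Each meta-step adapts a copy of the model to one sampled language, then
    moves the shared initialization a small step toward the adapted weights,
    favouring parameters that adapt quickly to every language.
    """
    for _ in range(meta_steps):
        lang = random.choice(list(lang_batches))
        fast = copy.deepcopy(model)
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)

        # Inner loop: a few gradient steps on the sampled language.
        for batch in lang_batches[lang][:inner_steps]:
            loss = fast(**batch).loss
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Outer (meta) update: interpolate toward the adapted parameters.
        with torch.no_grad():
            for p, q in zip(model.parameters(), fast.parameters()):
                p += meta_lr * (q - p)
    return model
```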
arXiv Detail & Related papers (2020-10-06T20:48:58Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Large-scale language models that can generate long and coherent pieces of text are seen by some as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
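To make the hypothesis-testing framing concrete, the sketch below uses the crudest possible test statistic: mean per-token negative log-likelihood under a language model, thresholded to label text as generated or genuine. The scoring model (`gpt2`), the decision direction, and the threshold are illustrative assumptions, not the paper's detector; the threshold would need calibration on held-out genuine and generated samples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative detector: score text by its mean per-token negative log-likelihood
# under a language model; unusually predictable text is flagged as generated.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name).eval()

def mean_nll(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)     # loss = mean token NLL in nats
    return out.loss.item()

def looks_generated(text, threshold=3.0):
    """Crude likelihood test: low NLL (high probability under the LM) is
    treated as evidence of machine generation; threshold is illustrative."""
    return mean_nll(text) < threshold

print(looks_generated("The quick brown fox jumps over the lazy dog."))
```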
arXiv Detail & Related papers (2020-02-09T19:53:23Z)