Contributions of Transformer Attention Heads in Multi- and Cross-lingual
Tasks
- URL: http://arxiv.org/abs/2108.08375v1
- Date: Wed, 18 Aug 2021 20:17:46 GMT
- Title: Contributions of Transformer Attention Heads in Multi- and Cross-lingual
Tasks
- Authors: Weicheng Ma, Kai Zhang, Renze Lou, Lili Wang, Soroush Vosoughi
- Abstract summary: We show that pruning a number of attention heads in a multi-lingual Transformer-based model has, in general, positive effects on its performance in cross-lingual and multi-lingual tasks.
For comprehensiveness, we examine two pre-trained multi-lingual models, namely multi-lingual BERT (mBERT) and XLM-R, on three tasks across 9 languages each.
- Score: 9.913751245347429
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper studies the relative importance of attention heads in
Transformer-based models to aid their interpretability in cross-lingual and
multi-lingual tasks. Prior research has found that only a few attention heads
are important in each mono-lingual Natural Language Processing (NLP) task and
pruning the remaining heads leads to comparable or improved performance of the
model. However, the impact of pruning attention heads is not yet clear in
cross-lingual and multi-lingual tasks. Through extensive experiments, we show
that (1) pruning a number of attention heads in a multi-lingual
Transformer-based model has, in general, positive effects on its performance in
cross-lingual and multi-lingual tasks and (2) the attention heads to be pruned
can be ranked using gradients and identified with a few trial experiments. Our
experiments focus on sequence labeling tasks, with potential applicability to
other cross-lingual and multi-lingual tasks. For comprehensiveness, we examine
two pre-trained multi-lingual models, namely multi-lingual BERT (mBERT) and
XLM-R, on three tasks across 9 languages each. We also discuss the validity of
our findings and their extensibility to truly resource-scarce languages and
other task settings.
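As a concrete illustration of the gradient-based head ranking described above, the sketch below scores each attention head of mBERT by the gradient of the task loss with respect to a per-head mask and prunes the lowest-ranked heads with the Hugging Face transformers pruning utility. This is a minimal sketch, not the authors' released code; `dev_dataloader`, the 25% pruning ratio, and the 9-label tag set are illustrative assumptions.

```python
# Minimal sketch of gradient-based attention-head ranking and pruning for a
# multi-lingual sequence labeling model (mBERT). Not the authors' code;
# dev_dataloader, num_labels, and the pruning ratio are placeholder assumptions.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=9)  # e.g. an NER tag set

n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads

# A head mask of ones; the magnitude of its gradient serves as a proxy for
# each head's importance to the task loss.
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
importance = torch.zeros(n_layers, n_heads)

model.eval()
for batch in dev_dataloader:  # hypothetical dataloader over a labeled dev set
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"],
                    head_mask=head_mask)
    grads = torch.autograd.grad(outputs.loss, head_mask)[0]
    importance += grads.abs()

# Rank heads by accumulated importance and prune the lowest-scoring 25%.
flat = importance.flatten()
k = int(0.25 * flat.numel())
heads_to_prune = {}
for idx in flat.argsort()[:k].tolist():
    layer, head = divmod(idx, n_heads)
    heads_to_prune.setdefault(layer, []).append(head)
model.prune_heads(heads_to_prune)  # built-in head pruning in transformers
```

In line with the abstract's "few trial experiments", the fraction of heads to prune would in practice be chosen by sweeping a handful of candidate ratios and keeping the one that maximizes development-set performance on the target languages.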
Related papers
- Analyzing the Evaluation of Cross-Lingual Knowledge Transfer in
Multilingual Language Models [12.662039551306632]
We show that the observed high performance of multilingual models can be largely attributed to factors not requiring the transfer of actual linguistic knowledge.
More specifically, we observe that what has been transferred across languages is mostly data artifacts and biases, especially for low-resource languages.
arXiv Detail & Related papers (2024-02-03T09:41:52Z)
- Team QUST at SemEval-2023 Task 3: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting Online News Genre, Framing and Persuasion Techniques [0.030458514384586396]
This paper describes the participation of team QUST in SemEval-2023 Task 3.
The monolingual models are first evaluated with under-sampling of the majority classes.
The pre-trained multilingual model is then fine-tuned with a combination of class weights and sample weights.
arXiv Detail & Related papers (2023-04-09T08:14:01Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
- Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads? [16.392272086563175]
This paper aims to analyze individual components of a multilingual neural machine translation (NMT) model.
We look at the encoder self-attention and encoder-decoder attention heads that are more specific to the translation of a certain language pair than others.
Experimental results show that, surprisingly, the set of most important attention heads is very similar across language pairs.
arXiv Detail & Related papers (2021-05-31T13:15:55Z)
- Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
- First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT [2.2931318723689276]
Cross-lingual transfer emerges from fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning.
We show that multilingual BERT can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor.
While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance for the transfer and can be reinitialized during fine-tuning.
arXiv Detail & Related papers (2021-01-26T22:12:38Z)
- Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, peak performance is not reached using general-purpose multilingual text encoders 'off-the-shelf', but rather by relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)