Do Multilingual Neural Machine Translation Models Contain Language Pair
Specific Attention Heads?
- URL: http://arxiv.org/abs/2105.14940v1
- Date: Mon, 31 May 2021 13:15:55 GMT
- Title: Do Multilingual Neural Machine Translation Models Contain Language Pair
Specific Attention Heads?
- Authors: Zae Myung Kim, Laurent Besacier, Vassilina Nikoulina, Didier Schwab
- Abstract summary: This paper aims to analyze individual components of a multilingual neural machine translation (NMT) model.
We look at the encoder self-attention and encoder-decoder attention heads that are more specific to the translation of a certain language pair than others.
Experimental results show that, surprisingly, the set of most important attention heads is very similar across the language pairs.
- Score: 16.392272086563175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies on the analysis of multilingual representations focus
on identifying whether there is an emergence of language-independent
representations, or whether a multilingual model partitions its weights among
different languages. While most such work has been conducted in a "black-box"
manner, this paper aims to analyze individual components of a multilingual
neural machine translation (NMT) model. In particular, we look at the
encoder self-attention and encoder-decoder attention heads (in a many-to-one
NMT model) that are more specific to the translation of a certain language pair
than others by (1) employing metrics that quantify some aspects of the
attention weights such as "variance" or "confidence", and (2) systematically
ranking the importance of attention heads with respect to translation quality.
Experimental results show that, surprisingly, the set of most important
attention heads is very similar across the language pairs and that it is
possible to remove nearly one-third of the less important heads without
greatly hurting translation quality.
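The two analysis tools described in the abstract can be made concrete with a short
sketch. The snippet below is a minimal illustration only, not the authors' released
code: it assumes attention tensors of shape [batch, num_heads, tgt_len, src_len] and
a hypothetical evaluate_bleu callable that decodes a validation set under a given
binary head mask; the paper's exact metric definitions may differ.

```python
# Illustrative sketch (assumptions, not the authors' code): per-head attention
# statistics such as "confidence" and positional variance, plus ranking heads
# by the BLEU drop observed when a single head is masked out.
import torch


def head_confidence(attn: torch.Tensor) -> torch.Tensor:
    """Mean maximum attention weight per head (higher = more 'confident')."""
    # attn: [batch, num_heads, tgt_len, src_len]; each attention row sums to 1.
    max_per_query = attn.max(dim=-1).values      # [batch, num_heads, tgt_len]
    return max_per_query.mean(dim=(0, 2))        # [num_heads]


def head_positional_variance(attn: torch.Tensor) -> torch.Tensor:
    """Variance of attended source positions per head (lower = more focused)."""
    src_len = attn.size(-1)
    pos = torch.arange(src_len, dtype=attn.dtype, device=attn.device)
    mean_pos = (attn * pos).sum(dim=-1, keepdim=True)     # [b, h, t, 1]
    var_pos = (attn * (pos - mean_pos) ** 2).sum(dim=-1)  # [b, h, t]
    return var_pos.mean(dim=(0, 2))                       # [num_heads]


def rank_heads_by_importance(evaluate_bleu, num_layers: int, num_heads: int):
    """Rank heads by the BLEU drop caused by masking each head individually.

    `evaluate_bleu(mask)` is a hypothetical callable: it decodes a validation
    set with the binary mask (shape [num_layers, num_heads], 1 = keep, 0 = off)
    applied to the attention heads and returns a BLEU score.
    """
    full_mask = torch.ones(num_layers, num_heads)
    baseline = evaluate_bleu(full_mask)
    ranking = []
    for layer in range(num_layers):
        for head in range(num_heads):
            mask = full_mask.clone()
            mask[layer, head] = 0.0  # silence this one head
            ranking.append((layer, head, baseline - evaluate_bleu(mask)))
    # Largest quality drop first: the heads the model can least afford to lose.
    return sorted(ranking, key=lambda x: x[2], reverse=True)
```

Heads with the smallest quality drop are the natural candidates for the
"remove nearly one-third of the less important heads" observation in the abstract.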
Related papers
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on multilingual models rather than the machine translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine-translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z) - Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z) - Informative Language Representation Learning for Massively Multilingual
Neural Machine Translation [47.19129812325682]
In a multilingual neural machine translation model, an artificial language token is usually used to guide translation into the desired target language.
Recent studies show that prepending language tokens sometimes fails to steer multilingual neural machine translation models in the right translation directions.
We propose two methods, language embedding embodiment and language-aware multi-head attention, to learn informative language representations that channel translation into the right directions.
arXiv Detail & Related papers (2022-09-04T04:27:17Z) - High-resource Language-specific Training for Multilingual Neural Machine
Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of
Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - Contributions of Transformer Attention Heads in Multi- and Cross-lingual
Tasks [9.913751245347429]
We show that pruning a number of attention heads in a multi-lingual Transformer-based model has, in general, positive effects on its performance in cross-lingual and multi-lingual tasks.
For comprehensiveness, we examine two pre-trained multi-lingual models, namely multi-lingual BERT (mBERT) and XLM-R, on three tasks across 9 languages each.
arXiv Detail & Related papers (2021-08-18T20:17:46Z) - Importance-based Neuron Allocation for Multilingual Neural Machine
Translation [27.65375150324557]
We propose to divide the model neurons into general and language-specific parts based on their importance across languages.
The general part is responsible for preserving the general knowledge and participating in the translation of all the languages.
The language-specific part is responsible for preserving the language-specific knowledge and participating in the translation of some specific languages.
arXiv Detail & Related papers (2021-07-14T09:15:05Z) - First Align, then Predict: Understanding the Cross-Lingual Ability of
Multilingual BERT [2.2931318723689276]
Cross-lingual transfer emerges from fine-tuning on a task of interest in one language and evaluating on a distinct language not seen during fine-tuning.
We show that multilingual BERT can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor.
While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance for the transfer and can be reinitialized during fine-tuning.
arXiv Detail & Related papers (2021-01-26T22:12:38Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - Knowledge Distillation for Multilingual Unsupervised Neural Machine
Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.