Deep Transformers with Latent Depth
- URL: http://arxiv.org/abs/2009.13102v2
- Date: Fri, 16 Oct 2020 03:50:56 GMT
- Title: Deep Transformers with Latent Depth
- Authors: Xian Li, Asa Cooper Stickland, Yuqing Tang, and Xiang Kong
- Abstract summary: The Transformer model has achieved state-of-the-art performance in many sequence modeling tasks.
We present a probabilistic framework to automatically learn which layer(s) to use by learning the posterior distributions of layer selection.
We propose a novel method to train one shared Transformer network for multilingual machine translation.
- Score: 42.33955275626127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer model has achieved state-of-the-art performance in many
sequence modeling tasks. However, how to leverage model capacity with large or
variable depths is still an open challenge. We present a probabilistic
framework to automatically learn which layer(s) to use by learning the
posterior distributions of layer selection. As an extension of this framework,
we propose a novel method to train one shared Transformer network for
multilingual machine translation with different layer selection posteriors for
each language pair. The proposed method alleviates the vanishing gradient issue
and enables stable training of deep Transformers (e.g. 100 layers). We evaluate
on WMT English-German machine translation and masked language modeling tasks,
where our method outperforms existing approaches for training deeper
Transformers. Experiments on multilingual machine translation demonstrate that
this approach can effectively leverage increased model capacity and bring
universal improvement for both many-to-one and one-to-many translation with
diverse language pairs.
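
To make the layer-selection idea concrete, here is a minimal PyTorch-style sketch (an illustration under assumed details such as a Gumbel-Sigmoid relaxation and threshold-0.5 inference, not the authors' exact implementation): each layer of a shared encoder is gated by a learned selection logit, with one row of logits per language pair parameterizing that pair's posterior over layer selection.

```python
import torch
import torch.nn as nn

class LatentDepthEncoder(nn.Module):
    """Sketch: Transformer encoder whose layers are gated by learned
    per-language-pair selection probabilities (latent depth)."""

    def __init__(self, num_layers=24, d_model=512, nhead=8, num_lang_pairs=10):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )
        # One logit per (language pair, layer): parameterizes q(z | language pair).
        self.select_logits = nn.Parameter(torch.zeros(num_lang_pairs, num_layers))

    def forward(self, x, lang_pair_id, tau=1.0):
        logits = self.select_logits[lang_pair_id]            # (num_layers,)
        if self.training:
            # Gumbel-Sigmoid relaxation: differentiable soft layer selection.
            u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)
            gates = torch.sigmoid((logits + noise) / tau)
        else:
            # At inference, keep layers whose selection probability exceeds 0.5.
            gates = (torch.sigmoid(logits) > 0.5).float()
        for layer, g in zip(self.layers, gates):
            x = g * layer(x) + (1.0 - g) * x                  # skip the layer when gate ~ 0
        return x
```

During multilingual training, each batch's language-pair id selects its own row of logits, so different language pairs can settle on different effective depths within the same shared stack.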
Related papers
- Using Machine Translation to Augment Multilingual Classification [0.0]
We explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages.
We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
arXiv Detail & Related papers (2024-05-09T00:31:59Z)
- Low-resource neural machine translation with morphological modeling [3.3721926640077804]
Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation.
We propose a framework-solution for modeling complex morphology in low-resource settings.
We evaluate our proposed solution on Kinyarwanda-English translation using public-domain parallel text.
arXiv Detail & Related papers (2024-04-03T01:31:41Z)
- On the Pareto Front of Multilingual Neural Machine Translation [123.94355117635293]
We study how the performance of a given translation direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT).
We propose the Double Power Law to predict the unique performance trade-off front in MNMT.
In our experiments, it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget.
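
For reference, the temperature-searching baseline mentioned above sets per-direction sampling ratios as p_i proportional to n_i^(1/T); below is a minimal sketch with made-up dataset sizes (the Double Power Law itself is defined in the linked paper).

```python
def temperature_sampling_ratios(dataset_sizes, temperature=5.0):
    """Standard temperature-based sampling: p_i proportional to n_i^(1/T).

    Higher temperatures flatten the distribution, upsampling low-resource
    directions relative to their raw data share."""
    scaled = [n ** (1.0 / temperature) for n in dataset_sizes.values()]
    total = sum(scaled)
    return {direction: s / total for direction, s in zip(dataset_sizes, scaled)}

# Example: three directions with very different amounts of parallel data.
print(temperature_sampling_ratios({"en-de": 4_500_000, "en-tr": 200_000, "en-gu": 10_000}))
```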
arXiv Detail & Related papers (2023-04-06T16:49:19Z)
- Multilingual Transformer Encoders: a Word-Level Task-Agnostic Evaluation [0.6882042556551609]
Some Transformer-based models can perform cross-lingual transfer learning.
We propose a word-level task-agnostic method to evaluate the alignment of contextualized representations built by such models.
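
One common way to realize a word-level, task-agnostic alignment evaluation is nearest-neighbor retrieval over contextual embeddings of word-aligned token pairs; the sketch below assumes precomputed embeddings and alignments and illustrates that general idea, not necessarily the paper's exact protocol.

```python
import numpy as np

def word_alignment_accuracy(src_vecs, tgt_vecs):
    """src_vecs[i] and tgt_vecs[i] are contextual embeddings of word-aligned
    tokens from parallel sentences (shape: [n_pairs, dim]).

    Returns the fraction of source tokens whose nearest target embedding
    (by cosine similarity) is their aligned counterpart."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                       # cosine similarity matrix
    nearest = sims.argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())
```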
arXiv Detail & Related papers (2022-07-19T05:23:18Z)
- Lightweight Cross-Lingual Sentence Representation Learning [57.9365829513914]
We introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations.
We propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task.
arXiv Detail & Related papers (2021-05-28T14:10:48Z)
- Serial or Parallel? Plug-able Adapter for multilingual machine translation [15.114588783601466]
We propose PAM, a Transformer model augmented with defusion adaptation for multilingual machine translation.
PAM consists of embedding and layer adapters to shift the word and intermediate representations towards language-specific ones.
Experimental results on the IWSLT, OPUS-100, and WMT benchmarks show that the method outperforms several strong competitors.
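
For readers unfamiliar with layer adapters, a generic bottleneck adapter (project down, nonlinearity, project up, residual) with one instance per language looks roughly like the sketch below; the dimensions and placement are illustrative assumptions, not PAM's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic language-specific adapter: project down, apply a nonlinearity,
    project back up, then add a residual connection."""

    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))

# One adapter per language, applied to a shared layer's output.
adapters = nn.ModuleDict({lang: BottleneckAdapter() for lang in ["de", "fr", "zh"]})
shared_output = torch.randn(2, 10, 512)           # (batch, seq, d_model)
adapted = adapters["de"](shared_output)
```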
arXiv Detail & Related papers (2021-04-16T14:58:28Z)
- SML: a new Semantic Embedding Alignment Transformer for efficient cross-lingual Natural Language Inference [71.57324258813674]
The ability of Transformers to perform a variety of tasks with precision, such as question answering, Natural Language Inference (NLI), or summarisation, has made them one of the best paradigms for addressing such tasks at present.
NLI is one of the best scenarios for testing these architectures, due to the knowledge required to understand complex sentences and establish a relation between a hypothesis and a premise.
In this paper, we propose a new architecture, siamese multilingual transformer, to efficiently align multilingual embeddings for Natural Language Inference.
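
A minimal sketch of the siamese idea: a single shared multilingual encoder embeds premise and hypothesis separately, and a small classifier consumes the combined features [u, v, |u - v|]; the pooling and feature combination follow common sentence-pair practice and are assumptions, not necessarily SML's exact design.

```python
import torch
import torch.nn as nn

class SiameseNLI(nn.Module):
    """Shared encoder applied to premise and hypothesis; the classifier sees
    [u, v, |u - v|], a common feature combination for sentence-pair tasks."""

    def __init__(self, encoder, d_model=768, num_labels=3):
        super().__init__()
        self.encoder = encoder                    # any module: tokens -> (batch, seq, d_model)
        self.classifier = nn.Linear(3 * d_model, num_labels)

    def embed(self, token_states):
        return token_states.mean(dim=1)           # mean pooling over tokens

    def forward(self, premise_tokens, hypothesis_tokens):
        u = self.embed(self.encoder(premise_tokens))
        v = self.embed(self.encoder(hypothesis_tokens))
        return self.classifier(torch.cat([u, v, (u - v).abs()], dim=-1))
```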
arXiv Detail & Related papers (2021-03-17T13:23:53Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
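
One common way to cut Transformer parameters this aggressively is to share a single layer's weights across the whole stack; the sketch below illustrates that general strategy only and is not the paper's specific parameter-sharing scheme.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Applies the same Transformer layer `depth` times, so the parameter count
    stays roughly that of a single layer regardless of depth."""

    def __init__(self, d_model=512, nhead=8, depth=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)
        return x
```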
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
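
A minimal sketch of random online backtranslation as described: for each example, the target sentence is translated into a randomly chosen other language with the current model, creating a synthetic pair for an otherwise unseen direction; the `translate` callable here is a stand-in for the model's own decoding step.

```python
import random

def random_online_backtranslation(batch, languages, translate):
    """For each (src_lang, tgt_lang, src, tgt) example, back-translate `tgt`
    into a randomly chosen language to create a synthetic training pair for a
    direction not covered by the parallel data.

    `translate(sentence, src_lang, tgt_lang)` is assumed to call the current
    multilingual model (online, so quality improves as training proceeds)."""
    synthetic = []
    for src_lang, tgt_lang, src, tgt in batch:
        aux_lang = random.choice([l for l in languages if l != tgt_lang])
        synthetic_src = translate(tgt, tgt_lang, aux_lang)   # back-translate the target side
        synthetic.append((aux_lang, tgt_lang, synthetic_src, tgt))
    return synthetic
```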
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
- Multi-layer Representation Fusion for Neural Machine Translation [38.12309528346962]
We propose a multi-layer representation fusion (MLRF) approach to fusing stacked layers.
In particular, we design three fusion functions to learn a better representation from the stack.
The result is a new state of the art in German-English translation.
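
As one illustrative fusion function (an assumption; the paper designs three of its own), the sketch below takes a softmax-weighted combination of all stacked layer outputs instead of using only the top layer.

```python
import torch
import torch.nn as nn

class WeightedLayerFusion(nn.Module):
    """Fuses the outputs of all stacked layers with learned softmax weights,
    rather than using only the top layer's representation."""

    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq, d_model) tensors, one per layer.
        stacked = torch.stack(layer_outputs, dim=0)           # (L, batch, seq, d_model)
        alphas = torch.softmax(self.weights, dim=0)           # (L,)
        return (alphas.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```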
arXiv Detail & Related papers (2020-02-16T23:53:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.