Related papers: nmT5 -- Is parallel data still relevant for pre-training massively multilingual language models?

URL: http://arxiv.org/abs/2106.02171v1
Date: Thu, 3 Jun 2021 23:12:27 GMT
Title: nmT5 -- Is parallel data still relevant for pre-training massively multilingual language models?
Authors: Mihir Kale, Aditya Siddhant, Noah Constant, Melvin Johnson, Rami Al-Rfou, Linting Xue
Abstract summary: We investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation is a straightforward way to improve performance.
Score: 9.560948239388662
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.

Related papers

Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment [6.718469075779034]
We show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus can substantially improve representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages.
arXiv Detail & Related papers (2026-02-25T03:58:24Z)
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora [85.44082712798553]
We introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. This dataset spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Experiments show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.
arXiv Detail & Related papers (2025-05-20T07:43:45Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.459861376459656]
EMMA-500 is a large-scale multilingual language model continue-trained on texts across 546 languages. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity.
arXiv Detail & Related papers (2024-09-26T14:40:45Z)
Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges. These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z)
mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences [17.461172187276734]
This model builds upon the architecture of LongT5, while leveraging the multilingual datasets used for pretraining mT5 and the pretraining tasks of UL2. We evaluate this model on a variety of multilingual summarization and question-answering tasks, and the results show stronger performance for mLongT5 when compared to existing multilingual models such as mBART or M-BERT.
arXiv Detail & Related papers (2023-05-18T17:22:53Z)
PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation [5.004814662623874]
This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks.
arXiv Detail & Related papers (2023-04-03T18:19:26Z)
Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data. We propose two metrics for automatically removing such translations from the resulting datasets. In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
Bootstrapping Multilingual Semantic Parsers using Large Language Models [28.257114724384806]
The translate-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models. We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting.
arXiv Detail & Related papers (2022-10-13T19:34:14Z)
Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems. We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models. Our model can work well even with a small number of parallel sentences.
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models. We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks. We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
The "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks. We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext. We show that multilingual translation models can be created through multilingual finetuning. We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
