A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
- URL: http://arxiv.org/abs/2401.05749v2
- Date: Wed, 5 Jun 2024 20:49:57 GMT
- Title: A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
- Authors: Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
- Abstract summary: We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT).
We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT.
Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.
- Score: 9.212370563134206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.
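To make the multi-way parallelism signal concrete, here is a minimal sketch of grouping scraped sentences into translation tuples and counting how many languages each tuple covers. The record format and alignment ids are hypothetical stand-ins for illustration, not the paper's actual corpus pipeline.

```python
from collections import defaultdict

# Hypothetical records: (text, language, alignment id), where a shared
# alignment id marks sentences as translations of one another. Toy data
# only; the paper's corpus is mined at web scale, not hand-listed.
records = [
    ("the quick brown fox", "en", "t1"),
    ("le renard brun rapide", "fr", "t1"),
    ("der schnelle braune fuchs", "de", "t1"),
    ("hello world", "en", "t2"),
    ("bonjour le monde", "fr", "t2"),
]

# Group sentences into multi-way parallel tuples by alignment id.
tuples = defaultdict(set)
for text, lang, align_id in records:
    tuples[align_id].add(lang)

# Sentences translated into many languages are "highly multi-way parallel";
# the paper finds such content is disproportionately machine translated.
highly_parallel = [t for t, langs in tuples.items() if len(langs) >= 3]
print(f"{len(highly_parallel)}/{len(tuples)} tuples are 3+ way parallel")
```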
Related papers
- m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt [39.2728779674405]
We propose a framework that leverages multimodal prompts to guide Multimodal Multilingual neural Machine Translation (m3P).
Our method aims to minimize the representation distance of different languages by regarding the image as a central language.
Experimental results show that m3P outperforms previous text-only baselines and multilingual multimodal methods by a large margin.
arXiv Detail & Related papers (2024-03-26T10:04:24Z)
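As a rough illustration of the "image as a central language" idea in the m3P entry above, the sketch below computes a cosine-alignment loss pulling each language's text embedding toward a shared image embedding. The function, tensor shapes, and loss form are illustrative assumptions, not m3P's actual objective.

```python
import torch
import torch.nn.functional as F

def image_centered_alignment_loss(image_emb, text_embs):
    """Pull each language's text embedding toward the shared image embedding.

    image_emb: (batch, dim) image representations
    text_embs: list of (batch, dim) tensors, one per language
    """
    image_emb = F.normalize(image_emb, dim=-1)
    loss = 0.0
    for emb in text_embs:
        emb = F.normalize(emb, dim=-1)
        # 1 - cosine similarity, averaged over the batch
        loss = loss + (1.0 - (image_emb * emb).sum(dim=-1)).mean()
    return loss / len(text_embs)

# Toy usage with random features, e.g. en/fr/de text encoders.
img = torch.randn(4, 256)
texts = [torch.randn(4, 256) for _ in range(3)]
print(image_centered_alignment_loss(img, texts))
```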
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective paradigm for training large multimodal models in non-English languages.
We build the large multimodal models VisCPM for image-to-text and text-to-image generation, which achieve state-of-the-art open-source performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent approaches still require training a separate model for each language pair, which becomes costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z)
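One plausible reading of the language-aware visual prompt in the LVP-M3 entry above is a module that maps image features, conditioned on the target language, to prompt vectors for the translation decoder. The class below is a schematic sketch; the layer sizes, embedding-based conditioning, and prompt length are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LanguageAwareVisualPrompt(nn.Module):
    """Map image features to decoder prompt vectors for a target language."""

    def __init__(self, num_languages=7, visual_dim=512, model_dim=256, prompt_len=4):
        super().__init__()
        self.lang_emb = nn.Embedding(num_languages, model_dim)
        self.proj = nn.Linear(visual_dim + model_dim, prompt_len * model_dim)
        self.prompt_len, self.model_dim = prompt_len, model_dim

    def forward(self, visual_feats, lang_ids):
        # visual_feats: (batch, visual_dim); lang_ids: (batch,)
        cond = torch.cat([visual_feats, self.lang_emb(lang_ids)], dim=-1)
        prompts = self.proj(cond).view(-1, self.prompt_len, self.model_dim)
        return prompts  # would be prepended to the decoder's input states

# Toy usage: prompts for 2 images, targeting languages 0 and 1.
module = LanguageAwareVisualPrompt()
print(module(torch.randn(2, 512), torch.tensor([0, 1])).shape)
```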
- Bootstrapping Multilingual Semantic Parsers using Large Language Models [28.257114724384806]
The translate-train paradigm of transferring English datasets to multiple languages remains the key ingredient for training task-specific multilingual models.
We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting.
arXiv Detail & Related papers (2022-10-13T19:34:14Z)
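The few-shot translate-train recipe in the entry above can be sketched as a prompt that asks an LLM to translate English training utterances into a target language; the translations are then paired with the original semantic parses. The template and example pairs below are hypothetical, and the paper's exact prompts may differ.

```python
# Hypothetical few-shot pairs used to condition the LLM.
FEW_SHOT = [
    ("set an alarm for 7 am", "mets une alarme à 7 heures"),
    ("play some jazz music", "joue du jazz"),
]

def build_translation_prompt(english_utterance, target_language="French"):
    lines = [f"Translate the following English sentences to {target_language}."]
    for en, tgt in FEW_SHOT:
        lines.append(f"English: {en}")
        lines.append(f"{target_language}: {tgt}")
    lines.append(f"English: {english_utterance}")
    lines.append(f"{target_language}:")
    return "\n".join(lines)

# The resulting prompt would be sent to an LLM; each translated utterance is
# then paired with its original semantic parse to train a multilingual parser.
print(build_translation_prompt("remind me to call mom tomorrow"))
```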
- Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning [48.15259834021655]
We present a pragmatic approach towards building a multilingual machine translation model that covers hundreds of languages.
We use a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs.
We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
arXiv Detail & Related papers (2022-01-09T23:36:44Z)
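The mixed-objective training in the entry above can be sketched as a step that applies a supervised translation loss when a parallel target exists and falls back to a self-supervised denoising loss otherwise. The ToyModel and batch format below are stand-ins, not the paper's implementation.

```python
class ToyModel:
    """Stand-in for an encoder-decoder translation model."""

    def translation_loss(self, source, target):
        return 0.1  # placeholder for a supervised cross-entropy loss

    def denoising_loss(self, source):
        return 0.2  # placeholder for a self-supervised denoising loss

def training_step(model, batch):
    # Use the supervised objective when parallel data is available for this
    # language pair; otherwise train on monolingual text with denoising.
    if batch.get("target") is not None:
        return model.translation_loss(batch["source"], batch["target"])
    return model.denoising_loss(batch["source"])

model = ToyModel()
parallel_batch = {"source": "hello world", "target": "bonjour le monde"}
monolingual_batch = {"source": "solo texto monolingüe", "target": None}
print(training_step(model, parallel_batch))     # supervised path
print(training_step(model, monolingual_batch))  # self-supervised path
```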
- Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
- Many-to-English Machine Translation Tools, Data, and Pretrained Models [19.49814793168753]
We present useful tools for machine translation research: MTData, NLCodec, and RTG.
We create a multilingual neural machine translation model capable of translating from 500 source languages to English.
arXiv Detail & Related papers (2021-04-01T06:55:12Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
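The distillation step in the LBMRC entry above can be approximated with a standard multi-teacher objective that averages KL terms against each language-branch teacher. The single-logit formulation below simplifies the MRC-specific span-prediction heads and is an assumption, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list, T=2.0):
    """KL distillation from several language-branch teachers to one student."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        teacher_probs = F.softmax(teacher_logits / T, dim=-1)
        # KL(teacher || student), scaled by T^2 as in standard distillation
        loss = loss + F.kl_div(student_log_probs, teacher_probs,
                               reduction="batchmean") * (T * T)
    return loss / len(teacher_logits_list)

# Toy usage: 4 examples, 10 answer-position logits, 3 language-branch teachers.
student = torch.randn(4, 10)
teachers = [torch.randn(4, 10) for _ in range(3)]
print(multi_teacher_distillation_loss(student, teachers))
```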
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems from WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.