Multilingual Augmenter: The Model Chooses
- URL: http://arxiv.org/abs/2102.09708v1
- Date: Fri, 19 Feb 2021 02:08:26 GMT
- Title: Multilingual Augmenter: The Model Chooses
- Authors: Matthew Ciolino, David Noever, Josh Kalin
- Abstract summary: We take an English sentence, translate it to another language, and then translate it back to English.
In this paper, we look at the effect of back translation through 108 different languages on various metrics and text embeddings.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Processing (NLP) relies heavily on training data. As Transformers have grown larger, they have required massive amounts of training data. To satisfy this requirement, text augmentation should be considered as a way to expand an existing dataset and help models generalize. One such augmentation is translation augmentation: we take an English sentence, translate it to another language, and then translate it back to English. In this paper, we examine the effect of back translation through 108 different languages on various metrics and text embeddings.
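This round trip is straightforward to reproduce with open-source translation models. Below is a minimal sketch, assuming the Hugging Face transformers library and the Helsinki-NLP MarianMT checkpoints; the paper's exact translation stack is not specified here:

```python
# Minimal back-translation sketch: English -> pivot language -> English.
# Assumes the Hugging Face transformers library and MarianMT checkpoints;
# this illustrates the idea rather than reproducing the paper's pipeline.
from transformers import MarianMTModel, MarianTokenizer

def translate(text: str, model_name: str) -> str:
    """Translate a sentence with a pretrained MarianMT checkpoint."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    ids = model.generate(**tokenizer(text, return_tensors="pt"))
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]

def back_translate(sentence: str, pivot: str = "fr") -> str:
    """Round-trip an English sentence through a pivot language."""
    pivot_text = translate(sentence, f"Helsinki-NLP/opus-mt-en-{pivot}")
    return translate(pivot_text, f"Helsinki-NLP/opus-mt-{pivot}-en")

print(back_translate("The quick brown fox jumps over the lazy dog."))
```

Swapping the pivot code changes the language; the paper sweeps this choice across 108 languages and measures how far each round trip drifts from the original sentence.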
Related papers
- Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks, but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest and then train on the resulting multilingual data, an approach called translate-training.
In this paper, we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
arXiv Detail & Related papers (2024-01-15T16:39:10Z)
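The question-alignment finetuning stage can be pictured as building X-to-English translation pairs from parallel question data. A hypothetical sketch; the field names and prompt template are assumptions, not the paper's format:

```python
# Hypothetical sketch of question-alignment finetuning data: train the model
# to translate reasoning questions from language X into English.
# Field names and the prompt template are illustrative assumptions.
parallel_questions = [
    {"x": "Wie viele Primzahlen gibt es zwischen 10 und 30?",
     "en": "How many prime numbers are there between 10 and 30?"},
]

def to_finetune_example(pair: dict) -> dict:
    """Build a (prompt, target) pair for supervised finetuning."""
    return {
        "prompt": "Translate the question into English:\n" + pair["x"],
        "target": pair["en"],
    }

for example in map(to_finetune_example, parallel_questions):
    print(example["prompt"], "->", example["target"])
```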
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework for data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- XNLI 2.0: Improving XNLI dataset and performance on Cross Lingual Understanding (XLU) [0.0]
We focus on improving the original XNLI dataset by re-translating the MNLI dataset into all of the 14 different languages present in XNLI.
We also perform experiments by training models in all 15 languages and analyzing their performance on the task of natural language inference.
arXiv Detail & Related papers (2023-01-16T17:24:57Z)
- Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages [26.822210580244885]
We translate a closed text that is known in advance and available in many languages into a new, severely low-resource language.
We compare the portion-based approach, which optimizes coherence of the text locally, with the random sampling approach, which increases coverage of the text globally.
We propose an algorithm for human and machine to work together seamlessly to translate a closed text into a severely low-resource language.
arXiv Detail & Related papers (2021-08-16T14:49:50Z)
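The contrast between the two selection strategies is easy to state concretely. A toy sketch, with sentence selection reduced to index selection (an illustration, not the paper's algorithm):

```python
# Toy sketch contrasting the two active-learning selection strategies:
# portion-based selection keeps a contiguous block (locally coherent),
# random sampling spreads picks across the whole text (global coverage).
import random

def portion_based(sentences: list[str], budget: int, start: int = 0) -> list[str]:
    """Select a contiguous portion of the text, preserving local coherence."""
    return sentences[start:start + budget]

def random_sampling(sentences: list[str], budget: int, seed: int = 0) -> list[str]:
    """Sample sentences uniformly across the text, maximizing coverage."""
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(sentences)), budget))
    return [sentences[i] for i in picked]

text = [f"sentence {i}" for i in range(100)]
print(portion_based(text, 5))    # five consecutive sentences
print(random_sampling(text, 5))  # five sentences scattered through the text
```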
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open-source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
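The released M2M-100 checkpoints make direct non-English translation easy to try. A minimal sketch, assuming the facebook/m2m100_418M checkpoint as ported to the Hugging Face transformers library:

```python
# Direct French -> German translation with M2M-100, with no English pivot.
# Uses the facebook/m2m100_418M checkpoint from Hugging Face.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"  # tell the tokenizer the source language
encoded = tokenizer("La vie est comme une boîte de chocolat.", return_tensors="pt")

# Force the decoder to start with the German language token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```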
- Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation [77.69102711230248]
We propose the use of a multilingual transformer model that we pre-train on English tweets, applying data augmentation via automatic translation to adapt the model to non-English languages.
Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an effective way to improve transformer results over small corpora of tweets in a non-English language.
arXiv Detail & Related papers (2020-10-07T15:44:55Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate the source transcript and the target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
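One way to picture the single-decoder idea is target-side concatenation: the decoder emits the transcript first, then the translation. A hypothetical sketch of how such training targets might be formatted; the actual COSTT architecture is more involved than this:

```python
# Hypothetical sketch of COSTT-style targets: one decoder output sequence
# carries the source transcript, a separator, and the target translation.
# The token names (<sep>, <eos>) are illustrative assumptions.
def make_target(transcript: str, translation: str) -> str:
    """Concatenate transcript and translation into one decoder target."""
    return f"{transcript} <sep> {translation} <eos>"

target = make_target(
    "how are you today",      # source-language transcript (from audio)
    "wie geht es dir heute",  # target-language translation
)
print(target)  # "how are you today <sep> wie geht es dir heute <eos>"
```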
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word order information for natural language processing tasks, generating fixed position indices for input sequences.
Because word order diverges across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this divergence.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
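For context on the fixed position indices mentioned above, this is the standard sinusoidal position encoding from the original Transformer; the paper's cross-lingual position representations modify this baseline in ways not reproduced here:

```python
# Standard sinusoidal position encoding (Vaswani et al., 2017), the fixed
# PE baseline that cross-lingual position representations build on.
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of fixed position encodings."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # broadcast
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=128, d_model=512)
print(pe.shape)  # (128, 512)
```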
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.