Lite Training Strategies for Portuguese-English and English-Portuguese Translation
- URL: http://arxiv.org/abs/2008.08769v1
- Date: Thu, 20 Aug 2020 04:31:03 GMT
- Title: Lite Training Strategies for Portuguese-English and English-Portuguese Translation
- Authors: Alexandre Lopes, Rodrigo Nogueira, Roberto Lotufo, Helio Pedrini
- Abstract summary: We investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks.
We propose an adaptation of the English tokenizer to represent Portuguese characters, such as diaeresis, acute and grave accents.
Our results show that our models perform competitively with state-of-the-art models while being trained on modest hardware.
- Score: 67.4894325619275
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite the widespread adoption of deep learning for machine translation, it
is still expensive to develop high-quality translation models. In this work, we
investigate the use of pre-trained models, such as T5, for Portuguese-English
and English-Portuguese translation tasks using low-cost hardware. We explore
the use of Portuguese and English pre-trained language models and propose an
adaptation of the English tokenizer to represent Portuguese characters, such as
diaeresis, acute and grave accents. We compare our models to the Google
Translate API and MarianMT on a subset of the ParaCrawl dataset, as well as to
the winning submission to the WMT19 Biomedical Translation Shared Task. We also
describe our submission to the WMT20 Biomedical Translation Shared Task. Our
results show that our models perform competitively with state-of-the-art
models while being trained on modest hardware (a single 8GB gaming GPU for nine
days). Our data, models and code are available at
https://github.com/unicamp-dl/Lite-T5-Translation.
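To make the tokenizer adaptation concrete, the sketch below shows one way to extend the English T5 vocabulary with Portuguese accented characters using the Hugging Face transformers library. It is a minimal illustration, not the authors' exact procedure: the t5-base checkpoint and the character list are assumptions, and the released code at the repository above is the authoritative reference.

```python
# Minimal sketch (not the paper's exact method): extend the English T5
# tokenizer so Portuguese accented characters no longer map to <unk>,
# then resize the embedding matrix so the new tokens can be learned
# during fine-tuning on parallel data.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")   # assumed base checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Characters common in Portuguese; the actual set used in the paper may differ.
portuguese_chars = list("áàâãéêíóôõúüçÁÀÂÃÉÊÍÓÔÕÚÜÇ")
missing = [c for c in portuguese_chars
           if tokenizer.convert_tokens_to_ids(c) == tokenizer.unk_token_id]

num_added = tokenizer.add_tokens(missing)
if num_added > 0:
    # Newly added rows of the embedding matrix start randomly initialized
    # and are trained along with the rest of the model.
    model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} character-level tokens to the vocabulary.")
```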
Related papers
- Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2M short stories for 3-4-year-old children, from English to Arabic using the open NLLB-3B MT model.
To rectify issues in the translated data, we pre-train the models with a small dataset of synthesized high-quality Arabic stories.
arXiv Detail & Related papers (2024-05-23T07:53:04Z)
- TIM: Teaching Large Language Models to Translate with Comparison [78.66926087162672]
We propose a novel framework that uses examples in comparison to teach LLMs to translate.
Our approach involves presenting the model with examples of correct and incorrect translations and using a preference loss to guide the model's learning.
Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations.
arXiv Detail & Related papers (2023-07-10T08:15:40Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find that finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- "Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks [20.837515947519524]
First sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for seed parallel data used to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia.
In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English, in which the Arabic training data is a wikily translation of the English captioning data.
Our captioning results in Arabic are slightly better than those of the supervised model.
arXiv Detail & Related papers (2021-04-16T21:49:12Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language-independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data [4.579262239784748]
We pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese.
We show that our Portuguese pretrained models perform significantly better than the original T5 models (a minimal loading sketch follows this list).
arXiv Detail & Related papers (2020-08-20T18:10:13Z)
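The PTT5 checkpoints in the last entry above are the Portuguese-pretrained T5 models that this line of work builds on. As a rough usage illustration (not the authors' released pipeline), the sketch below loads such a checkpoint and generates with a T5-style task prefix; the model identifier and prefix wording are assumptions, and a checkpoint fine-tuned on parallel data, as described in the main paper, would be needed to obtain actual translations.

```python
# Minimal sketch, assuming a Portuguese-pretrained T5 checkpoint is available
# on the Hugging Face hub under the name below (an assumption, not confirmed
# here). Without fine-tuning on parallel data, the model will not translate;
# this only shows the loading and generation plumbing.
from transformers import T5Tokenizer, T5ForConditionalGeneration

checkpoint = "unicamp-dl/ptt5-base-portuguese-vocab"  # assumed model id
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# T5-style task prefix; the exact prefix used during fine-tuning is an assumption.
text = "translate Portuguese to English: O gato dorme no sofá."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```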
The list above is automatically generated from the titles and abstracts of the papers on this site.