From Brazilian Portuguese to European Portuguese
- URL: http://arxiv.org/abs/2408.07457v1
- Date: Wed, 14 Aug 2024 10:58:48 GMT
- Title: From Brazilian Portuguese to European Portuguese
- Authors: João Sanches, Rui Ribeiro, Luísa Coheur
- Abstract summary: Brazilian Portuguese and European Portuguese are two varieties of the same language.
There is a significant disproportion in the availability of resources between the two variants.
This inequity can impact the quality of translation services accessible to European Portuguese speakers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Brazilian Portuguese and European Portuguese are two varieties of the same language and, despite their close similarities, they exhibit several differences. However, there is a significant disproportion in the availability of resources between the two variants, with Brazilian Portuguese having more abundant resources. This inequity can impact the quality of translation services accessible to European Portuguese speakers. To address this issue, we propose the development of a Brazilian Portuguese to European Portuguese translation system, leveraging recent advancements in neural architectures and models. To evaluate the performance of such systems, we manually curated a gold test set comprising 500 sentences across five different topics. Each sentence in the gold test set has two distinct references, facilitating a straightforward evaluation of future translation models. We experimented with various models by fine-tuning existing Large Language Models using parallel data extracted from movie subtitles and TED Talks transcripts in both Brazilian and European Portuguese. Our evaluation involved the use of conventional automatic metrics as well as a human evaluation. In addition, all models were compared against ChatGPT 3.5 Turbo, which currently yields the best results.
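The pipeline the abstract describes — fine-tuning a pre-trained sequence-to-sequence model on parallel data from subtitles and TED Talks, then scoring output against a gold set with two references per sentence — can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the checkpoint name, example sentence pair, and hyperparameters are placeholder assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): fine-tune a seq2seq
# model on BR->EU Portuguese parallel pairs, then score against a
# two-reference gold set with BLEU.
import sacrebleu
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/mt5-small"  # placeholder; the paper fine-tunes larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Parallel pairs, e.g. aligned from movie subtitles or TED transcripts.
pairs = [{"src": "O ônibus está atrasado.", "tgt": "O autocarro está atrasado."}]

def tokenize(batch):
    enc = tokenizer(batch["src"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["tgt"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

train = Dataset.from_list(pairs).map(tokenize, batched=True,
                                     remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="br2pt", num_train_epochs=3,
                                  per_device_train_batch_size=16),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Gold-set evaluation: each source sentence has TWO references, so BLEU
# receives two parallel reference streams.
hyps = ["O autocarro está atrasado."]
refs_a = ["O autocarro está atrasado."]  # first human reference
refs_b = ["O autocarro vem atrasado."]   # second human reference
print(sacrebleu.corpus_bleu(hyps, [refs_a, refs_b]).score)
```

Because the gold test set provides two distinct references per sentence, sacrebleu is given two parallel reference streams, matching the straightforward multi-reference evaluation the abstract mentions.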
Related papers
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese
We contribute a collection of datasets for an array of language processing tasks and a collection of fine-tuned neural language models on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z) - Gl\'orIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that Gl'orIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z) - BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual
Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z) - Transformers and Transfer Learning for Improving Portuguese Semantic
Role Labeling [2.9005223064604078]
For low resource languages, and in particular for Portuguese, currently available SRL models are hindered by scarce training data.
We explore a model architecture consisting only of a pre-trained BERT-based model, a linear layer, a softmax, and Viterbi decoding (a minimal sketch of this setup appears after this list).
arXiv Detail & Related papers (2021-01-04T19:56:01Z) - Translating Similar Languages: Role of Mutual Intelligibility in
Multilingual Transformers [8.9379057739817]
We investigate approaches to translate between similar languages under low resource conditions.
We submit Transformer-based bilingual and multilingual systems for all language pairs.
Our Spanish-Catalan model achieves the best performance of the five language pairs.
arXiv Detail & Related papers (2020-11-10T10:58:38Z) - Pre-training Multilingual Neural Machine Translation by Leveraging
Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z) - PTT5: Pretraining and validating the T5 model on Brazilian Portuguese
data [4.579262239784748]
We pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese.
We show that our Portuguese pretrained models have significantly better performance over the original T5 models.
arXiv Detail & Related papers (2020-08-20T18:10:13Z) - Knowledge Distillation for Multilingual Unsupervised Neural Machine
Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, existing UNMT systems can only translate between a single language pair and cannot produce results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.