From Brazilian Portuguese to European Portuguese
- URL: http://arxiv.org/abs/2408.07457v1
- Date: Wed, 14 Aug 2024 10:58:48 GMT
- Title: From Brazilian Portuguese to European Portuguese
- Authors: João Sanches, Rui Ribeiro, Luísa Coheur
- Abstract summary: Brazilian Portuguese and European Portuguese are two varieties of the same language.
There is a significant disproportion in the availability of resources between the two varieties.
This inequity can impact the quality of translation services accessible to European Portuguese speakers.
- Score: 2.048226951354646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Brazilian Portuguese and European Portuguese are two varieties of the same language and, despite their close similarities, they exhibit several differences. However, there is a significant disproportion in the availability of resources between the two variants, with Brazilian Portuguese having more abundant resources. This inequity can impact the quality of translation services accessible to European Portuguese speakers. To address this issue, we propose the development of a Brazilian Portuguese to European Portuguese translation system, leveraging recent advancements in neural architectures and models. To evaluate the performance of such systems, we manually curated a gold test set comprising 500 sentences across five different topics. Each sentence in the gold test set has two distinct references, facilitating a straightforward evaluation of future translation models. We experimented with various models by fine-tuning existing Large Language Models using parallel data extracted from movie subtitles and TED Talks transcripts in both Brazilian and European Portuguese. Our evaluation involved the use of conventional automatic metrics as well as a human evaluation. In addition, all models were compared against ChatGPT 3.5 Turbo, which currently yields the best results.
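Because every sentence in the gold test set carries two distinct references, automatic scoring is straightforward. Below is a minimal sketch using the sacrebleu library, which accepts multiple parallel reference streams; the sentences are invented placeholders, not data from the paper.

```python
# Minimal sketch: scoring system output against a dual-reference test set
# with sacrebleu. The sentences below are placeholders, not the paper's data.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["O autocarro chegou atrasado esta manhã."]    # system output
references_1 = ["O autocarro chegou atrasado esta manhã."]  # first reference
references_2 = ["Esta manhã, o autocarro chegou tarde."]    # second reference

bleu, chrf = BLEU(), CHRF()
# sacrebleu takes one hypothesis list plus any number of parallel
# reference streams, so both references count toward the score.
print(bleu.corpus_score(hypotheses, [references_1, references_2]))
print(chrf.corpus_score(hypotheses, [references_1, references_2]))
```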
Related papers
- Enhancing Portuguese Variety Identification with Cross-Domain Approaches [2.31011809034817]
We develop a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese.
Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages.
arXiv Detail & Related papers (2025-02-20T09:31:48Z)
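The summary above does not describe the LVI architecture; as a toy illustration of the task only, here is a character n-gram baseline that separates the two varieties on a few invented sentences.

```python
# Toy sketch of the language variety identification (LVI) task:
# binary classification of pt-BR vs. pt-PT. The training sentences are
# invented examples; the paper's actual model is not described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Vou pegar o ônibus para o trabalho.",       # pt-BR
    "Vou apanhar o autocarro para o trabalho.",  # pt-PT
    "O time jogou muito bem ontem.",             # pt-BR
    "A equipa jogou muito bem ontem.",           # pt-PT
]
train_labels = ["pt-BR", "pt-PT", "pt-BR", "pt-PT"]

# Character n-grams capture variety-specific spelling and lexical cues.
lvi = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(),
)
lvi.fit(train_texts, train_labels)
print(lvi.predict(["Apanhei o comboio para Lisboa."]))  # expected: pt-PT
```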
- Tradutor: Building a Variety Specific Translation Model [3.976102757693942]
We introduce the first open-source translation model specifically tailored for European Portuguese.
Our best model surpasses existing open-source translation systems for Portuguese.
By making our dataset, models, and code publicly available, we aim to support and encourage further research.
arXiv Detail & Related papers (2025-02-20T09:20:59Z)
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models.
We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance.
We see strong cross-lingual performance with English-based retrievers trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
- Tucano: Advancing Neural Text Generation for Portuguese [0.0]
This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese.
In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens.
Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks.
arXiv Detail & Related papers (2024-11-12T15:06:06Z) - LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
- LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
We propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of large language models (LLMs).
The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately.
The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z)
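The summary above only states that the model generates contextual understanding for the source and target languages separately; the following is a rough sketch of such a loop under assumed prompts, with llm standing in for any chat-model call. The prompts, feedback signal, and stopping rule are guesses, not the paper's method.

```python
# Rough sketch of an iterative bilingual-understanding translation loop.
# The prompts and refinement rule below are assumptions; the paper's
# actual IBUT procedure is not specified in the summary above.
from typing import Callable

def ibut_translate(llm: Callable[[str], str], source: str,
                   src_lang: str, tgt_lang: str, rounds: int = 2) -> str:
    # Generate contextual understanding separately in each language.
    src_notes = llm(f"Explain, in {src_lang}, the meaning and context of: {source}")
    tgt_notes = llm(f"Explain, in {tgt_lang}, the meaning and context of: {source}")
    draft = llm(
        f"Translate the source into {tgt_lang}, guided by both explanations.\n"
        f"Source: {source}\nSource-side notes: {src_notes}\n"
        f"Target-side notes: {tgt_notes}"
    )
    # Iteratively critique and revise the draft.
    for _ in range(rounds - 1):
        critique = llm(f"In {tgt_lang}, point out errors in this translation "
                       f"of '{source}': {draft}")
        draft = llm(f"Revise the translation using the critique.\n"
                    f"Draft: {draft}\nCritique: {critique}")
    return draft
```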
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese [1.2779732438508473]
We contribute a collection of datasets for an array of language processing tasks and a collection of fine-tuned neural language models on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark provides a basis for research on Portuguese that can be improved upon in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
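As a small illustration of the setup above (English demonstrations followed by a non-English test sample), here is one way such a prompt can be assembled; the template and examples are invented.

```python
# Sketch of few-shot in-context classification: English demonstrations,
# then a non-English test sample. The template and data are invented.
english_examples = [
    ("The movie was fantastic!", "positive"),
    ("I hated every minute of it.", "negative"),
]
test_sample = "O filme foi maravilhoso!"  # Portuguese test input

prompt = "Classify the sentiment of each sentence.\n\n"
for text, label in english_examples:
    prompt += f"Sentence: {text}\nSentiment: {label}\n\n"
prompt += f"Sentence: {test_sample}\nSentiment:"

# Feed `prompt` to a GPT- or T5-style model and read off the completion.
print(prompt)
```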
- Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study multilingual language models to understand their capability and adaptability in the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data [4.579262239784748]
We pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese.
We show that our Portuguese pretrained models significantly outperform the original T5 models.
arXiv Detail & Related papers (2020-08-20T18:10:13Z)
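A minimal sketch of loading a PTT5 checkpoint with the transformers library follows; the model ID is assumed from the authors' public Hugging Face release and should be verified.

```python
# Minimal sketch: loading a PTT5 checkpoint with Hugging Face transformers.
# The model ID is assumed from the authors' public release.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "unicamp-dl/ptt5-base-portuguese-vocab"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

inputs = tokenizer("Uma frase em português.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```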
This list is automatically generated from the titles and abstracts of the papers in this site.