Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
- URL: http://arxiv.org/abs/2305.06721v2
- Date: Tue, 20 Jun 2023 15:22:58 GMT
- Title: Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
- Authors: João Rodrigues, Luís Gomes, João Silva, António Branco,
Rodrigo Santos, Henrique Lopes Cardoso, Tomás Osório
- Abstract summary: Albertina PT-* is a foundation model that sets a new state of the art for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR).
The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese.
Both Albertina PT-PT and PT-BR versions are distributed free of charge and under the most permissive license possible and can be run on consumer-grade hardware.
- Score: 0.5937476291232802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To advance the neural encoding of Portuguese (PT), and a fortiori the
technological preparation of this language for the digital age, we developed a
Transformer-based foundation model that sets a new state of the art in this
respect for two of its variants, namely European Portuguese from Portugal
(PT-PT) and American Portuguese from Brazil (PT-BR).
To develop this encoder, which we named Albertina PT-*, a strong model was
used as a starting point, DeBERTa, and its pre-training was done over data sets
of Portuguese, namely over data sets we gathered for PT-PT and PT-BR, and over
the brWaC corpus for PT-BR. The performance of Albertina and competing models
was assessed by evaluating them on prominent downstream language processing
tasks adapted for Portuguese.
Both Albertina PT-PT and PT-BR versions are distributed free of charge and
under the most permissive license possible and can be run on consumer-grade
hardware, thus seeking to contribute to the advancement of research and
innovation in language technology for Portuguese.
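The encoder pre-training described above follows the standard masked-language-modeling recipe of BERT-style models such as DeBERTa: a fraction of the input tokens is hidden and the model learns to reconstruct them. A minimal sketch of that objective, with a toy masking function; the masking rate and helper names are illustrative assumptions, not the paper's exact recipe.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    """Toy sketch of the masked-language-modeling objective used to
    pre-train BERT-style encoders such as DeBERTa. Illustrative only;
    the actual Albertina pre-training recipe may differ."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hide the token from the encoder
            labels.append(tok)         # ...but keep it as the prediction target
        else:
            masked.append(tok)
            labels.append(None)        # no loss is computed at this position
    return masked, labels

masked, labels = mask_tokens("a língua portuguesa merece bons modelos".split())
```

The encoder only sees `masked`; the loss is computed against `labels` at the hidden positions, which is what makes the objective self-supervised and applicable to raw Portuguese text.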
Related papers
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese [1.2779732438508473]
We contribute a collection of datasets for an array of language processing tasks and a collection of neural language models fine-tuned on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z)
- Advancing Generative AI for Portuguese with Open Decoder Gervásio PT* [0.38570000254272757]
We present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in neural decoding of Portuguese.
All versions of Gervásio are open source and distributed free of charge under an open license, for both research and commercial use.
arXiv Detail & Related papers (2024-02-29T00:19:13Z)
- GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z)
- On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation [63.914940899327966]
Pre-training (PT) and back-translation (BT) are two simple and powerful methods to utilize monolingual data.
This paper takes the first step to investigate the complementarity between PT and BT.
We establish state-of-the-art performances on the WMT16 English-Romanian and English-Russian benchmarks.
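Back-translation (BT), as studied in the paper above, turns monolingual target-side text into synthetic parallel data by translating it back into the source language with a reverse-direction model. A minimal sketch, with a stand-in dictionary playing the role of the reverse model; the function and data names are hypothetical, not from the paper.

```python
def back_translate(mono_target, reverse_model):
    """Back-translation: convert monolingual *target*-language sentences
    into synthetic (source, target) training pairs by translating them
    back into the source language with a reverse-direction model.
    `reverse_model` is any callable: target_sentence -> source_sentence."""
    return [(reverse_model(t), t) for t in mono_target]

# Stand-in for a real Romanian->English model (hypothetical; a real
# system would be a trained NMT model, not a dictionary lookup).
toy_ro_en = {"bună ziua": "good day", "mulțumesc": "thank you"}
pairs = back_translate(list(toy_ro_en), toy_ro_en.get)
# The synthetic pairs can then be mixed into the parallel training data,
# which is how BT complements pre-training on monolingual corpora.
```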
arXiv Detail & Related papers (2021-10-05T04:01:36Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from a high-resource language to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language-independent multilingual sentence representation to generalize easily to a new language.
Blindly decoding from Portuguese with a base system covering several Romance languages, we achieve 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
- Transformers and Transfer Learning for Improving Portuguese Semantic Role Labeling [2.9005223064604078]
For low-resource languages, and in particular for Portuguese, currently available SRL models are hindered by scarce training data.
We explore a model architecture with only a pre-trained BERT-based model, a linear layer, softmax and Viterbi decoding.
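The Viterbi step in the architecture above decodes the per-token softmax scores into the highest-scoring tag sequence that respects label-transition constraints. A minimal sketch of Viterbi decoding over log-scores; the BIO label set, scores, and the rule "I may not follow O" are illustrative assumptions, not the paper's configuration.

```python
import math

def viterbi(emissions, transition):
    """Viterbi decoding over per-token label scores (log-probabilities).
    emissions: list of {label: score} dicts, one per token.
    transition: dict (prev_label, label) -> score; missing pairs are
    treated as forbidden (-inf), which enforces valid tag sequences."""
    labels = list(emissions[0])
    best = {l: emissions[0][l] for l in labels}  # best score ending in l at token 0
    back = []                                    # backpointers, one dict per step
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for l in labels:
            cand = {p: best[p] + transition.get((p, l), -math.inf) for p in labels}
            p = max(cand, key=cand.get)
            new_best[l] = cand[p] + em[l]
            ptr[l] = p
        best, back = new_best, back + [ptr]
    # Trace the best path backwards from the highest-scoring final label.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy BIO example: ("O", "I") is absent from the transitions, so an
# I tag can never follow O even if its softmax score is high.
emissions = [{"B": -0.1, "I": -2.0, "O": -3.0},
             {"B": -2.0, "I": -0.5, "O": -1.0},
             {"B": -3.0, "I": -0.2, "O": -2.0}]
trans = {("B", "B"): -1, ("B", "I"): 0, ("B", "O"): 0,
         ("I", "B"): -1, ("I", "I"): 0, ("I", "O"): 0,
         ("O", "B"): 0, ("O", "O"): 0}
tags = viterbi(emissions, trans)
```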
arXiv Detail & Related papers (2021-01-04T19:56:01Z)
- PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data [4.579262239784748]
We pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese.
We show that our Portuguese pretrained models significantly outperform the original T5 models.
arXiv Detail & Related papers (2020-08-20T18:10:13Z)
- Lite Training Strategies for Portuguese-English and English-Portuguese Translation [67.4894325619275]
We investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks.
We propose an adaptation of the English tokenizer to represent Portuguese characters, such as diaeresis, acute and grave accents.
Our results show that our models perform competitively with state-of-the-art models while being trained on modest hardware.
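The tokenizer adaptation described above amounts to making sure Portuguese diacritics are representable by a vocabulary built for English. A minimal character-level sketch of the idea, assuming a toy vocabulary; the paper adapts T5's subword tokenizer, which is more involved than this.

```python
# Portuguese accented characters an English tokenizer typically lacks.
PT_ACCENTED = set("áàâãçéêíóôõú")

def extend_vocab(vocab, text):
    """Add to `vocab` any Portuguese accented character found in `text`
    that it does not already cover; return the added characters.
    (Toy character-level illustration, not the paper's actual code.)"""
    missing = sorted(ch for ch in set(text)
                     if ch in PT_ACCENTED and ch not in vocab)
    vocab.update(missing)
    return missing

english_vocab = set("abcdefghijklmnopqrstuvwxyz ")
added = extend_vocab(english_vocab, "tradução automática de português")
```

Without this step, accented characters would fall back to unknown tokens and information such as the distinction between "e" and "é" would be lost at the input layer.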
arXiv Detail & Related papers (2020-08-20T04:31:03Z)
- MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer [136.09386219006123]
We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages.
MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and causal commonsense reasoning.
arXiv Detail & Related papers (2020-04-30T18:54:43Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder maps the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.