Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
- URL: http://arxiv.org/abs/2305.06721v2
- Date: Tue, 20 Jun 2023 15:22:58 GMT
- Title: Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
- Authors: João Rodrigues, Luís Gomes, João Silva, António Branco,
Rodrigo Santos, Henrique Lopes Cardoso, Tomás Osório
- Abstract summary: Albertina PT-* is a foundation model that sets a new state of the art for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR).
The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese.
Both Albertina PT-PT and PT-BR versions are distributed free of charge and under the most permissive license possible and can be run on consumer-grade hardware.
- Score: 0.5937476291232802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To advance the neural encoding of Portuguese (PT), and a fortiori the
technological preparation of this language for the digital age, we developed a
Transformer-based foundation model that sets a new state of the art in this
respect for two of its variants, namely European Portuguese from Portugal
(PT-PT) and American Portuguese from Brazil (PT-BR).
To develop this encoder, which we named Albertina PT-*, a strong model was
used as a starting point, DeBERTa, and its pre-training was done over data sets
of Portuguese, namely over data sets we gathered for PT-PT and PT-BR, and over
the brWaC corpus for PT-BR. The performance of Albertina and competing models
was assessed by evaluating them on prominent downstream language processing
tasks adapted for Portuguese.
Both Albertina PT-PT and PT-BR versions are distributed free of charge and
under the most permissive license possible and can be run on consumer-grade
hardware, thus seeking to contribute to the advancement of research and
innovation in language technology for Portuguese.
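The encoder pre-training described above follows the standard masked-language-modeling recipe of BERT-style models such as DeBERTa: a fraction of the input tokens is hidden and the model learns to reconstruct them. A minimal sketch of that objective, with a toy masking function; the masking rate and helper names are illustrative assumptions, not the paper's exact recipe.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    """Toy sketch of the masked-language-modeling objective used to
    pre-train BERT-style encoders such as DeBERTa. Illustrative only;
    the actual Albertina pre-training recipe may differ."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hide the token from the encoder
            labels.append(tok)         # ...but keep it as the prediction target
        else:
            masked.append(tok)
            labels.append(None)        # no loss is computed at this position
    return masked, labels

masked, labels = mask_tokens("a língua portuguesa merece bons modelos".split())
```

The encoder only sees `masked`; the loss is computed against `labels` at the hidden positions, which is what makes the objective self-supervised and applicable to raw Portuguese text.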
Related papers
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese [1.2779732438508473]
We contribute a collection of datasets for an array of language processing tasks and a collection of neural language models fine-tuned on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z)
- Advancing Generative AI for Portuguese with Open Decoder Gervásio PT* [0.38570000254272757]
We present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in neural decoding of Portuguese.
All versions of Gervásio are open source and distributed free of charge under an open license, for both research and commercial use.
arXiv Detail & Related papers (2024-02-29T00:19:13Z)
- GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z)
- On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation [63.914940899327966]
Pre-training (PT) and back-translation (BT) are two simple and powerful methods to utilize monolingual data.
This paper takes the first step to investigate the complementarity between PT and BT.
We establish state-of-the-art performances on the WMT16 English-Romanian and English-Russian benchmarks.
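Back-translation (BT), as studied in the paper above, turns monolingual target-side text into synthetic parallel data by translating it back into the source language with a reverse-direction model. A minimal sketch, with a stand-in dictionary playing the role of the reverse model; the function and data names are hypothetical, not from the paper.

```python
def back_translate(mono_target, reverse_model):
    """Back-translation: convert monolingual *target*-language sentences
    into synthetic (source, target) training pairs by translating them
    back into the source language with a reverse-direction model.
    `reverse_model` is any callable: target_sentence -> source_sentence."""
    return [(reverse_model(t), t) for t in mono_target]

# Stand-in for a real Romanian->English model (hypothetical; a real
# system would be a trained NMT model, not a dictionary lookup).
toy_ro_en = {"bună ziua": "good day", "mulțumesc": "thank you"}
pairs = back_translate(list(toy_ro_en), toy_ro_en.get)
# The synthetic pairs can then be mixed into the parallel training data,
# which is how BT complements pre-training on monolingual corpora.
```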
arXiv Detail & Related papers (2021-10-05T04:01:36Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from a high-resource language to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language-independent multilingual sentence representation to generalize easily to a new language.
Blindly decoding from Portuguese with a base system covering several Romance languages, we achieve 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
- Transformers and Transfer Learning for Improving Portuguese Semantic Role Labeling [2.9005223064604078]
For low-resource languages, and in particular for Portuguese, currently available SRL models are hindered by scarce training data.
We explore a model architecture with only a pre-trained BERT-based model, a linear layer, softmax and Viterbi decoding.
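The Viterbi step in the architecture above decodes the per-token softmax scores into the highest-scoring tag sequence that respects label-transition constraints. A minimal sketch of Viterbi decoding over log-scores; the BIO label set, scores, and the rule "I may not follow O" are illustrative assumptions, not the paper's configuration.

```python
import math

def viterbi(emissions, transition):
    """Viterbi decoding over per-token label scores (log-probabilities).
    emissions: list of {label: score} dicts, one per token.
    transition: dict (prev_label, label) -> score; missing pairs are
    treated as forbidden (-inf), which enforces valid tag sequences."""
    labels = list(emissions[0])
    best = {l: emissions[0][l] for l in labels}  # best score ending in l at token 0
    back = []                                    # backpointers, one dict per step
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for l in labels:
            cand = {p: best[p] + transition.get((p, l), -math.inf) for p in labels}
            p = max(cand, key=cand.get)
            new_best[l] = cand[p] + em[l]
            ptr[l] = p
        best, back = new_best, back + [ptr]
    # Trace the best path backwards from the highest-scoring final label.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy BIO example: ("O", "I") is absent from the transitions, so an
# I tag can never follow O even if its softmax score is high.
emissions = [{"B": -0.1, "I": -2.0, "O": -3.0},
             {"B": -2.0, "I": -0.5, "O": -1.0},
             {"B": -3.0, "I": -0.2, "O": -2.0}]
trans = {("B", "B"): -1, ("B", "I"): 0, ("B", "O"): 0,
         ("I", "B"): -1, ("I", "I"): 0, ("I", "O"): 0,
         ("O", "B"): 0, ("O", "O"): 0}
tags = viterbi(emissions, trans)
```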
arXiv Detail & Related papers (2021-01-04T19:56:01Z)
- PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data [4.579262239784748]
We pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese.
We show that our Portuguese pretrained models significantly outperform the original T5 models.
arXiv Detail & Related papers (2020-08-20T18:10:13Z)
- Lite Training Strategies for Portuguese-English and English-Portuguese Translation [67.4894325619275]
We investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks.
We propose an adaptation of the English tokenizer to represent Portuguese characters, such as diaeresis, acute and grave accents.
Our results show that our models perform competitively with state-of-the-art models while being trained on modest hardware.
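The tokenizer adaptation described above amounts to making sure Portuguese diacritics are representable by a vocabulary built for English. A minimal character-level sketch of the idea, assuming a toy vocabulary; the paper adapts T5's subword tokenizer, which is more involved than this.

```python
# Portuguese accented characters an English tokenizer typically lacks.
PT_ACCENTED = set("áàâãçéêíóôõú")

def extend_vocab(vocab, text):
    """Add to `vocab` any Portuguese accented character found in `text`
    that it does not already cover; return the added characters.
    (Toy character-level illustration, not the paper's actual code.)"""
    missing = sorted(ch for ch in set(text)
                     if ch in PT_ACCENTED and ch not in vocab)
    vocab.update(missing)
    return missing

english_vocab = set("abcdefghijklmnopqrstuvwxyz ")
added = extend_vocab(english_vocab, "tradução automática de português")
```

Without this step, accented characters would fall back to unknown tokens and information such as the distinction between "e" and "é" would be lost at the input layer.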
arXiv Detail & Related papers (2020-08-20T04:31:03Z)
- MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer [136.09386219006123]
We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages.
MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and causal commonsense reasoning.
arXiv Detail & Related papers (2020-04-30T18:54:43Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder maps the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.