Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*
- URL: http://arxiv.org/abs/2402.18766v2
- Date: Tue, 5 Mar 2024 10:44:03 GMT
- Title: Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*
- Authors: Rodrigo Santos, João Silva, Luís Gomes, João Rodrigues, António Branco
- Abstract summary: We present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in neural decoding of Portuguese.
All versions of Gervásio are open source and distributed for free under an open license, including for either research or commercial usage.
- Score: 0.38570000254272757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To advance the neural decoding of Portuguese, in this paper we present a
fully open Transformer-based, instruction-tuned decoder model that sets a new
state of the art in this respect. To develop this decoder, which we named
Gervásio PT*, a strong LLaMA 2 7B model was used as a starting point, and its
further improvement through additional training was done over language
resources that include new instruction data sets of Portuguese prepared for
this purpose, which are also contributed in this paper. All versions of
Gervásio are open source and distributed for free under an open license,
including for either research or commercial usage, and can be run on
consumer-grade hardware, thus seeking to contribute to the advancement of
research and innovation in language technology for Portuguese.
Related papers
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese [1.2779732438508473]
We contribute a collection of datasets for an array of language processing tasks and a collection of fine-tuned neural language models on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z)
- Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family [0.3230831234454389]
This paper contributes foundation encoder models that are open source and openly distributed for free under an open license for any purpose.
We present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters.
arXiv Detail & Related papers (2024-03-04T09:56:47Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- Advancing Neural Encoding of Portuguese with Transformer Albertina PT-* [0.5937476291232802]
Albertina PT-* is a foundation model that sets a new state of the art for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR).
The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese.
Both Albertina PT-PT and PT-BR versions are distributed free of charge and under the most permissive license possible and can be run on consumer-grade hardware.
arXiv Detail & Related papers (2023-05-11T10:56:20Z)
- Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness for neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation.
We show that our proposed approach performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2022-05-23T08:20:41Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
- Transformers and Transfer Learning for Improving Portuguese Semantic Role Labeling [2.9005223064604078]
For low resource languages, and in particular for Portuguese, currently available SRL models are hindered by scarce training data.
We explore a model architecture with only a pre-trained BERT-based model, a linear layer, softmax and Viterbi decoding.
arXiv Detail & Related papers (2021-01-04T19:56:01Z)
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends can give the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)