Fostering the Ecosystem of Open Neural Encoders for Portuguese with
Albertina PT* Family
- URL: http://arxiv.org/abs/2403.01897v2
- Date: Tue, 5 Mar 2024 10:49:17 GMT
- Title: Fostering the Ecosystem of Open Neural Encoders for Portuguese with
Albertina PT* Family
- Authors: Rodrigo Santos, João Rodrigues, Luís Gomes, João Silva,
António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório, Bernardo
Leite
- Abstract summary: This paper contributes foundation encoder models that are open source and openly distributed for free under an open license for any purpose.
We present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters.
- Score: 0.3230831234454389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To foster the neural encoding of Portuguese, this paper contributes
foundation encoder models that represent an expansion of the still very scarce
ecosystem of large language models specifically developed for this language
that are fully open, in the sense that they are open source and openly
distributed for free under an open license for any purpose, thus including
research and commercial usage. Like most languages other than English,
Portuguese is low-resourced in terms of these foundational language resources,
with only the inaugural 900-million-parameter Albertina and the
335-million-parameter BERTimbau available. Taking this pair of models as a starting set, we present the
extension of the ecosystem of state-of-the-art open encoders for Portuguese
with a larger, top performance-driven model with 1.5 billion parameters, and a
smaller, efficiency-driven model with 100 million parameters. While achieving
this primary goal, further results that are relevant for this ecosystem were
obtained as well, namely new datasets for Portuguese based on the SuperGLUE
benchmark, which we also distribute openly.
Related papers
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese [1.2779732438508473]
We contribute a collection of datasets for an array of language processing tasks and a collection of fine-tuned neural language models on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z) - Advancing Generative AI for Portuguese with Open Decoder Gervásio PT* [0.38570000254272757]
We present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in neural decoding of Portuguese.
All versions of Gervásio are open source and distributed for free under an open license, including for either research or commercial usage.
arXiv Detail & Related papers (2024-02-29T00:19:13Z) - GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - Conversations in Galician: a Large Language Model for an
Underrepresented Language [2.433983268807517]
This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language.
We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations.
As a demonstration of the dataset utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model.
arXiv Detail & Related papers (2023-11-07T08:52:28Z) - CebuaNER: A New Baseline Cebuano Named Entity Recognition Model [1.5056924758531152]
We introduce CebuaNER, a new baseline model for named entity recognition in the Cebuano language.
To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language.
Our findings show promising results for a new baseline model, achieving over 70% in precision, recall, and F1 across all entity tags.
arXiv Detail & Related papers (2023-10-01T14:09:42Z) - Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z) - Transfer to a Low-Resource Language via Close Relatives: The Case Study
on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z) - BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [264.96498474333697]
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.
We present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers.
BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages.
arXiv Detail & Related papers (2022-11-09T18:48:09Z) - Unsupervised Transfer Learning in Multilingual Neural Machine
Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.