Fostering the Ecosystem of Open Neural Encoders for Portuguese with
Albertina PT* Family
- URL: http://arxiv.org/abs/2403.01897v2
- Date: Tue, 5 Mar 2024 10:49:17 GMT
- Title: Fostering the Ecosystem of Open Neural Encoders for Portuguese with
Albertina PT* Family
- Authors: Rodrigo Santos, João Rodrigues, Luís Gomes, João Silva,
António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório, Bernardo
Leite
- Abstract summary: This paper contributes foundation encoder models that are open source and openly distributed for free under an open license for any purpose.
We present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters.
- Score: 0.3230831234454389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To foster the neural encoding of Portuguese, this paper contributes
foundation encoder models that represent an expansion of the still very scarce
ecosystem of large language models specifically developed for this language
that are fully open, in the sense that they are open source and openly
distributed for free under an open license for any purpose, thus including
research and commercial usage. Like most languages other than English,
Portuguese is low-resourced in terms of these foundational language resources,
with only the inaugural 900-million-parameter Albertina and the
335-million-parameter BERTimbau available. Taking this pair of models as a starting set, we present the
extension of the ecosystem of state-of-the-art open encoders for Portuguese
with a larger, top performance-driven model with 1.5 billion parameters, and a
smaller, efficiency-driven model with 100 million parameters. While achieving
this primary goal, further results that are relevant for this ecosystem were
obtained as well, namely new datasets for Portuguese based on the SuperGLUE
benchmark, which we also distribute openly.
Related papers
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese [1.2779732438508473]
We contribute a collection of datasets for an array of language processing tasks and a collection of fine-tuned neural language models on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z) - Advancing Generative AI for Portuguese with Open Decoder Gervásio PT* [0.38570000254272757]
We present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in neural decoding of Portuguese.
All versions of Gervásio are open source and distributed for free under an open license, including for either research or commercial usage.
arXiv Detail & Related papers (2024-02-29T00:19:13Z) - GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - Conversations in Galician: a Large Language Model for an
Underrepresented Language [2.433983268807517]
This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language.
We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations.
As a demonstration of the dataset utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model.
arXiv Detail & Related papers (2023-11-07T08:52:28Z) - CebuaNER: A New Baseline Cebuano Named Entity Recognition Model [1.5056924758531152]
We introduce CebuaNER, a new baseline model for named entity recognition in the Cebuano language.
To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language.
Our findings show promising results for a new baseline model, achieving over 70% in precision, recall, and F1 across all entity tags.
arXiv Detail & Related papers (2023-10-01T14:09:42Z) - Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z) - Transfer to a Low-Resource Language via Close Relatives: The Case Study
on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z) - BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [264.96498474333697]
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.
We present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers.
BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages.
arXiv Detail & Related papers (2022-11-09T18:48:09Z) - Unsupervised Transfer Learning in Multilingual Neural Machine
Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.