GlórIA - A Generative and Open Large Language Model for Portuguese
- URL: http://arxiv.org/abs/2402.12969v1
- Date: Tue, 20 Feb 2024 12:36:40 GMT
- Title: GlórIA - A Generative and Open Large Language Model for Portuguese
- Authors: Ricardo Lopes and João Magalhães and David Semedo
- Abstract summary: We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Significant strides have been made in natural language tasks, largely
attributed to the emergence of powerful large language models (LLMs). These
models, pre-trained on extensive and diverse corpora, have become increasingly
capable of comprehending the intricacies of language. Despite the abundance of
LLMs for many high-resource languages, the availability of such models remains
limited for European Portuguese. We introduce GlórIA, a robust European
Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive
PT-PT text corpus comprising 35 billion tokens from various sources. We present
our pre-training methodology, followed by an assessment of the model's
effectiveness on multiple downstream tasks. Additionally, to evaluate our
models' language modeling capabilities, we introduce CALAME-PT (Context-Aware
LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot
language-modeling benchmark. Evaluation shows that GlórIA significantly
outperforms existing open PT decoder models in language modeling and that it
can generate sound, knowledge-rich, and coherent PT-PT text. The model also
exhibits strong potential for various downstream tasks.
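The abstract describes CALAME-PT as a zero-shot benchmark in which a model must complete a Portuguese context with its final word. As a hedged illustration only (the real harness, prompts, and metric details are not given here), the evaluation loop can be sketched as exact-match accuracy over generated last words; `predict_last_word` is a hypothetical stand-in for a call to GlórIA or another PT model.

```python
# Hypothetical sketch of a CALAME-PT-style evaluation loop: the model
# receives a context and must generate the final word zero-shot; the
# score is exact-match accuracy against the reference word.
# `predict_last_word` is a toy stand-in, not the paper's actual harness.

def predict_last_word(context: str) -> str:
    # A real harness would query an LLM here; this lookup is illustrative.
    lookup = {"A capital de Portugal é": "Lisboa"}
    return lookup.get(context, "")

def calame_style_accuracy(examples: list[tuple[str, str]]) -> float:
    """Fraction of contexts whose generated last word matches the reference."""
    hits = sum(
        predict_last_word(ctx).strip().lower() == ref.strip().lower()
        for ctx, ref in examples
    )
    return hits / len(examples)

examples = [
    ("A capital de Portugal é", "Lisboa"),
    ("O maior rio que atravessa Lisboa é o", "Tejo"),
]
print(calame_style_accuracy(examples))  # 0.5 with this toy predictor
```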
Related papers
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
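The MoE-CT summary hinges on one mechanism: base-model parameters stay frozen while only the appended MoE module is updated. A minimal sketch of that idea, with hypothetical parameter names and plain-dict "weights" standing in for real tensors (this is not the paper's code):

```python
# Illustrative sketch of the MoE-CT freezing idea: a gradient step touches
# only the appended expert module; the base model's weights are returned
# unchanged. Parameter names and values are hypothetical.

def train_step(base_weights: dict, expert_weights: dict,
               grads: dict, lr: float = 0.1):
    """Apply an SGD step to the expert module only; the base stays frozen."""
    updated_expert = {
        name: w - lr * grads.get(name, 0.0)
        for name, w in expert_weights.items()
    }
    return base_weights, updated_expert  # base passed through untouched

base = {"embed": 1.0, "layer0": 2.0}
expert = {"moe_gate": 0.5, "moe_ffn": -0.3}
grads = {"moe_gate": 1.0, "moe_ffn": -1.0}

new_base, new_expert = train_step(base, expert, grads)
print(new_base)  # unchanged: high-resource performance is safeguarded
```

In a real framework the same effect is achieved by disabling gradients on the base parameters before continued training.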
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
- A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks [30.54635848057259]
This paper conducts a comprehensive evaluation of well-known and high-performing large language models (LLMs).
We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization.
Our study reports automatic evaluation results, accompanied by a detailed analysis.
arXiv Detail & Related papers (2024-05-16T16:56:54Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
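The PolyLM summary describes a curriculum that raises the share of non-English pre-training data from 30% to 60% across training. As a hedged sketch only, assuming a linear ramp between the two stated endpoints (the paper's actual schedule may differ):

```python
# Hedged sketch of PolyLM's curriculum idea: the non-English data share
# rises from 30% at the start of pre-training to 60% at the end.
# The linear interpolation between the endpoints is an assumption.

def non_english_fraction(step: int, total_steps: int,
                         start: float = 0.30, end: float = 0.60) -> float:
    """Non-English data proportion to sample at a given training step."""
    progress = step / total_steps
    return start + (end - start) * progress

print(non_english_fraction(0, 1000))     # 0.3
print(non_english_fraction(1000, 1000))  # 0.6
```

A data loader would use this fraction at each step to decide how often to draw a non-English batch versus an English one.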
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Sabiá: Portuguese Large Language Models [14.801853435122908]
We show that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora.
Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin.
arXiv Detail & Related papers (2023-04-16T20:11:19Z)
- A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Transformers and Transfer Learning for Improving Portuguese Semantic Role Labeling [2.9005223064604078]
For low resource languages, and in particular for Portuguese, currently available SRL models are hindered by scarce training data.
We explore a model architecture with only a pre-trained BERT-based model, a linear layer, softmax and Viterbi decoding.
arXiv Detail & Related papers (2021-01-04T19:56:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.