Introducing Bode: A Fine-Tuned Large Language Model for Portuguese
Prompt-Based Task
- URL: http://arxiv.org/abs/2401.02909v1
- Date: Fri, 5 Jan 2024 17:15:01 GMT
- Title: Introducing Bode: A Fine-Tuned Large Language Model for Portuguese
Prompt-Based Task
- Authors: Gabriel Lino Garcia, Pedro Henrique Paiola, Luis Henrique Morelli,
Giovani Candido, Arnaldo C\^andido J\'unior, Danilo Samuel Jodas, Luis C. S.
Afonso, Ivan Rizzo Guilherme, Bruno Elias Penteado, Jo\~ao Paulo Papa
- Abstract summary: This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode.
We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning.
- Score: 1.158680734110387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly bringing advances to Natural
Language Processing. However, low-resource languages, those lacking extensive
prominence in datasets for various NLP tasks, or where existing datasets are
not as substantial, such as Portuguese, already obtain several benefits from
LLMs, but not to the same extent. LLMs trained on multilingual datasets
normally struggle to respond to prompts in Portuguese satisfactorily,
presenting, for example, code switching in their responses. This work proposes
a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode in two
versions: 7B and 13B. We evaluate the performance of this model in
classification tasks using the zero-shot approach with in-context learning, and
compare it with other LLMs. Our main contribution is to bring an LLM with
satisfactory results in the Portuguese language, as well as to provide a model
that is free for research or commercial purposes.
Related papers
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on the question translation data (i.e. without annotated answers) are able to encourage the alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z) - Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language which is spoken by over 50 million people world wide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z) - Few-Shot Cross-Lingual Transfer for Prompting Large Language Models in
Low-Resource Languages [0.0]
"prompting" is where a user provides a description of a task and some completed examples of the task to a PLM as context before prompting the PLM to perform the task on a new example.
We consider three methods: few-shot prompting (prompt), language-adaptive fine-tuning (LAFT), and neural machine translation (translate)
We find that translate and prompt settings are a compute-efficient and cost-effective method of few-shot prompting for the selected low-resource languages.
arXiv Detail & Related papers (2024-03-09T21:36:13Z) - Gl\'orIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce Gl'orIA, a robust European Portuguese decoder LLM.
To pre-train Gl'orIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that Gl'orIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z) - Zero-Shot Cross-Lingual Reranking with Large Language Models for
Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba)
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Okapi: Instruction-tuned Large Language Models in Multiple Languages
with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large
Language Models and its Methodology [4.396516562723691]
This study constructed a Japanese chat dataset for tuning large language models (LLMs), which consist of about 8.4 million records.
The results suggest that our dataset is possibly beneficial for LLMs.
However, we also revealed some difficulties in constructing LLMs in languages other than English.
arXiv Detail & Related papers (2023-05-22T04:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.