Building a Llama2-finetuned LLM for Odia Language Utilizing Domain
Knowledge Instruction Set
- URL: http://arxiv.org/abs/2312.12624v1
- Date: Tue, 19 Dec 2023 22:01:01 GMT
- Title: Building a Llama2-finetuned LLM for Odia Language Utilizing Domain
Knowledge Instruction Set
- Authors: Guneet Singh Kohli, Shantipriya Parida, Sambit Sekhar, Samirit Saha,
Nipun B Nair, Parul Agarwal, Sonal Khosla, Kusumlata Patiyal, Debasish Dhal
- Abstract summary: Building LLMs for languages other than English is in great demand because existing multilingual LLMs are often unavailable or perform poorly.
This paper presents our approach to i) generating a large Odia instruction set, including domain-knowledge data suitable for LLM fine-tuning, and ii) building a Llama2-finetuned model tailored for enhanced performance in the Odia domain.
- Score: 1.6261478739213642
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Building LLMs for languages other than English is in great demand because
existing multilingual LLMs are either unavailable or perform poorly, for example in
understanding local context. The problem is especially acute for low-resource
languages, which lack the instruction sets needed for fine-tuning. In a multilingual
country like India, there is a need for LLMs supporting Indic languages to provide
generative AI and LLM-based technologies and services to its citizens.
This paper presents our approach to i) generating a large Odia instruction
set, including domain-knowledge data suitable for LLM fine-tuning, and ii)
building a Llama2-finetuned model tailored for enhanced performance in the Odia
domain. The proposed work will help researchers build instruction sets and
LLMs, particularly for Indic languages. We will release the model and
instruction set to the public for research and non-commercial purposes.
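The abstract does not give the fine-tuning recipe, so the following is only a rough, hypothetical sketch of the kind of pipeline it implies: Odia instruction records fed to a Llama2 base model, here via Hugging Face transformers with LoRA adapters from peft. The dataset path, record fields, prompt template, and hyperparameters are assumptions, not details taken from the paper.

```python
# Hypothetical sketch: instruction fine-tuning Llama-2 on an Odia instruction set.
# Dataset path, record fields, and hyperparameters are assumptions, not from the paper.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Train only small LoRA adapters instead of all base-model parameters.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Assumed layout of the instruction set: one JSON line per record with
# "instruction", "input", and "output" fields written in Odia.
data = load_dataset("json", data_files="odia_instructions.jsonl")["train"]

def to_features(ex):
    prompt = (f"### Instruction:\n{ex['instruction']}\n"
              f"### Input:\n{ex['input']}\n"
              f"### Response:\n{ex['output']}")
    return tokenizer(prompt, truncation=True, max_length=1024)

tokenized = data.map(to_features, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-odia", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=tokenized,
    # Causal-LM collator pads batches and copies input_ids to labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The authors may instead use full-parameter fine-tuning, a different prompt template, or other hyperparameters; the sketch only illustrates the instruction-tuning setup the abstract names.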
Related papers
- Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi [0.745652600521932]
We propose a large pre-training dataset for the Indic language Hindi.
The dataset contains 1.28 billion Hindi tokens.
arXiv Detail & Related papers (2024-07-13T11:29:20Z) - Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on question translation data (i.e., without annotated answers) can encourage alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Okapi: Instruction-tuned Large Language Models in Multiple Languages
with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training (a hypothetical sampler sketch appears after this list).
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - InstructAlign: High-and-Low Resource Language Alignment via Continual
Crosslingual Instruction Tuning [66.31509106146605]
Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages.
However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data.
We propose InstructAlign which uses continual crosslingual instruction tuning to enable LLMs to align new unseen languages with previously learned high-resource languages.
arXiv Detail & Related papers (2023-05-23T02:51:34Z) - llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large
Language Models and its Methodology [4.396516562723691]
This study constructed a Japanese chat dataset for tuning large language models (LLMs), consisting of about 8.4 million records.
The results suggest that our dataset is potentially beneficial for LLMs.
However, we also identified some difficulties in constructing LLMs in languages other than English.
arXiv Detail & Related papers (2023-05-22T04:59:33Z)
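As a side note on the PolyLM entry above, its two-stage curriculum (30% non-English data in the first stage, 60% in the final stage) can be pictured with a small, purely hypothetical sampler sketch; the stage boundary, corpora, and batch logic below are placeholders and not taken from the PolyLM paper.

```python
import random

# Hypothetical sketch of a PolyLM-style curriculum: the non-English share of each
# sampled batch is 30% during the first stage of pre-training and 60% afterwards.
# The stage boundary and the corpora are placeholders, not values from the paper.
def non_english_fraction(step: int, total_steps: int, stage_boundary: float = 0.5) -> float:
    return 0.30 if step < stage_boundary * total_steps else 0.60

def sample_batch(english_pool, non_english_pool, batch_size, step, total_steps):
    p = non_english_fraction(step, total_steps)
    return [
        random.choice(non_english_pool) if random.random() < p
        else random.choice(english_pool)
        for _ in range(batch_size)
    ]

# Example: early batches are mostly English; late batches are mostly non-English.
early = sample_batch(["en_doc"] * 10, ["xx_doc"] * 10, batch_size=8, step=100, total_steps=10_000)
late = sample_batch(["en_doc"] * 10, ["xx_doc"] * 10, batch_size=8, step=9_900, total_steps=10_000)
```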