IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning
Datasets for Indian Languages
- URL: http://arxiv.org/abs/2403.06350v1
- Date: Mon, 11 Mar 2024 00:46:56 GMT
- Title: IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning
Datasets for Indian Languages
- Authors: Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar
Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad G, Varun Balan G, Sparsh
Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra
- Abstract summary: This work introduces an expansive suite of resources specifically designed for the development of Indic LLMs.
Our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data.
- For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models.
- Score: 37.79850860981589
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the considerable advancements in English LLMs, the progress in
building comparable models for other languages has been hindered due to the
scarcity of tailored resources. Our work aims to bridge this divide by
introducing an expansive suite of resources specifically designed for the
development of Indic LLMs, covering 22 languages, containing a total of 251B
tokens and 74.8M instruction-response pairs. Recognizing the importance of both
data quality and quantity, our approach combines highly curated manually
verified data, unverified yet valuable data, and synthetic data. We build a
clean, open-source pipeline for curating pre-training data from diverse
sources, including websites, PDFs, and videos, incorporating best practices for
crawling, cleaning, flagging, and deduplication. For instruction fine-tuning,
we amalgamate existing Indic datasets, translate/transliterate English datasets
into Indian languages, and utilize LLaMa2 and Mixtral models to create
conversations grounded in articles from Indian Wikipedia and Wikihow.
Additionally, we address toxicity alignment by generating toxic prompts for
multiple scenarios and then generating non-toxic responses by feeding these toxic
prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and
resources released as a part of this work will not only propel the research and
development of Indic LLMs but also establish an open-source blueprint for
extending such efforts to other languages. The data and other artifacts created
as part of this work are released with permissive licenses.
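
The abstract above mentions deduplication as one of the pre-training curation steps. The following is a minimal sketch of document-level exact deduplication only; the paper's released pipeline also covers crawling, cleaning, flagging, and fuzzy deduplication, and the function and corpus names here are illustrative, not taken from its code.

```python
# Minimal exact-deduplication sketch: hash a normalized form of each document
# and keep only the first occurrence. Names are illustrative assumptions.
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(documents):
    """Keep the first occurrence of each distinct (normalized) document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.md5(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

if __name__ == "__main__":
    corpus = ["नमस्ते   दुनिया", "नमस्ते दुनिया", "வணக்கம் உலகம்"]
    print(len(deduplicate(corpus)))  # -> 2 (the two Hindi copies collapse to one)
```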
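For the instruction-tuning data, the abstract describes using LLaMa2 and Mixtral to create conversations grounded in Indian Wikipedia and Wikihow articles. Below is a rough sketch of how such grounded generation might look with the Hugging Face transformers library; the model id, prompt wording, and decoding settings are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: prompt an instruction-tuned chat model to write a short conversation
# grounded in a given article. Model id and prompt are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any chat model works

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def grounded_conversation(article: str, language: str = "Hindi") -> str:
    prompt = (
        f"Read the following article and write a short conversation in {language} "
        f"between a curious user and a helpful assistant, using only facts "
        f"from the article.\n\nArticle:\n{article}\n\nConversation:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    # Return only the newly generated continuation, without the prompt tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```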
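The toxicity-alignment step pairs toxic prompts with non-toxic responses obtained from an aligned LLaMa2 model. The sketch below only shows the data-construction scaffolding under assumed names: `aligned_chat` stands in for whatever call reaches the aligned model, and the record schema is a guess, not the paper's released format.

```python
# Sketch: collect (toxic prompt, non-toxic response) pairs by querying an
# aligned chat model. The callable, schema, and file name are assumptions.
import json
from typing import Callable, Iterable

def build_safety_pairs(
    toxic_prompts: Iterable[str],
    aligned_chat: Callable[[str], str],
    out_path: str = "toxicity_alignment.jsonl",
) -> None:
    """Write one JSON line per (toxic prompt, non-toxic response) pair."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in toxic_prompts:
            response = aligned_chat(prompt)  # the aligned model is expected to refuse or de-escalate
            f.write(json.dumps({"prompt": prompt, "response": response}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Toy stand-in for an aligned model so the sketch runs end to end.
    build_safety_pairs(
        ["<toxic prompt, scenario A>", "<toxic prompt, scenario B>"],
        aligned_chat=lambda p: "I can't help with that, but here is a safer alternative...",
    )
```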
Related papers
- Pretraining Data and Tokenizer for Indic LLM [1.7729311045335219]
We develop a novel approach to data preparation for developing a multilingual Indic large language model.
Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia.
For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content.
arXiv Detail & Related papers (2024-07-17T11:06:27Z)
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
- Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets [2.8123257987021058]
We focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets.
We compile an Amharic instruction fine-tuning dataset and a fine-tuned LLaMA-2-Amharic model.
The fine-tuned model shows promising results in different NLP tasks.
arXiv Detail & Related papers (2024-02-12T19:25:11Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- Cross-lingual Editing in Multilingual Language Models [1.3062731746155414]
This paper introduces the cross-lingual model editing (XME) paradigm, wherein a fact is edited in one language, and the subsequent update propagation is observed across other languages.
The results reveal notable performance limitations of state-of-the-art METs under the XME setting, mainly when the languages involved belong to two distinct script families.
arXiv Detail & Related papers (2024-01-19T06:54:39Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)