Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi
- URL: http://arxiv.org/abs/2407.09855v1
- Date: Sat, 13 Jul 2024 11:29:20 GMT
- Title: Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi
- Authors: Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, Sambit Sekhar,
- Abstract summary: We are proposing a large pre-train dataset in Hindi useful for the Indic language Hindi.
The dataset contains 1.28 billion Hindi tokens.
- Score: 0.745652600521932
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) demonstrated transformative capabilities in many applications that require automatically generating responses based on human instruction. However, the major challenge for building LLMs, particularly in Indic languages, is the availability of high-quality data for building foundation LLMs. In this paper, we are proposing a large pre-train dataset in Hindi useful for the Indic language Hindi. We have collected the data span across several domains including major dialects in Hindi. The dataset contains 1.28 billion Hindi tokens. We have explained our pipeline including data collection, pre-processing, and availability for LLM pre-training. The proposed approach can be easily extended to other Indic and low-resource languages and will be available freely for LLM pre-training and LLM research purposes.
Related papers
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [26.13077589552484]
Indic-QA is the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families.
We generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance.
We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages.
arXiv Detail & Related papers (2024-07-18T13:57:16Z) - LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback [61.23008372927665]
We introduce xLLMs-100, which scales the multilingual capabilities of LLaMA and BLOOM to 100 languages.
We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks.
arXiv Detail & Related papers (2024-06-03T20:25:12Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised
Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z) - Building a Llama2-finetuned LLM for Odia Language Utilizing Domain
Knowledge Instruction Set [1.6261478739213642]
Building LLMs for languages other than English is in great demand due to the unavailability and performance of multilingual LLMs.
This paper presents our approach of i) generating a large Odia instruction set, including domain knowledge data suitable for LLM fine-tuning, and ii) building a Llama2-finetuned model tailored for enhanced performance in the Odia domain.
arXiv Detail & Related papers (2023-12-19T22:01:01Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Okapi: Instruction-tuned Large Language Models in Multiple Languages
with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large
Language Models and its Methodology [4.396516562723691]
This study constructed a Japanese chat dataset for tuning large language models (LLMs), which consist of about 8.4 million records.
The results suggest that our dataset is possibly beneficial for LLMs.
However, we also revealed some difficulties in constructing LLMs in languages other than English.
arXiv Detail & Related papers (2023-05-22T04:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.