Related papers: Pretraining Data and Tokenizer for Indic LLM

Pretraining Data and Tokenizer for Indic LLM

URL: http://arxiv.org/abs/2407.12481v1
Date: Wed, 17 Jul 2024 11:06:27 GMT
Title: Pretraining Data and Tokenizer for Indic LLM
Authors: Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar,
Abstract summary: We develop a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content.
Score: 1.7729311045335219
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data, optimizing tokenization for our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.

Related papers

BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages [4.279942349440352]
We present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages.<n>We construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages.<n>We analyze how language choice, both in the prompt instructions and document grounding, affects data quality.
arXiv Detail & Related papers (2025-11-13T14:12:44Z)
IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages [0.4194295877935868]
We present IndicSQuAD, a comprehensive multi-lingual extractive QA dataset covering nine major Indic languages.<n>IndicSQuAD comprises extensive training, validation, and test sets for each language.<n>We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT.
arXiv Detail & Related papers (2025-05-06T16:42:54Z)
Tokenization Matters: Improving Zero-Shot NER for Indic Languages [2.964265227875254]
Tokenization is a critical component of Natural Language Processing (NLP) This work systematically compares BPE, SentencePiece, and Character Level tokenization strategies using Indic languages. Results show that SentencePiece is a consistently better performing approach than BPE for NER in low resource Indic languages.
arXiv Detail & Related papers (2025-04-23T17:28:38Z)
Efficient Continual Pre-training of LLMs for Low-resource Languages [45.44796295841526]
We develop a new algorithm to select a subset of texts from a larger corpus. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary.
arXiv Detail & Related papers (2024-12-13T16:13:35Z)
One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks [26.848664285007022]
ByT5-Sanskrit is designed for NLP applications involving the morphologically rich language Sanskrit. It is easier to deploy and more robust to data not covered by external linguistic resources. We show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages.
arXiv Detail & Related papers (2024-09-20T22:02:26Z)
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions [54.08017526771947]
Multilingual Reverse Instructions (MURI) generates high-quality instruction tuning datasets for low-resource languages. MURI produces instruction-output pairs from existing human-written texts in low-resource languages. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages.
arXiv Detail & Related papers (2024-09-19T17:59:20Z)
Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation [19.624330093598996]
Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian.
arXiv Detail & Related papers (2023-10-05T11:45:29Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
Meta-Learning a Cross-lingual Manifold for Semantic Parsing [75.26271012018861]
Localizing a semantic to support new languages requires effective cross-lingual generalization. We introduce a first-order meta-learning algorithm to train a semantic annotated with maximal sample efficiency during cross-lingual transfer. Results across six languages on ATIS demonstrate that our combination of steps yields accurate semantics sampling $le$10% of source training data in each new language.
arXiv Detail & Related papers (2022-09-26T10:42:17Z)
Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese. We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset. We find that a semantic trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus. Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.