Related papers: Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages

Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages

URL: http://arxiv.org/abs/2401.18034v2
Date: Thu, 10 Oct 2024 16:19:59 GMT
Title: Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages
Authors: Mitodru Niyogi, Arnab Bhattacharya,
Abstract summary: We present "Paramanu", a family of novel language models (LM) for Indian languages. It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts. The models are pretrained on a single GPU with context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters.
Score: 3.9018931027384056
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present "Paramanu", a family of novel language models (LM) for Indian languages, consisting of auto-regressive monolingual, bilingual, and multilingual models pretrained from scratch. Currently, it covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu). The models are pretrained on a single GPU with context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters. We proposed a RoPE embedding scaling method that enables us to pretrain language models from scratch at larger sequence length context size than typical GPU memory permits. We also introduced a novel efficient Indic tokenizer, "mBharat", using a combination of BPE and Unigram, achieving the least fertility score and the ability to tokenize unseen languages in both the same script & Roman script. We also proposed and performed language-specific tokenization for multilingual models & domain-specific tokenization for monolingual models. To address the "curse of multilinguality" in our mParamanu model, we pretrained on comparable corpora based on typological grouping within the same script. Our findings show a language transfer phenomenon from low-resource to high-resource languages within languages of the same script & typology. Human evaluations for open-ended text generation demonstrated that Paramanu models outperformed several LLMs, despite being 20 to 64 times smaller. We created instruction-tuning datasets & instruction-tuned our models on 23,000 instructions in respective languages. Comparisons with multilingual LLMs across various benchmarks for natural language (NL) understanding, NL inference, & reading comprehension highlight the advantages of our models; leads to the conclusion that high quality generative LM are possible without high amount of compute power & enormous number of parameters.

Related papers

Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi [9.65814816271915]
LilMoo is a 0.6-billion- parameter Hindi language model trained entirely from scratch.<n>It is developed through a fully transparent and reproducible pipeline optimized for limited compute environments.<n>Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines.
arXiv Detail & Related papers (2026-03-03T20:31:25Z)
ILID: Native Script Language Identification for Indian Languages [0.0]
Core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments.<n>We release a dataset of 250K sentences consisting of 23 languages including English and all 22 official Indian languages labeled with their language identifiers.<n>Our models outperforms the state-of-the-art pre-trained transformer models for the language identification task.
arXiv Detail & Related papers (2025-07-16T01:39:32Z)
Poro 34B and the Blessing of Multilinguality [3.270981284471548]
Poro 34B is a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages. We show that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish.
arXiv Detail & Related papers (2024-04-02T11:34:12Z)
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script. We show that Furina outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages. We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z)
SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages [102.50127671423752]
We introduce SMaLL-100, a distilled version of the M2M-100 (12B) machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. Our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
arXiv Detail & Related papers (2022-10-20T22:32:29Z)
mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages. We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism. The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z)
Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi. We show that a character-based model trained on only 99k sentences of NArabizi and fined-tuned on a small treebank leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.