Related papers: PARAM-1 BharatGen 2.9B Model

PARAM-1 BharatGen 2.9B Model

URL: http://arxiv.org/abs/2507.13390v1
Date: Wed, 16 Jul 2025 06:14:33 GMT
Title: PARAM-1 BharatGen 2.9B Model
Authors: Kundeshwar Pundalik, Piyush Sawarkar, Nihar Sahoo, Abhishek Shinde, Prateek Chanda, Vedant Goswami, Ajay Nagpal, Atul Singh, Viraj Thakur, Vijay Dewane, Aamod Thakur, Bhargav Patel, Smita Gautam, Bhagwan Panditi, Shyam Pawar, Madhav Kotcha, Suraj Racha, Saral Sureka, Pankaj Singh, Rishi Bal, Rohit Saluja, Ganesh Ramakrishnan,
Abstract summary: PARAM-1 is a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity.<n>It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks.
Score: 14.552007884700618
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks. By embedding diversity at the pretraining level-rather than deferring it to post-hoc alignment-PARAM-1 offers a design-first blueprint for equitable foundation modeling. Our results demonstrate that it serves as both a competent general-purpose model and a robust baseline for India-centric applications.

Related papers

IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding [2.062076715606512]
Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research.<n>IndicMMLU-Pro is a benchmark designed to evaluate Large Language Models (LLMs) across Indic languages.
arXiv Detail & Related papers (2025-01-27T03:19:03Z)
Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages [27.273651323572786]
BhasaAnuvaad is the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments.<n>Our experiments demonstrate improvements in the translation quality, setting a new standard for Indian language speech translation.<n>We will release all the code, data and model weights in the open-source, with permissive licenses to promote accessibility and collaboration.
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India. It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages. We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families. We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
Formal Aspects of Language Modeling [74.16212987886013]
Large language models have become one of the most commonly deployed NLP inventions. These notes are the accompaniment to the theoretical portion of the ETH Z"urich course on large language models.
arXiv Detail & Related papers (2023-11-07T20:21:42Z)
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages. We train and evaluate different self-supervised models alongside a commonly used baseline benchmark. We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z)
CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing. We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic. We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages [0.8155575318208631]
Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks. However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German. Indian languages, on the other hand, are underrepresented in such benchmarks.
arXiv Detail & Related papers (2020-11-04T14:43:43Z)
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.