Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
- URL: http://arxiv.org/abs/2508.01918v1
- Date: Sun, 03 Aug 2025 21:03:22 GMT
- Title: Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
- Authors: Jaskaranjeet Singh, Rakesh Thakur
- Abstract summary: We present PunGPT2, the first fully open-source suite of Punjabi large language models. We also present Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever. We propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid advancement of large language models (LLMs), low-resource languages remain largely excluded from the NLP landscape. We present PunGPT2, the first fully open-source suite of Punjabi large language models, trained from scratch on a 35GB domain-diverse corpus encompassing literature, religious texts, news, and social discourse. Unlike prior multilingual approaches, PunGPT2 captures rich syntactic and morphological features unique to Punjabi through a tokenizer optimised with byte pair encoding and linguistically aligned pretraining objectives. To improve factual grounding and domain recall, we introduce Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant using QLoRA, enabling robust zero-shot and instruction-following performance with significantly reduced compute needs. As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching. By encoding queries using amplitude-based embeddings and retrieving via quantum kernel similarity, Quantum-RAG achieves improved contextual relevance with minimal memory overhead, marking the first practical integration of quantum representations in low-resource language generation. Our models significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency. This work provides a scalable, reproducible blueprint for extending LLM capabilities to underrepresented languages and pioneers quantum-aware retrieval in low-resource NLP.
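To make the retrieval idea concrete, below is a minimal sketch of a Quantum-RAG-style hybrid scorer, assuming the following reading of the abstract: dense embeddings are L2-normalised so they can be treated as amplitude encodings, similarity is a fidelity-style quantum kernel |<q|d>|^2, and the result is fused with classical BM25 scores by a weighted sum. The `embed` placeholder, the fusion weight `alpha`, and the use of the rank_bm25 package are illustrative assumptions rather than details from the paper; at corpus scale the brute-force dense scoring would be served from the FAISS index the authors mention.

```python
# Hypothetical sketch of a Quantum-RAG-style hybrid retriever (not the authors' code).
# Assumptions: `embed` stands in for any Punjabi sentence encoder; sparse scores come
# from the rank_bm25 package; the 50/50 fusion weight is purely illustrative.
import numpy as np
from rank_bm25 import BM25Okapi


def embed(texts):
    """Placeholder dense encoder; swap in a real Punjabi sentence embedder."""
    rng = np.random.default_rng(0)                     # deterministic toy vectors
    return rng.normal(size=(len(texts), 256)).astype("float32")


def amplitude_encode(vectors):
    """L2-normalise so each vector can be read as the amplitudes of a quantum state."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def fidelity_kernel(query_state, doc_states):
    """Quantum kernel similarity |<q|d>|^2, i.e. squared cosine for real amplitudes."""
    return (doc_states @ query_state) ** 2


def hybrid_search(query, docs, alpha=0.5, k=5):
    # Sparse channel: classical BM25 over whitespace tokens.
    bm25 = BM25Okapi([doc.split() for doc in docs])
    sparse = bm25.get_scores(query.split())
    sparse = sparse / (sparse.max() + 1e-9)            # rescale to [0, 1] for fusion

    # Dense channel: amplitude-encoded embeddings scored with the fidelity kernel.
    doc_states = amplitude_encode(embed(docs))
    query_state = amplitude_encode(embed([query]))[0]
    dense = fidelity_kernel(query_state, doc_states)   # already in [0, 1]

    fused = alpha * sparse + (1 - alpha) * dense       # simple weighted fusion
    return np.argsort(-fused)[:k]                      # indices of top-k passages
```

The squared-overlap form means the dense channel needs nothing beyond normalised classical embeddings, which is one way to read the paper's claim of minimal memory overhead; whatever fusion rule the authors actually use, this ranking step remains a drop-in replacement for a standard BM25-plus-dense ensemble.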
Related papers
- Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks [33.2185998586144]
This study benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages. Findings show significant performance differences driven by linguistic diversity and resource availability. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance.
arXiv Detail & Related papers (2025-07-25T22:35:10Z)
- NeoBabel: A Multilingual Open Tower for Visual Generation [32.79724699684266]
We introduce NeoBabel, a novel multilingual image generation framework. It supports six languages: English, Chinese, Dutch, French, Hindi, and Persian. It achieves state-of-the-art multilingual performance while retaining strong English capability.
arXiv Detail & Related papers (2025-07-08T16:19:45Z)
- Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages [2.8811725782388686]
This paper investigates how we can overcome the limitation via prompt engineering on large language models (LLMs), focusing on the low-resource Bengali language. We investigate six prompting strategies - zero-shot prompting, refusal suppression, flattering the classifier, multi-shot prompting, role prompting, and finally our innovative metaphor prompting - to detect hate speech effectively in low-resource languages. To prove the effectiveness of our metaphor prompting in the low-resource Bengali language, we also evaluate it in another low-resource language, Hindi, and two high-resource languages - English and German.
arXiv Detail & Related papers (2025-06-30T14:59:25Z)
- Efficient Generation of Parameterised Quantum Circuits from Large Texts [0.3298092151372303]
DisCoCirc is capable of directly encoding entire documents as parameterised quantum circuits (PQCs). This paper introduces an efficient methodology for converting large-scale texts into quantum circuits using tree-like representations of pregroup diagrams.
arXiv Detail & Related papers (2025-05-19T14:57:53Z)
- HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization [50.27950279695363]
Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. We propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding.
arXiv Detail & Related papers (2025-04-21T19:40:32Z)
- Toward Quantum Machine Translation of Syntactically Distinct Languages [0.0]
We explore the feasibility of language translation using quantum natural language processing algorithms on noisy intermediate-scale quantum (NISQ) devices.
We employ Shannon entropy to demonstrate the significant role of some appropriate angles of rotation gates in the performance of parametrized quantum circuits.
arXiv Detail & Related papers (2023-07-31T11:24:54Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Extrapolating Multilingual Understanding Models as Multilingual Generators [82.1355802012414]
This paper explores methods to equip multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)