One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?
- URL: http://arxiv.org/abs/2601.10205v1
- Date: Thu, 15 Jan 2026 09:10:14 GMT
- Title: One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?
- Authors: Arya Shah, Himanshu beniwal, Mayank Singh,
- Abstract summary: We present a benchmark spanning 12 Indian languages and four evaluation tasks.<n>E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer.<n>For classification, LaBSE attains 75.3% AUROC with strong calibration.
- Score: 1.071318785217926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4\% on monolingual retrieval and 20.7\% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1\% Recall@1. For classification, LaBSE attains 75.3\% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work\footnote{Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.
Related papers
- What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models [0.19116784879310025]
Cross-lingual information retrieval is challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models.<n>Existing pipelines often rely on translation and monolingual retrievals, which add computational overhead and noise, performance.<n>This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets.
arXiv Detail & Related papers (2025-11-24T17:17:40Z) - Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs [47.944645462877894]
Referring Expression (REC) requires models to localize objects in images based on natural language descriptions.<n>This work addresses multilingual REC through two main contributions.<n>We construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement.<n>The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects.
arXiv Detail & Related papers (2025-11-14T15:54:34Z) - L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages [2.584263027095689]
L3Cube-IndicHeadline-ID is a curated dataset spanning ten low-resource Indic languages.<n>Each language includes 20,000 news articles paired with four headline variants.<n>The task requires selecting the correct headline from the options using article-headline similarity.<n>We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity.
arXiv Detail & Related papers (2025-09-02T16:54:30Z) - IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems [17.88837706307504]
IndicMSMarco is a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages.<n>We build a large-scale dataset of (question, answer, relevant passage)s derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-02T12:55:51Z) - mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models.<n>We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance.<n>We see strong cross-lingual performance with English-based retrievers that trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - IndicSUPERB: A Speech Processing Universal Performance Benchmark for
Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z) - OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR)
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.