Related papers: Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5

Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5

URL: http://arxiv.org/abs/2409.05401v2
Date: Fri, 25 Oct 2024 08:41:17 GMT
Title: Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5
Authors: Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen,
Abstract summary: We introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task and domain-specific challenges. We introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data.
Score: 8.21020989074456
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, comprehensive benchmarks for evaluating retrieval models in Hindi are lacking. To address this gap, we introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task and domain-specific challenges that impact Hindi retrieval performance. Building on the insights from these results, we introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data. We believe our contributions, which include the release of the Hindi-BEIR benchmark and the NLLB-E5 model, will prove to be a valuable resource for researchers and promote advancements in multilingual retrieval models.

Related papers

IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems [17.88837706307504]
IndicMSMarco is a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages.<n>We build a large-scale dataset of (question, answer, relevant passage)s derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-02T12:55:51Z)
Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance [0.0]
We present our latest Hindi-English bi-lingual LLM textbfMantra-14B with 3% average improvement in benchmark scores over both languages. We instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead.
arXiv Detail & Related papers (2025-04-13T23:10:13Z)
mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks. We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages [27.273651323572786]
BhasaAnuvaad is the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments.<n>Our experiments demonstrate improvements in the translation quality, setting a new standard for Indian language speech translation.<n>We will release all the code, data and model weights in the open-source, with permissive licenses to promote accessibility and collaboration.
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases. We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor. We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z)
Hindi-BEIR : A Large Scale Retrieval Benchmark in Hindi [8.21020989074456]
Despite ongoing research, there is a lack of comprehensive benchmark for evaluating retrieval models in Hindi. We introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval. We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance.
arXiv Detail & Related papers (2024-08-18T10:55:04Z)
Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India. It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages. We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families. We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus provides a training and testing ground for four language families, 14 languages, and the largest to date with 196 language pairs.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
Summarizing Indian Languages using Multilingual Transformers based Models [13.062351454646912]
We study how these multilingual models perform on the datasets which have Indian languages as source and target text. We experimented with IndicBART and mT5 models to perform the experiments and report the ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-4 scores as a performance metric.
arXiv Detail & Related papers (2023-03-29T13:05:17Z)
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages. The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.