Related papers: Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO

URL: http://arxiv.org/abs/2412.12997v2
Date: Fri, 17 Jan 2025 10:02:38 GMT
Title: Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO
Authors: Umer Butt, Stalin Veranasi, Günter Neumann
Abstract summary: This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation. We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset. Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results.
Score: 0.6554326244334868
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation. We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset. Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers. By bridging access gaps for speakers of low-resource languages, this work not only advances multilingual IR research but also emphasizes the ethical and societal importance of inclusive IR technologies. This work provides valuable insights into the challenges and solutions for improving language representation and lays the groundwork for future research, especially in South Asian languages, which can benefit from the adaptable methods used in this study.
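
The two metrics quoted in the abstract, MRR@10 and Recall@10, can be illustrated with a minimal Python sketch. It follows the commonly used definitions (reciprocal rank of the first relevant passage within the top 10, and the fraction of relevant passages found within the top 10, each averaged over queries). The paper's own evaluation script is not shown here, so the function names, the data layout, and the toy query and passage IDs below are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Average, over queries, of 1/rank of the first relevant passage
    in the top k results (contributes 0 if none appears)."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Average, over queries that have at least one relevant passage,
    of the fraction of relevant passages retrieved in the top k."""
    scores = []
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        if not gold:
            continue  # assumption: queries without judgments are skipped
        hits = len(gold & set(ranked[:k]))
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries, one relevant passage each.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4"]}
relevant = {"q1": {"p1"}, "q2": {"p5"}}

print(mrr_at_k(rankings, relevant))     # q1 hit at rank 2, q2 miss
print(recall_at_k(rankings, relevant))  # q1 fully recalled, q2 not
```

On this toy data, q1 contributes a reciprocal rank of 1/2 and q2 contributes 0, so a reported MRR@10 of 0.247 can be read loosely as the first relevant passage appearing around rank 4 on average.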

Related papers

MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z)
A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition [1.9500421038452647]
We introduce the U-MNER framework and release the Twitter2015-Urdu dataset. Adapted from the widely used Twitter2015 dataset, it is annotated with Urdu-specific grammar rules. Our model achieves state-of-the-art performance on the Twitter2015-Urdu dataset.
arXiv Detail & Related papers (2025-05-08T11:38:20Z)
Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation [7.383944919243126]
We propose a data augmentation technique that generates culturally plausible sentences, and we experiment on four low-resource Pakistani languages. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto.
arXiv Detail & Related papers (2025-04-07T15:18:34Z)
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [71.12193680015622]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings. We train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language [41.272055304311905]
This paper focuses on the resource-constrained Urdu language, which is widely spoken across South Asian nations. It outlines current research trends, technological advancements, and potential directions for future studies in Urdu ASR.
arXiv Detail & Related papers (2024-11-20T17:39:56Z)
Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages [24.856817602140193]
This study focuses on two endangered Austronesian languages, Amis and Seediq. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data.
arXiv Detail & Related papers (2024-09-13T14:35:47Z)
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language. We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions. We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language [4.720913027054481]
In this work, inspired by mMARCO and Mr.TyDi datasets, we translated all accessible open IR datasets into Polish. We introduced the BEIR-PL benchmark -- a new benchmark which comprises 13 datasets. We executed an evaluation and comparison of numerous IR models on the newly introduced BEIR-PL benchmark.
arXiv Detail & Related papers (2023-05-31T13:29:07Z)
Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data [26.38449396649045]
We show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead.
arXiv Detail & Related papers (2023-05-09T09:32:19Z)
Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another. We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold. We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages. The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)
Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources. Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages. We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.