Related papers: The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

URL: http://arxiv.org/abs/2507.07543v1
Date: Thu, 10 Jul 2025 08:38:31 GMT
Title: The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Authors: Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan
Abstract summary: Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. We study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. We propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages.
Score: 6.594531626178451
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
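
Illustrative sketch (not the authors' implementation): the abstract only says that the proposed strategy enforces equal retrieval from both languages and gives no further detail. One plausible reading is to split the retrieval budget k evenly per language instead of taking a single global top-k. The function name `balanced_top_k`, the tuple layout, the language codes, and the scores below are all hypothetical, chosen for illustration only.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (doc_id, language code, retriever score).
Doc = Tuple[str, str, float]


def balanced_top_k(
    scored: Sequence[Doc],
    k: int,
    languages: Sequence[str] = ("ar", "en"),
) -> List[Doc]:
    """Take an equal share of the k slots from each language.

    Assumption: "equal retrieval" is read as k // len(languages)
    documents per language, each chosen by score within its language.
    """
    per_lang = k // len(languages)
    picked: List[Doc] = []
    for lang in languages:
        # Rank only the documents written in this language.
        in_lang = sorted(
            (d for d in scored if d[1] == lang),
            key=lambda d: d[2],
            reverse=True,
        )
        picked.extend(in_lang[:per_lang])
    # Present the merged set in descending score order.
    return sorted(picked, key=lambda d: d[2], reverse=True)


# Made-up scores for an English query: same-language documents score
# higher, so a plain global top-4 would contain English documents only.
example_scores: List[Doc] = [
    ("en1", "en", 0.91),
    ("en2", "en", 0.88),
    ("en3", "en", 0.85),
    ("en4", "en", 0.80),
    ("ar1", "ar", 0.78),
    ("ar2", "ar", 0.60),
    ("ar3", "ar", 0.55),
]

balanced = balanced_top_k(example_scores, k=4)
print([doc_id for doc_id, _, _ in balanced])
```

With these made-up scores, the balanced selection keeps two Arabic documents that a global top-4 would have dropped, which is the kind of cross-language ranking failure the abstract describes.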

Related papers

Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG [1.4425299138308667]
BM25 ranks documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches. We demonstrate that better document representation is the primary driver of benchmark improvements.
arXiv Detail & Related papers (2026-03-04T16:21:20Z)
Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation [73.54930910609328]
We propose LcRL, a multilingual search-augmented reinforcement learning framework. LcRL integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt the language-coupled group sampling in the rollout module to reduce knowledge bias, and regularize an auxiliary anti-consistency penalty in the reward models to mitigate the knowledge conflict.
arXiv Detail & Related papers (2026-01-21T11:32:32Z)
Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval [0.19116784879310025]
Multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance.
arXiv Detail & Related papers (2025-11-24T17:18:25Z)
Skeletons Matter: Dynamic Data Augmentation for Text-to-Query [66.52311036179294]
We formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-11-24T09:39:03Z)
Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs [0.19116784879310025]
Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Recent advances have shifted from translation-based methods toward embedding-based approaches. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques.
arXiv Detail & Related papers (2025-10-01T13:50:05Z)
VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents [2.0277446818410994]
This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform.
arXiv Detail & Related papers (2025-04-22T20:55:08Z)
Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task [73.35882908048423]
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering.
arXiv Detail & Related papers (2025-04-04T17:35:43Z)
Investigating Language Preference of Multilingual RAG Systems [4.438698005789677]
MRAG systems struggle with retrieving relevant information due to linguistic variations between queries and documents. We propose Dual Knowledge Multilingual RAG (DKM-RAG), a framework that fuses translated multilingual passages with complementary model knowledge. Empirical results demonstrate that DKM-RAG mitigates language preference in generation and enhances performance across diverse linguistic settings.
arXiv Detail & Related papers (2025-02-16T15:54:05Z)
mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that were trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a dataset of territorial disputes paired with retrieved Wikipedia documents, across 49 languages. We evaluate the cross-lingual robustness of this RAG setting by formalizing several modes for multilingual retrieval. Our experiments show that incorporating perspectives from diverse languages can in fact improve robustness.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval. We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning. On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages. We present the first fact-checking framework augmented with crosslingual retrieval. We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments. We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages. Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
LAReQA: Language-agnostic answer retrieval from a multilingual pool [29.553907688813347]
LAReQA tests for "strong" cross-lingual alignment. We find that augmenting training data via machine translation is effective. This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation.
arXiv Detail & Related papers (2020-04-11T20:51:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.