Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval
- URL: http://arxiv.org/abs/2511.19325v1
- Date: Mon, 24 Nov 2025 17:18:25 GMT
- Title: Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval
- Authors: Olivia Macmillan-Scott, Roksana Goworek, Eda B. Özyiğit
- Abstract summary: Multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.
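The pseudo-document expansion described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mLLM call is stubbed with a placeholder string, and a toy bag-of-words encoder stands in for a real multilingual dense encoder; all function names are illustrative.

```python
# Sketch of Query2Doc-style generative query expansion for dense retrieval.
from collections import Counter
import math

def generate_pseudo_document(query: str) -> str:
    """Placeholder for an mLLM prompted with something like:
    'Write a short passage that answers: {query}'."""
    return f"{query}. A passage elaborating on {query} with related terms."

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # multilingual dense encoder instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str]) -> list[tuple[float, str]]:
    # Concatenate the query with its generated pseudo-document before
    # encoding: the pseudo-document adds relevant terms and narrows the
    # length gap between short queries and long documents.
    expanded = query + " " + generate_pseudo_document(query)
    q_vec = embed(expanded)
    return sorted(((cosine(q_vec, embed(d)), d) for d in docs), reverse=True)
```

Ranking documents against the expanded query rather than the raw query is the core idea; in the cross-lingual setting the pseudo-document may also be generated in the document language rather than the query language.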
Related papers
- Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation [73.54930910609328]
We propose LcRL, a multilingual search-augmented reinforcement learning framework. LcRL integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt language-coupled group sampling in the rollout module to reduce knowledge bias, and regularize an auxiliary anti-consistency penalty in the reward models to mitigate knowledge conflict.
arXiv Detail & Related papers (2026-01-21T11:32:32Z)
- Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs [0.19116784879310025]
Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Recent advances have shifted from translation-based methods toward embedding-based approaches. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques.
arXiv Detail & Related papers (2025-10-01T13:50:05Z)
- VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
- The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora [5.0908395672023055]
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. We study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. We propose two simple retrieval strategies that address this source of failure by enforcing equal retrieval from both languages or by translating the query.
arXiv Detail & Related papers (2025-07-10T08:38:31Z)
- Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task [89.45111250272559]
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering.
arXiv Detail & Related papers (2025-04-04T17:35:43Z)
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that were trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
- Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a dataset of territorial disputes paired with retrieved Wikipedia documents, across 49 languages. We evaluate the cross-lingual robustness of this RAG setting by formalizing several modes for multilingual retrieval. Our experiments show that incorporating perspectives from diverse languages can in fact improve robustness.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval [39.061900747689094]
CORA is a Cross-lingual Open-Retrieval Answer Generation model.
It can answer questions across many languages even when language-specific annotated data or knowledge sources are unavailable.
arXiv Detail & Related papers (2021-07-26T06:02:54Z)
- LAReQA: Language-agnostic answer retrieval from a multilingual pool [29.553907688813347]
LAReQA tests for "strong" cross-lingual alignment.
We find that augmenting training data via machine translation is effective.
This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation.
arXiv Detail & Related papers (2020-04-11T20:51:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.