Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG
- URL: http://arxiv.org/abs/2603.04238v1
- Date: Wed, 04 Mar 2026 16:21:20 GMT
- Title: Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG
- Authors: Martin Asenov, Kenza Benkirane, Dan Goldwater, Aneiss Ghodsi,
- Abstract summary: BM25 ranks documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches. We demonstrate that better document representation is the primary driver of benchmark improvements.
- Score: 1.4425299138308667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval systems relied on lexical methods such as BM25, which rank documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches, especially for multilingual documents with complex visual layouts. We demonstrate that better document representation is the primary driver of benchmark improvements. By systematically varying transcription and preprocessing methods while holding the retrieval mechanism fixed, we demonstrate that BM25 can recover large gaps on multilingual and visual benchmarks. Our findings call for decomposed evaluation benchmarks that separately measure transcription and retrieval capabilities, enabling the field to correctly attribute progress and focus effort where it matters.
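The abstract's description of BM25 (term overlap with corpus-level weighting) and of the experimental idea (change the transcription, keep the retriever fixed) can be illustrated with a minimal sketch. This is not the authors' code: the scoring function is a common Okapi BM25 variant, the defaults `k1=1.5` and `b=0.75` are conventional values assumed here, and the two toy corpora and the `tokenize` helper are hypothetical stand-ins for a noisy and a clean transcription of the same pages.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical preprocessing step: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Corpus-level weighting: count how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no term overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Same retriever, two hypothetical transcriptions of the same two pages.
corpus_noisy = ["inv0ice t0tal am0unt due", "shipping address and contact"]
corpus_clean = ["invoice total amount due", "shipping address and contact"]
query = tokenize("invoice total")

noisy = bm25_scores(query, [tokenize(d) for d in corpus_noisy])
clean = bm25_scores(query, [tokenize(d) for d in corpus_clean])
print("noisy transcription:", noisy)
print("clean transcription:", clean)
```

With the noisy transcription the query terms never match, so the relevant page scores zero. With the clean transcription the same BM25 function ranks it first. The retrieval mechanism is identical in both runs and only the document representation changes.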
Related papers
- ReFeed: Retrieval Feedback-Guided Dataset Construction for Style-Aware Query Rewriting [0.4077787659104315]
Retrieval systems often fail when user queries differ stylistically or semantically from the language used in domain documents. This work highlights a new direction in data-centric information retrieval, emphasizing how feedback loops and document-style alignment can enhance the reasoning and adaptability of RAG systems.
arXiv Detail & Related papers (2026-03-02T03:43:53Z)
- VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
- Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering [60.062194349648195]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z)
- Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a dataset of territorial disputes paired with retrieved Wikipedia documents, across 49 languages. We evaluate the cross-lingual robustness of this RAG setting by formalizing several modes for multilingual retrieval. Our experiments show that incorporating perspectives from diverse languages can in fact improve robustness.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- CODER: An efficient framework for improving retrieval through COntextualized Document Embedding Reranking [11.635294568328625]
We present a framework for improving the performance of a wide class of retrieval models at minimal computational cost.
It utilizes precomputed document representations extracted by a base dense retrieval method.
It incurs a negligible computational overhead on top of any first-stage method at run time, allowing it to be easily combined with any state-of-the-art dense retrieval method.
arXiv Detail & Related papers (2021-12-16T10:25:26Z)
- Pre-training via Paraphrasing [96.79972492585112]
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual paraphrasing objective.
We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization.
For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation.
arXiv Detail & Related papers (2020-06-26T14:43:43Z)