VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
- URL: http://arxiv.org/abs/2508.07493v2
- Date: Mon, 25 Aug 2025 01:29:22 GMT
- Title: VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
- Authors: Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, Ruiyi Zhang
- Abstract summary: VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
- Score: 49.07705729597171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs, providing insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.
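The abstract does not spell out the evaluation protocol, but question-driven page retrieval of this kind is typically scored by checking whether the retriever ranks the gold evidence pages near the top of its ranking. The sketch below illustrates that setup with Recall@k and MRR; the metric choices, data layout, and function names are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of scoring page-level retrieval with Recall@k and MRR.
# The metrics and data layout are assumptions; VisR-Bench's exact protocol
# is not specified in the abstract.

from typing import Dict, List


def recall_at_k(ranked_pages: List[str], gold_pages: List[str], k: int) -> float:
    """Fraction of gold evidence pages found among the top-k retrieved pages."""
    if not gold_pages:
        return 0.0
    top_k = set(ranked_pages[:k])
    return sum(1 for p in gold_pages if p in top_k) / len(gold_pages)


def mean_reciprocal_rank(ranked_pages: List[str], gold_pages: List[str]) -> float:
    """Reciprocal rank of the first gold page in the ranking (0 if absent)."""
    gold = set(gold_pages)
    for rank, page in enumerate(ranked_pages, start=1):
        if page in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], labels: Dict[str, List[str]], k: int = 5):
    """Average Recall@k and MRR over all queries.

    runs:   query id -> pages ranked by the retriever (best first)
    labels: query id -> gold evidence pages for that query
    """
    recalls, mrrs = [], []
    for qid, ranked in runs.items():
        gold = labels.get(qid, [])
        recalls.append(recall_at_k(ranked, gold, k))
        mrrs.append(mean_reciprocal_rank(ranked, gold))
    n = max(len(recalls), 1)
    return sum(recalls) / n, sum(mrrs) / n


if __name__ == "__main__":
    # Hypothetical example: one query whose evidence sits on page 3 of doc_07.
    runs = {"q1": ["doc_07_p1", "doc_07_p3", "doc_07_p9"]}
    labels = {"q1": ["doc_07_p3"]}
    print(evaluate(runs, labels, k=5))  # -> (1.0, 0.5)
```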
Related papers
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval [0.0]
We present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages. We generalize across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with 150% relative improvements on cross-lingual retrieval.
arXiv Detail & Related papers (2025-12-03T07:17:59Z)
- Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation [47.714317480436215]
PREMIR is a simple framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings.
arXiv Detail & Related papers (2025-08-23T16:14:41Z)
- MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query [55.486895951981566]
This paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval.
arXiv Detail & Related papers (2025-06-03T17:59:14Z)
- Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering [60.062194349648195]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z)
- MIRACL-VISION: A Large, multilingual, visual document retrieval benchmark [1.8448587047759064]
We introduce MIRACL-VISION, a multilingual visual document retrieval evaluation benchmark. MIRACL-VISION covers 18 languages and is an extension of the MIRACL dataset. We observe a gap in the multilingual capabilities of state-of-the-art VLM-based embedding models.
arXiv Detail & Related papers (2025-05-16T19:22:19Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.39534684408116]
This work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels.
arXiv Detail & Related papers (2025-01-15T14:30:13Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRDs).
We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z)