Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
- URL: http://arxiv.org/abs/2508.17079v1
- Date: Sat, 23 Aug 2025 16:14:41 GMT
- Title: Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
- Authors: Yejin Choi, Jaewoo Park, Janghan Yoon, Saejin Kim, Jaehyun Jeon, Youngjae Yu
- Abstract summary: PREMIR is a simple framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings.
- Score: 47.714317480436215
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real-world documents that combine text and visuals. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all retrieval metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the model's robustness in real-world settings.
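A minimal sketch of the flow the abstract describes, where preQs are generated per document offline and the query is matched against them at the token level, is given below. Every name in it (generate_preqs, token_overlap_score, the toy scorer) is an illustrative placeholder, not the authors' implementation, which relies on an MLLM for question generation and richer token-level matching.

```python
# Sketch of the idea in the abstract: an MLLM writes cross-modal pre-questions
# (preQs) for each document offline; at query time the user query is matched
# against those preQs at the token level. All names here are hypothetical.
import re
from collections import Counter
from dataclasses import dataclass, field


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class Document:
    doc_id: str
    preqs: list[str] = field(default_factory=list)  # preQs covering text, tables, figures, ...


def generate_preqs(doc_text: str, image_captions: list[str]) -> list[str]:
    """Stand-in for the offline MLLM call that writes questions a reader
    might ask about each modality of the document."""
    # In practice this step prompts an MLLM; here we fabricate simple questions.
    return [f"What does the document say about {w}?" for w in tokenize(doc_text)[:3]] + [
        f"What is shown in the figure described as '{c}'?" for c in image_captions
    ]


def token_overlap_score(query: str, preq: str) -> float:
    """Toy token-level match: fraction of query tokens that also appear in the preQ."""
    q_tokens = Counter(tokenize(query))
    p_tokens = set(tokenize(preq))
    hits = sum(n for tok, n in q_tokens.items() if tok in p_tokens)
    return hits / max(sum(q_tokens.values()), 1)


def retrieve(query: str, docs: list[Document], top_k: int = 3) -> list[tuple[str, float]]:
    """Score each document by its best-matching preQ and return the top-k doc ids."""
    scored = [(d.doc_id, max((token_overlap_score(query, q) for q in d.preqs), default=0.0))
              for d in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    docs = [
        Document("report-1", generate_preqs("Quarterly revenue grew sharply", ["revenue bar chart"])),
        Document("manual-7", generate_preqs("Install the pump filter carefully", ["exploded parts diagram"])),
    ]
    print(retrieve("how did revenue change this quarter", docs))
```

Running the script ranks report-1 first for the revenue query; the point of the pattern is that the expensive MLLM call happens once per document at indexing time, while query-time matching stays lightweight.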
Related papers
- Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval [27.493644447594367]
MCMR (Multi-Conditional Multimodal Retrieval) is a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. It spans five product domains: upper clothing, bottom clothing, jewelry, shoes, and furniture. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities.
arXiv Detail & Related papers (2026-03-01T12:53:47Z)
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model [9.224965304457708]
This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework. It incorporates image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. Experiments on real-world datasets and public benchmarks on knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results.
arXiv Detail & Related papers (2024-11-19T07:16:48Z)
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs). We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR. (A minimal sketch of the bi-encoder pattern appears after this list.)
arXiv Detail & Related papers (2024-11-04T20:06:34Z)
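The MM-Embed entry above describes fine-tuning an MLLM as a bi-encoder retriever. The sketch below shows only the generic bi-encoder pattern (queries and candidates encoded independently, then scored by similarity); the hashed toy encoder is a text-only stand-in for the fine-tuned multimodal model, not the authors' architecture.

```python
# Minimal bi-encoder retrieval sketch in PyTorch. The encoder below is a toy
# (hashed token embeddings, mean-pooled, L2-normalised); a real system such as
# MM-Embed fine-tunes a multimodal LLM to produce these embeddings instead.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in encoder: hashed token embeddings, mean-pooled and L2-normalised."""

    def __init__(self, vocab_buckets: int = 10_000, dim: int = 64):
        super().__init__()
        self.buckets = vocab_buckets
        self.emb = nn.Embedding(vocab_buckets, dim)

    def forward(self, texts: list[str]) -> torch.Tensor:
        vecs = []
        for text in texts:
            ids = torch.tensor([hash(tok) % self.buckets for tok in text.lower().split()],
                               dtype=torch.long)
            vecs.append(self.emb(ids).mean(dim=0))
        return F.normalize(torch.stack(vecs), dim=-1)  # unit-norm embeddings


if __name__ == "__main__":
    encoder = ToyEncoder()  # bi-encoder: the same encoder is applied to both sides independently
    queries = ["chart of quarterly revenue"]
    candidates = [
        "revenue grew 12% as shown in the bar chart",
        "installation steps for the pump filter",
    ]
    with torch.no_grad():
        q = encoder(queries)      # shape (1, dim)
        c = encoder(candidates)   # shape (2, dim)
        scores = q @ c.T          # cosine similarity, since both sides are normalised
    best = scores.argmax(dim=-1).item()
    print(f"best match: {candidates[best]!r} (score {scores[0, best].item():.3f})")
```

The practical appeal of the bi-encoder setup is that candidate embeddings can be precomputed and indexed, so only the query has to be encoded at search time.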