ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge
- URL: http://arxiv.org/abs/2602.09839v1
- Date: Tue, 10 Feb 2026 14:45:02 GMT
- Title: ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge
- Authors: Yijie Lin, Guofeng Ding, Haochen Zhou, Haobin Li, Mouxing Yang, Xi Peng,
- Abstract summary: We introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives. ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. We observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks.
- Score: 19.93676370851117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
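To make the evaluation protocol concrete, the minimal sketch below shows how per-query Recall@1 against targeted hard negatives could be aggregated along the paper's two axes (knowledge domain and reasoning skill). This is an illustrative assumption of the setup, not the released ARK data format or code: the field names (`query`, `positive`, `hard_negatives`, `domain`, `skill`) and the embedding callables are hypothetical.

```python
# Hypothetical sketch of ARK-style dual-axis scoring; field names and the
# embedding callables are assumptions for illustration, not the official
# ARK data format or evaluation code.
from collections import defaultdict
import numpy as np

def recall_at_1(query_emb, positive_emb, negative_embs):
    """Return 1.0 if the positive candidate outscores every targeted hard negative."""
    pos_score = float(query_emb @ positive_emb)
    neg_scores = negative_embs @ query_emb
    return 1.0 if pos_score > float(neg_scores.max()) else 0.0

def evaluate(samples, embed_query, embed_candidate):
    """Aggregate Recall@1 along the two axes: knowledge domain and reasoning skill.

    `samples` is assumed to be a list of dicts with keys "query", "positive",
    "hard_negatives", "domain", and "skill"; `embed_query` / `embed_candidate`
    map text, image, or interleaved inputs to L2-normalized numpy vectors
    (e.g. from a CLIP-style dual encoder).
    """
    by_domain, by_skill = defaultdict(list), defaultdict(list)
    for s in samples:
        q = embed_query(s["query"])
        pos = embed_candidate(s["positive"])
        negs = np.stack([embed_candidate(c) for c in s["hard_negatives"]])
        hit = recall_at_1(q, pos, negs)
        by_domain[s["domain"]].append(hit)
        by_skill[s["skill"]].append(hit)
    return ({d: float(np.mean(v)) for d, v in by_domain.items()},
            {k: float(np.mean(v)) for k, v in by_skill.items()})
```

A re-ranking or query-rewriting enhancement of the kind the abstract mentions could be slotted into this loop by rescoring the top-k candidates with a stronger multimodal model before computing the same per-axis metrics.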
Related papers
- V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval [32.5242219186118]
We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification.
arXiv Detail & Related papers (2026-02-05T18:59:21Z)
- MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval [87.24221266746686]
We introduce MRMR, the first expert-level multidisciplinary multimodal retrieval benchmark requiring intensive reasoning. It challenges retrieval systems across diverse areas of expertise, enabling fine-grained model comparison across domains. Queries are reasoning-intensive, with images requiring deeper interpretation such as diagnosing microscopic slides.
arXiv Detail & Related papers (2025-10-10T16:14:56Z)
- MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z)
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents [78.3863007028688]
MM-BrowseComp is a novel benchmark comprising 224 challenging, hand-crafted questions. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy.
arXiv Detail & Related papers (2025-08-14T13:46:47Z)
- ADSeeker: A Knowledge-Infused Framework for Anomaly Detection and Reasoning [17.249025173985697]
We propose ADSeeker, an anomaly task assistant designed to enhance inspection performance through knowledge-grounded reasoning. To tackle the challenge of limited data in industrial anomaly detection (IAD), we introduce the largest-scale AD dataset, Multi-type Anomaly (MulA). Our plug-and-play framework, ADSeeker, achieves state-of-the-art zero-shot performance on several benchmark datasets.
arXiv Detail & Related papers (2025-08-05T05:05:06Z)
- METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
- Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering [60.062194349648195]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z)
- OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval [31.69320295943039]
Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA). We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy.
arXiv Detail & Related papers (2025-05-10T14:24:41Z)
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- CliniQ: A Multi-faceted Benchmark for Electronic Health Record Retrieval with Semantic Match Assessment [11.815222175336695]
We introduce a novel public EHR retrieval benchmark, CliniQ, to address this gap. We build our benchmark upon 1,000 discharge summary notes along with the ICD codes and prescription labels from MIMIC-III. We conduct a comprehensive evaluation of various retrieval methods, ranging from conventional exact match to popular dense retrievers.
arXiv Detail & Related papers (2025-02-10T08:33:47Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)