DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
- URL: http://arxiv.org/abs/2602.10809v1
- Date: Wed, 11 Feb 2026 12:51:10 GMT
- Authors: Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou
- Abstract summary: We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data.
- Score: 52.57197752244638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
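The baseline is described only at a high level: a modular agent with fine-grained tools and a dual-memory system for long-horizon navigation. A minimal Python sketch of such an exploration loop follows, assuming hypothetical `inspect` and `relevant` interfaces and a simple eviction-based consolidation policy in place of the paper's actual tools and memory design:

```python
# Minimal sketch of a dual-memory exploration agent over a visual history.
# All names here are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class DualMemory:
    working: list = field(default_factory=list)  # short-term buffer of recent observations
    store: dict = field(default_factory=dict)    # long-term record of consolidated findings
    capacity: int = 8

    def observe(self, note: str) -> None:
        self.working.append(note)
        if len(self.working) > self.capacity:    # evict oldest note into long-term store
            self.store[f"fact_{len(self.store)}"] = self.working.pop(0)

def inspect(image_id: str) -> str:
    """Stand-in for a fine-grained vision tool (captioning, OCR, zooming, ...)."""
    return f"caption of {image_id}"

def relevant(observation: str, query: str) -> bool:
    """Stand-in for the model's judgment that an observation satisfies the query."""
    return query.lower() in observation.lower()

def search(history: list, query: str, budget: int = 50) -> str | None:
    """Multi-step exploration over a visual history instead of one-shot matching."""
    memory = DualMemory()
    for step, image_id in enumerate(history):
        if step >= budget:                       # bound the long-horizon exploration
            return None
        observation = inspect(image_id)
        memory.observe(observation)
        if relevant(observation, query):
            return image_id                      # target located via contextual cues
    return None

print(search(["img_001", "img_002", "img_003"], "img_002"))  # -> img_002
```

A real agent would replace the linear scan with model-driven planning over both memories; the loop above only fixes the control-flow skeleton the abstract implies.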
Related papers
- PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval [29.907367363360652]
PhotoBench is the first benchmark constructed from authentic, personal albums. It is designed to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning.
arXiv Detail & Related papers (2026-03-02T06:02:40Z)
- What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation [35.62323084880028]
We propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Our method constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. It dynamically invokes tools, including retrieval augmentation, image cropping, and diffusion models, to gather domain-specific knowledge and enriched visual evidence.
arXiv Detail & Related papers (2026-02-12T02:51:59Z)
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment; a generic version of this judging pattern is sketched below. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z)
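The judging protocol itself is not specified here beyond "LMM-as-a-judge with task-specific metrics"; the sketch below shows the generic pattern under that assumption, with a stubbed `call_lmm` standing in for a real multimodal API:

```python
# Generic LMM-as-a-judge pattern: render the instruction into a rubric prompt,
# query a judge model with the image pair, and parse a structured verdict.
import json

RUBRIC = (
    "You are grading an image edit. Instruction: {instruction}\n"
    "Score instruction-following from 1 (ignored) to 5 (fully satisfied).\n"
    'Reply with JSON only: {{"score": <int>, "reason": "<short explanation>"}}'
)

def call_lmm(prompt: str, images: list) -> str:
    """Placeholder for a real multimodal judge call (text + images in, text out)."""
    return '{"score": 4, "reason": "edit matches the deictic reference"}'

def judge_edit(instruction: str, source_img: str, edited_img: str) -> int:
    """Render the rubric, query the judge, and parse a numeric score."""
    raw = call_lmm(RUBRIC.format(instruction=instruction), [source_img, edited_img])
    return int(json.loads(raw)["score"])

print(judge_edit("make the left cup red", "src.png", "out.png"))  # -> 4
```

The rubric wording, score range, and JSON contract are illustrative choices; a real harness would also retry or re-prompt on malformed judge output.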
- AgentLongBench: A Controllable Long Benchmark for Long-Context Agents via Environment Rollouts [78.33143446024485]
We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z)
- ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation [49.01601313084479]
ImAgent is a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation. Experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone.
arXiv Detail & Related papers (2025-11-14T17:00:29Z)
- Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce [61.03081096959132]
We propose a context-aware, reasoning-enhanced generative search framework for better understanding of the complicated context. Our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.
arXiv Detail & Related papers (2025-10-19T16:46:11Z)
- DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning [16.880791276029964]
"Thinking with images" is a shift in Vision Language Models from text-dominant chain-of-thought to image-interactive reasoning.<n>We present DeepSketcher, a comprehensive suite comprising both an image-text interleaved dataset and a self-contained model.<n>We design a model that performs interleaved image-text reasoning and generates "visual thoughts" by operating directly in the visual embedding space.
arXiv Detail & Related papers (2025-09-30T07:02:01Z)
- MLego: Interactive and Scalable Topic Exploration Through Model Reuse [12.133380833451573]
We present MLego, an interactive query framework designed to support real-time topic modeling analysis. Instead of retraining models from scratch, MLego efficiently merges materialized topic models to construct approximate results at interactive speeds; one plausible merge operator is sketched after this entry. We integrate MLego into a visual analytics prototype system, enabling users to explore large-scale textual datasets through interactive queries.
arXiv Detail & Related papers (2025-08-11T06:06:26Z)
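The abstract does not say how materialized topic models are merged. One plausible operator, shown below, greedily aligns topics by the similarity of their topic-word distributions and sums the matched count matrices; this is an assumption for illustration, not necessarily MLego's algorithm:

```python
# Hedged sketch: merge two topic models by greedy one-to-one topic matching on
# topic-word distribution overlap, then pool the matched count statistics.
import numpy as np

def merge_topic_models(counts_a: np.ndarray, counts_b: np.ndarray) -> np.ndarray:
    """Merge two (num_topics, vocab_size) topic-word count matrices."""
    assert counts_a.shape == counts_b.shape, "sketch assumes equal topic counts"
    def rows_to_dists(c):                        # counts -> topic-word distributions
        return c / c.sum(axis=1, keepdims=True)
    sim = rows_to_dists(counts_a) @ rows_to_dists(counts_b).T   # topic overlap matrix
    merged = counts_a.copy()
    used: set = set()
    for i in range(counts_a.shape[0]):           # greedy one-to-one topic matching
        free = np.array([j not in used for j in range(counts_b.shape[0])])
        j = int(np.argmax(np.where(free, sim[i], -np.inf)))
        used.add(j)
        merged[i] += counts_b[j]                 # pool matched sufficient statistics
    return merged

a = np.array([[5., 1., 0.], [0., 1., 6.]])
b = np.array([[0., 2., 7.], [6., 0., 1.]])
print(merge_topic_models(a, b))                  # a's topic 0 absorbs b's topic 1
```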
- GenIR: Generative Visual Feedback for Mental Image Retrieval [8.753622774569774]
We study the task of Mental Image Retrieval (MIR). MIR targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. We propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round.
arXiv Detail & Related papers (2025-06-06T16:28:03Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a dedicated pipeline for generating interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities; the sketch after this entry illustrates the general idea.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
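MME's compression scheme is not detailed in the summary. Nested "Matryoshka" granularities are commonly obtained by pooling the visual token grid at increasing factors, so the sketch below illustrates that general idea under that assumption:

```python
# Sketch of Matryoshka-style visual-token compression: average-pool an (H, W, D)
# grid of visual tokens at nested factors. Illustrative only; MME may differ.
import numpy as np

def pooled_tokens(grid: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool an (H, W, D) grid of visual tokens by `factor` per spatial axis."""
    h, w, d = grid.shape
    g = grid[: h - h % factor, : w - w % factor]       # crop to a multiple of factor
    g = g.reshape(h // factor, factor, w // factor, factor, d)
    return g.mean(axis=(1, 3)).reshape(-1, d)          # (H*W / factor**2, D) tokens

grid = np.random.rand(8, 8, 16)                        # 64 visual tokens, dim 16
for factor in (1, 2, 4):                               # nested granularities
    print(factor, pooled_tokens(grid, factor).shape)   # 64 -> 16 -> 4 tokens
```

Coarser pooled sets could then be embedded and indexed alongside finer ones, trading retrieval fidelity for token budget.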