InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
- URL: http://arxiv.org/abs/2512.18745v1
- Date: Sun, 21 Dec 2025 14:23:07 GMT
- Title: InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
- Authors: Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang
- Abstract summary: We introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. We propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher).
- Score: 48.79494320593913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .
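To make the framework concrete, the following is a minimal, hypothetical sketch of how such a plug-and-play reasoner/searcher loop could be wired together. The function names, message schema, and stopping rule are assumptions for illustration; the abstract does not specify the actual interface between vReasoner and vSearcher.

```python
# Minimal sketch of an InSight-o3-style reasoner/searcher loop (illustrative only).
# The agent contracts, message schema, and stop condition below are assumptions,
# not the authors' API.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

@dataclass
class SearchResult:
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) region located by vSearcher
    crop: Any                        # cropped image evidence returned to the reasoner

def solve(image: Any,
          question: str,
          vreasoner: Callable[[Any, str, List[Tuple[str, Any]]], Dict[str, str]],
          vsearcher: Callable[[Any, str], SearchResult],
          max_steps: int = 8) -> Optional[str]:
    """Alternate between multimodal reasoning and generalized visual search.

    Assumed (hypothetical) contracts:
      vreasoner(image, question, evidence) -> {"action": "search", "query": ...}
                                           or {"action": "answer", "text": ...}
      vsearcher(image, query) -> SearchResult for a free-form region description,
                                 e.g. "the legend of the lower-right chart".
    """
    evidence: List[Tuple[str, Any]] = []            # (query, crop) pairs gathered so far
    for _ in range(max_steps):
        step = vreasoner(image, question, evidence)
        if step["action"] == "answer":
            return step["text"]
        result = vsearcher(image, step["query"])    # generalized visual search call
        evidence.append((step["query"], result.crop))
    return None                                     # search budget exhausted
```

In a loop like this, any frontier multimodal model could play the vReasoner role, with the purpose-trained vSearcher handling the free-form region queries it issues.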
Related papers
- BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents [30.849897676091327]
Multimodal large language models (MLLMs) are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. We introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.
arXiv Detail & Related papers (2026-02-13T12:25:13Z)
- When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.71264263478083]
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. We include 546 multimodal problems, annotated with intermediate visual images and final answers.
arXiv Detail & Related papers (2025-11-04T18:00:51Z)
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents [78.3863007028688]
MM-BrowseComp is a novel benchmark comprising 224 challenging, hand-crafted questions. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy.
arXiv Detail & Related papers (2025-08-14T13:46:47Z)
- RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning [15.670921552151775]
RingMo-Agent is designed to handle multi-modal and multi-platform data. It is supported by a large-scale vision-language dataset named RS-VL3M. It proves effective in both visual understanding and sophisticated analytical tasks.
arXiv Detail & Related papers (2025-07-28T12:39:33Z)
- Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z)
- Visual Agentic AI for Spatial Reasoning with a Dynamic API [26.759236329608935]
We introduce an agentic program synthesis approach to solve 3D spatial reasoning problems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API. We show that our method outperforms prior zero-shot models for visual reasoning in 3D.
arXiv Detail & Related papers (2025-02-10T18:59:35Z)
- MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents. Existing benchmarks mostly assess their reasoning abilities on the language side. MageBench is a multimodal agent benchmark oriented toward reasoning capability.
arXiv Detail & Related papers (2024-12-05T17:08:19Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; a small pooling sketch of this token nesting follows after this entry.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
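The nested-token idea in the Matryoshka entry above can be illustrated with a small, hypothetical sketch: a 24x24 grid of 576 visual tokens is average-pooled into progressively coarser grids, down to 9 and 1 tokens. The specific scales and the use of average pooling are assumptions inferred from the 9-vs-576 figures quoted in the abstract, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): build nested visual-token sets
# by average-pooling a 24x24 grid of 576 tokens into coarser grids.
import torch
import torch.nn.functional as F

def nested_token_sets(tokens: torch.Tensor, scales=(24, 12, 6, 3, 1)):
    """tokens: (576, d) visual tokens laid out on a 24x24 grid.
    Returns a dict mapping token count -> pooled tokens of shape (s*s, d)."""
    d = tokens.shape[-1]
    grid = tokens.reshape(1, 24, 24, d).permute(0, 3, 1, 2)   # (1, d, 24, 24)
    sets = {}
    for s in scales:
        pooled = F.adaptive_avg_pool2d(grid, s)               # (1, d, s, s)
        sets[s * s] = pooled.flatten(2).transpose(1, 2)[0]    # (s*s, d)
    return sets

# Example: the coarsest useful set keeps only 9 tokens (a 3x3 grid).
sets = nested_token_sets(torch.randn(576, 768))
print({k: tuple(v.shape) for k, v in sets.items()})           # {576: (576, 768), ..., 9: (9, 768), 1: (1, 768)}
```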