TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
- URL: http://arxiv.org/abs/2511.01833v2
- Date: Wed, 05 Nov 2025 22:43:24 GMT
- Title: TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
- Authors: Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, Kaipeng Zhang
- Abstract summary: OpenAI o3 can create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Visual Search tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks.
- Score: 30.018325742295243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-source and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.
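The "novel tool use for image processing in chain-of-thought" that the abstract describes can be sketched minimally. The snippet below is an illustrative example only, not TIR-Bench's actual toolkit: it models an image as a grid of pixel rows and implements two hypothetical tools, `crop` and `rotate90`, of the kind a thinking-with-images agent might invoke mid-reasoning.

```python
# Illustrative sketch (hypothetical tools, not from the paper): an agent
# reasoning "with" images might call small image-transform tools like these
# between chain-of-thought steps. An image is modeled as a list of pixel rows.

def crop(image, top, left, height, width):
    """Return the height x width sub-image whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    # Reverse the rows, then transpose: the first column of the reversed
    # grid becomes the first row of the rotated grid.
    return [list(col) for col in zip(*image[::-1])]

if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    patch = crop(img, 0, 1, 2, 2)   # -> [[2, 3], [5, 6]]
    turned = rotate90(patch)        # -> [[5, 2], [6, 3]]
    print(patch, turned)
```

A real agent would apply such operations to pixel arrays (e.g. via an imaging library) rather than nested lists; the point is only that each tool call produces a transformed image the model can inspect in the next reasoning step.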
Related papers
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z) - MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning [55.221850286246]
We introduce MindWatcher, a tool-integrated reasoning agent with interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition.
arXiv Detail & Related papers (2025-12-29T12:16:12Z) - ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. Our results demonstrate that agentic capabilities significantly enhance both the accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z) - When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.71264263478083]
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. We include 546 multimodal problems, annotated with intermediate visual images and final answers.
arXiv Detail & Related papers (2025-11-04T18:00:51Z) - On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset [16.921428284844684]
Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable reasoning systems. We present a framework that augments large language models with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
arXiv Detail & Related papers (2025-10-27T00:58:48Z) - ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations. We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z) - GenIR: Generative Visual Feedback for Mental Image Retrieval [8.753622774569774]
We study the task of Mental Image Retrieval (MIR). MIR targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. We propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round.
arXiv Detail & Related papers (2025-06-06T16:28:03Z) - Visual Agentic Reinforcement Fine-Tuning [73.37007472426299]
This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and to write code that manipulates and analyzes input images through cropping, rotation, and other image processing techniques. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search.
arXiv Detail & Related papers (2025-05-20T11:59:25Z) - JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images [72.42826916932519]
We release JourneyBench, a benchmark of generated images to assess a model's fine-grained multimodal reasoning abilities. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models.
arXiv Detail & Related papers (2024-09-19T17:58:16Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of reference image, text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.