TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
- URL: http://arxiv.org/abs/2511.01833v2
- Date: Wed, 05 Nov 2025 22:43:24 GMT
- Title: TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
- Authors: Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, Kaipeng Zhang
- Abstract summary: OpenAI o3 can create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Visual Search tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks.
- Score: 30.018325742295243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-source and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.
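The "novel tool use for image processing in chain-of-thought" that the abstract describes can be sketched minimally. The snippet below is an illustrative example only, not TIR-Bench's actual toolkit: it models an image as a grid of pixel rows and implements two hypothetical tools, `crop` and `rotate90`, of the kind a thinking-with-images agent might invoke mid-reasoning.

```python
# Illustrative sketch (hypothetical tools, not from the paper): an agent
# reasoning "with" images might call small image-transform tools like these
# between chain-of-thought steps. An image is modeled as a list of pixel rows.

def crop(image, top, left, height, width):
    """Return the height x width sub-image whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    # Reverse the rows, then transpose: the first column of the reversed
    # grid becomes the first row of the rotated grid.
    return [list(col) for col in zip(*image[::-1])]

if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    patch = crop(img, 0, 1, 2, 2)   # -> [[2, 3], [5, 6]]
    turned = rotate90(patch)        # -> [[5, 2], [6, 3]]
    print(patch, turned)
```

A real agent would apply such operations to pixel arrays (e.g. via an imaging library) rather than nested lists; the point is only that each tool call produces a transformed image the model can inspect in the next reasoning step.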
Related papers
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z) - MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning [55.221850286246]
We introduce MindWatcher, a tool-integrated reasoning agent with interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition.
arXiv Detail & Related papers (2025-12-29T12:16:12Z) - ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. Our results demonstrate that agentic capabilities significantly enhance both the accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z) - When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.71264263478083]
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. We include 546 multimodal problems, annotated with intermediate visual images and final answers.
arXiv Detail & Related papers (2025-11-04T18:00:51Z) - On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset [16.921428284844684]
Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable reasoning systems. We present a framework that augments large language models with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
arXiv Detail & Related papers (2025-10-27T00:58:48Z) - ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations. We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z) - GenIR: Generative Visual Feedback for Mental Image Retrieval [8.753622774569774]
We study the task of Mental Image Retrieval (MIR). MIR targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. We propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round.
arXiv Detail & Related papers (2025-06-06T16:28:03Z) - Visual Agentic Reinforcement Fine-Tuning [73.37007472426299]
This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and to write code that manipulates and analyzes input images through cropping, rotation, and other image processing techniques. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search.
arXiv Detail & Related papers (2025-05-20T11:59:25Z) - JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images [72.42826916932519]
We release JourneyBench, a benchmark of generated images to assess a model's fine-grained multimodal reasoning abilities. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models.
arXiv Detail & Related papers (2024-09-19T17:58:16Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of reference image, text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.