Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
- URL: http://arxiv.org/abs/2509.07969v1
- Date: Tue, 09 Sep 2025 17:54:21 GMT
- Title: Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
- Authors: Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
- Abstract summary: Mini-o3 is a system that executes deep, multi-turn reasoning spanning tens of steps. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
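The over-turn masking idea described in the abstract (excluding truncated, max-turn trajectories from the RL penalty so that long rollouts are not discouraged) can be sketched as follows. This is a minimal illustration assuming a REINFORCE-style per-trajectory loss; the function name and signature are hypothetical, not the authors' implementation.

```python
def masked_policy_loss(log_probs, advantages, hit_turn_limit):
    """Hypothetical sketch of over-turn masking for multi-turn RL.

    Trajectories that hit the maximum number of interaction turns
    (hit_turn_limit[i] is True) are masked out of the loss, so the
    policy receives zero gradient for them -- it is neither rewarded
    nor penalized for running long. This is an illustrative
    REINFORCE-style objective, not the paper's exact loss.
    """
    # Keep only trajectories that terminated within the turn budget.
    kept = [
        (lp, adv)
        for lp, adv, over in zip(log_probs, advantages, hit_turn_limit)
        if not over
    ]
    if not kept:
        # Every rollout was truncated: contribute no gradient at all.
        return 0.0
    # Standard policy-gradient loss averaged over the kept trajectories.
    return -sum(lp * adv for lp, adv in kept) / len(kept)
```

Because truncated rollouts are dropped rather than assigned a negative reward, training with a small turn budget (six turns in the paper) does not teach the model that long trajectories are bad, which is what allows test-time scaling to tens of turns.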
Related papers
- Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images [53.373427633330515]
We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
arXiv Detail & Related papers (2025-12-19T07:44:43Z)
- Efficient Odd-One-Out Anomaly Detection [7.456608146535316]
The odd-one-out anomaly detection task involves identifying odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models. We propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three.
arXiv Detail & Related papers (2025-09-04T15:44:37Z)
- Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z)
- Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks [42.022527376404476]
Embodied Reasoner is a model that extends o1-style reasoning to interactive embodied search tasks. We synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes. We develop a three-stage training pipeline that progressively enhances the model's capabilities.
arXiv Detail & Related papers (2025-03-27T17:00:51Z)
- EscapeCraft: A 3D Room Escape Environment for Benchmarking Complex Multimodal Reasoning Ability [11.721839449847472]
We introduce MM-Escape, a benchmark for investigating multimodal reasoning. MM-Escape emphasizes intermediate model behaviors alongside final task completion. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks. We observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities.
arXiv Detail & Related papers (2025-03-13T04:48:43Z)
- Accelerating exploration and representation learning with offline pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
arXiv Detail & Related papers (2023-03-31T18:03:30Z)
- Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z)
- Planning to Explore via Self-Supervised World Models [120.31359262226758]
Plan2Explore is a self-supervised reinforcement learning agent.
We present a new approach to self-supervised exploration and fast adaptation to new tasks.
Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods.
arXiv Detail & Related papers (2020-05-12T17:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.