Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
- URL: http://arxiv.org/abs/2503.21696v1
- Date: Thu, 27 Mar 2025 17:00:51 GMT
- Title: Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
- Authors: Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang
- Abstract summary: Embodied-Reasoner is a model that extends o1-style reasoning to interactive embodied search tasks. We synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes. We develop a three-stage training pipeline that progressively enhances the model's capabilities.
- Score: 42.022527376404476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied-Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning, which relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. Evaluation shows that our model significantly outperforms advanced visual reasoning models, exceeding OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks; evaluations in real-world environments confirm these gains.
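The trajectory format and the rejection-sampling stage described in the abstract can be made concrete with a short sketch. Everything below is an illustrative assumption based only on the abstract, not the authors' actual implementation: the field names (`observation`, `thought`, `action`), the `success` flag, and the `model_rollout` callable are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class OTAStep:
    """One Observation-Thought-Action step (hypothetical schema)."""
    observation: str  # e.g. an ID or path for the interactive image
    thought: str      # analysis, spatial reasoning, reflection, planning, or verification
    action: str       # environment action, e.g. "navigate(fridge)" or "open(cabinet_3)"

@dataclass
class Trajectory:
    task: str
    steps: List[OTAStep] = field(default_factory=list)
    success: bool = False  # whether the rollout completed the task

def rejection_sample(model_rollout: Callable[[str], Trajectory],
                     task: str, n_samples: int = 8) -> List[Trajectory]:
    """Stage-2-style self-exploration: sample several rollouts and keep
    only the successful ones for further fine-tuning. This is a generic
    rejection-sampling filter, not the paper's exact acceptance criterion."""
    rollouts = (model_rollout(task) for _ in range(n_samples))
    return [t for t in rollouts if t.success]
```

In the pipeline the abstract sketches, the kept trajectories would feed another round of fine-tuning, while the third stage (reflection tuning) would additionally train on self-correction data derived from failed attempts.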
Related papers
- VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [89.43196232124883]
VisuoThink is a novel framework that seamlessly integrates visuospatial and linguistic domains.
It enables progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search.
arXiv Detail & Related papers (2025-04-12T08:37:30Z)
- How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game [11.721839449847472]
We introduce MM-Escape, a benchmark for investigating multimodal reasoning.
MM-Escape emphasizes intermediate model behaviors alongside final task completion.
Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks.
We observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities.
arXiv Detail & Related papers (2025-03-13T04:48:43Z)
- Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision? [5.076961098583674]
We introduce a novel adversarial dataset to provide evidence for the dual thinking framework in human vision.
Our psychophysical studies show the presence of multiple inferences in rapid succession.
Analysis of errors shows that the early stopping of visual processing can result in missing relevant information.
arXiv Detail & Related papers (2024-06-11T05:50:34Z)
- Solving the Clustering Reasoning Problems by Modeling a Deep-Learning-Based Probabilistic Model [1.7955614278088239]
We introduce PMoC, a deep-learning-based probabilistic model, achieving high reasoning accuracy on the Bongard-Logo problem.
As a bonus, we also designed Pose-Transformer for complex visual abstract reasoning tasks.
arXiv Detail & Related papers (2024-03-05T18:08:29Z)
- Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations [62.48505112245388]
We take an in-depth look at the causal awareness of modern representations of agent interactions.
We show that recent representations are already partially resilient to perturbations of non-causal agents.
We propose a metric learning approach that regularizes latent representations with causal annotations.
arXiv Detail & Related papers (2023-12-07T18:57:03Z)
- Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data.
Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z)
- Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z)
- Causal Navigation by Continuous-time Neural Networks [108.84958284162857]
We propose a theoretical and experimental framework for learning causal representations using continuous-time neural networks.
We evaluate our method in the context of visual-control learning of drones over a series of complex tasks.
arXiv Detail & Related papers (2021-06-15T17:45:32Z)