Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
- URL: http://arxiv.org/abs/2512.08188v1
- Date: Tue, 09 Dec 2025 02:36:26 GMT
- Title: Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
- Authors: Wenjiang Xu, Cindy Wang, Rui Fang, Mingkang Zhang, Lusong Li, Jing Xu, Jiayuan Gu, Zecui Zeng, Rui Chen,
- Abstract summary: Embodied Tree of Thoughts (EToT) is a novel Real2Sim2Real planning framework. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints.
- Score: 12.257547810949482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long-horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website at https://embodied-tree-of-thoughts.github.io .
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z)
- From Perception to Action: An Interactive Benchmark for Vision Reasoning [51.11355591375073]
We introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and to robustly translate perceived structure into effective actions.
arXiv Detail & Related papers (2026-02-24T15:33:02Z)
- From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models [4.52033729546524]
A world model is an AI system that simulates how an environment evolves under actions. Current world models suffer from visual conflation: the mistaken assumption that high-fidelity video generation implies an understanding of physical and causal dynamics. We show that while modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety-critical decision-making.
arXiv Detail & Related papers (2026-01-21T23:35:33Z)
- Aligning Agentic World Models via Knowledgeable Experience Learning [68.85843641222186]
We introduce WorldMind, a framework that constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. WorldMind achieves superior performance compared to baselines, with remarkable cross-model and cross-environment transferability.
arXiv Detail & Related papers (2026-01-19T17:33:31Z)
- SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models [60.80050275581661]
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities, but they lack a grounded understanding of physical dynamics. We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks.
arXiv Detail & Related papers (2025-12-05T18:51:03Z)
- Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain [62.01012517796797]
Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. We introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains.
arXiv Detail & Related papers (2025-10-20T17:59:03Z)
- ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning [77.49815848173613]
We propose a framework for abstract world models that jointly learns symbolic state representations and causal processes for both endogenous actions and mechanisms. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
arXiv Detail & Related papers (2025-09-30T13:44:34Z)
- OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning [50.45036742963495]
We introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning. A Task-Adaptive 3D Grounding mechanism enables context-aware 3D grounding for diverse embodied tasks, and an Embodiment-Aware Reasoning framework incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable.
arXiv Detail & Related papers (2025-09-11T10:32:22Z)
- SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning [6.255814224573073]
SimGenHOI is a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible humanoid-object interaction (HOI). Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding.
arXiv Detail & Related papers (2025-08-18T15:20:46Z)
- Scan, Materialize, Simulate: A Generalizable Framework for Physically Grounded Robot Planning [16.193477346643295]
Scan, Materialize, Simulate (SMS) is a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings.
arXiv Detail & Related papers (2025-05-20T21:55:01Z)
- DMWM: Dual-Mind World Model with Long-Term Imagination [43.39205414684229]
We propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. The proposed framework is evaluated on benchmark tasks from the DMControl suite that require long-term planning.
arXiv Detail & Related papers (2025-02-11T14:40:57Z)
- PhyPlan: Generalizable and Rapid Physical Task Planning with Physics Informed Skill Networks for Robot Manipulators [5.4089975505600005]
Existing methods for physical reasoning are data-hungry and struggle with complexity and uncertainty inherent in the real world.
This paper presents PhyPlan, a physics-informed planning framework that combines physics-informed neural networks (PINNs) with modified Monte Carlo Tree Search (MCTS) to enable embodied agents to perform dynamic physical tasks.
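The combination this summary describes, fast physics-informed predictions inside a Monte Carlo Tree Search loop, can be illustrated with a toy one-level search. The closed-form projectile predictor below is a hypothetical stand-in for a trained PINN, and all names and parameters are illustrative, not PhyPlan's actual interface.

```python
# Hedged sketch: MCTS (UCB1 over a one-level action set) whose rollouts
# query a fast physics-informed predictor instead of a full simulator.
import math
import random

def pinn_predict(angle_deg, speed=5.0, g=9.81):
    """Stub for a PINN: predicted landing distance of a projectile."""
    a = math.radians(angle_deg)
    return speed**2 * math.sin(2 * a) / g

def mcts_best_angle(target, angles=range(10, 50, 10), iters=200, seed=0):
    rng = random.Random(seed)
    stats = {a: [0.0, 0] for a in angles}      # angle -> [total reward, visits]
    for t in range(1, iters + 1):
        def ucb(a):                            # UCB1 selection rule
            tot, n = stats[a]
            if n == 0:
                return float("inf")
            return tot / n + math.sqrt(2 * math.log(t) / n)
        a = max(angles, key=ucb)
        # "Rollout": physics-informed prediction with small stochastic noise
        dist = pinn_predict(a) + rng.gauss(0, 0.01)
        reward = -abs(dist - target)           # closer to target is better
        stats[a][0] += reward
        stats[a][1] += 1
    return max(angles, key=lambda a: stats[a][1])  # most-visited action

# Ask for the angle whose predicted range matches a 40-degree launch.
best = mcts_best_angle(target=pinn_predict(40))
```

The design point mirrored here is that each rollout costs one cheap network (or formula) evaluation rather than a simulator step, which is what makes the tree search rapid.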
arXiv Detail & Related papers (2024-04-22T06:35:08Z)
- PhyPlan: Compositional and Adaptive Physical Task Reasoning with Physics-Informed Skill Networks for Robot Manipulators [5.680235630702706]
Existing methods for physical reasoning are data-hungry and struggle with complexity and uncertainty inherent in the real world.
This paper presents PhyPlan, a physics-informed planning framework that combines physics-informed neural networks (PINNs) with modified Monte Carlo Tree Search (MCTS) to enable embodied agents to perform dynamic physical tasks.
arXiv Detail & Related papers (2024-02-24T08:51:03Z)
- Planning and Execution using Inaccurate Models with Provable Guarantees [23.733488427663396]
We propose CMAX, an approach for interleaving planning and execution. CMAX adapts its planning strategy online during real-world execution to account for discrepancies between the dynamics assumed during planning and those encountered in the real world. We provide provable guarantees on the completeness and efficiency of the proposed planning and execution framework.
arXiv Detail & Related papers (2020-03-09T20:17:13Z)
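The CMAX summary above can be made concrete with a small grid-world sketch: plan with an inaccurate model, and whenever an executed transition disagrees with the model's prediction, penalize that state-action pair and replan. The grid, the blocked cell, and all names here are illustrative assumptions, not the paper's actual experimental setup.

```python
# Hedged sketch of the CMAX idea: the model wrongly believes every move
# succeeds; the real world has a blocked cell at (2, 0). Mispredicted
# (state, action) pairs are blacklisted online and planning continues.
import heapq

GOAL = (3, 0)
MOVES = {"R": (1, 0), "L": (-1, 0), "U": (0, 1), "D": (0, -1)}

def model_step(s, a):
    """Inaccurate model: assumes every move always succeeds."""
    dx, dy = MOVES[a]
    return (s[0] + dx, s[1] + dy)

def real_step(s, a):
    """Real world: moves into the blocked cell (2, 0) fail in place."""
    ns = model_step(s, a)
    return s if ns == (2, 0) else ns

def plan(start, incorrect):
    """Dijkstra over the model, skipping known-incorrect (s, a) pairs."""
    pq, seen = [(0, start, [])], set()
    while pq:
        cost, s, path = heapq.heappop(pq)
        if s == GOAL:
            return path
        if s in seen:
            continue
        seen.add(s)
        for a in MOVES:
            if (s, a) in incorrect:
                continue
            ns = model_step(s, a)
            if all(-1 <= c <= 4 for c in ns):        # keep the grid finite
                heapq.heappush(pq, (cost + 1, ns, path + [a]))
    return None

def cmax(start):
    s, incorrect = start, set()
    for _ in range(50):                      # execution budget
        if s == GOAL:
            return True
        a = plan(s, incorrect)[0]            # execute first planned action
        ns = real_step(s, a)
        if ns != model_step(s, a):           # model was wrong here
            incorrect.add((s, a))            # never trust this pair again
        s = ns
    return s == GOAL

reached = cmax((0, 0))
```

Note that the model itself is never repaired; the agent only learns where not to trust it, which is the mechanism that allows guarantees despite an inaccurate model.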
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.