Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
- URL: http://arxiv.org/abs/2512.08188v1
- Date: Tue, 09 Dec 2025 02:36:26 GMT
- Title: Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
- Authors: Wenjiang Xu, Cindy Wang, Rui Fang, Mingkang Zhang, Lusong Li, Jing Xu, Jiayuan Gu, Zecui Zeng, Rui Chen,
- Abstract summary: Embodied Tree of Thoughts (EToT) is a novel Real2Sim2Real planning framework. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints.
- Score: 12.257547810949482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long-horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website at https://embodied-tree-of-thoughts.github.io .
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z)
- From Perception to Action: An Interactive Benchmark for Vision Reasoning [51.11355591375073]
We introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and to robustly translate perceived structure into effective actions.
arXiv Detail & Related papers (2026-02-24T15:33:02Z)
- From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models [4.52033729546524]
A world model is an AI system that simulates how an environment evolves under actions. Current world models suffer from visual conflation: the mistaken assumption that high-fidelity video generation implies an understanding of physical and causal dynamics. We show that while modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety-critical decision-making.
arXiv Detail & Related papers (2026-01-21T23:35:33Z)
- Aligning Agentic World Models via Knowledgeable Experience Learning [68.85843641222186]
We introduce WorldMind, a framework that constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. WorldMind achieves superior performance compared to baselines, with remarkable cross-model and cross-environment transferability.
arXiv Detail & Related papers (2026-01-19T17:33:31Z)
- SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models [60.80050275581661]
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities, but they lack a grounded understanding of physical dynamics. We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks.
arXiv Detail & Related papers (2025-12-05T18:51:03Z)
- Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain [62.01012517796797]
Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. We introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains.
arXiv Detail & Related papers (2025-10-20T17:59:03Z)
- ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning [77.49815848173613]
We propose a framework for abstract world models that jointly learns symbolic state representations and causal processes for both endogenous actions and mechanisms. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
arXiv Detail & Related papers (2025-09-30T13:44:34Z)
- OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning [50.45036742963495]
We introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning. A Task-Adaptive 3D Grounding mechanism enables context-aware 3D grounding for diverse embodied tasks, and an Embodiment-Aware Reasoning framework incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable.
arXiv Detail & Related papers (2025-09-11T10:32:22Z)
- SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning [6.255814224573073]
SimGenHOI is a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible humanoid-object interaction (HOI). Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding.
arXiv Detail & Related papers (2025-08-18T15:20:46Z)
- Scan, Materialize, Simulate: A Generalizable Framework for Physically Grounded Robot Planning [16.193477346643295]
Scan, Materialize, Simulate (SMS) is a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings.
arXiv Detail & Related papers (2025-05-20T21:55:01Z)
- DMWM: Dual-Mind World Model with Long-Term Imagination [43.39205414684229]
We propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. The proposed framework is evaluated on benchmark tasks from the DMControl suite that require long-term planning.
arXiv Detail & Related papers (2025-02-11T14:40:57Z)
- PhyPlan: Generalizable and Rapid Physical Task Planning with Physics Informed Skill Networks for Robot Manipulators [5.4089975505600005]
Existing methods for physical reasoning are data-hungry and struggle with complexity and uncertainty inherent in the real world.
This paper presents PhyPlan, a physics-informed planning framework that combines physics-informed neural networks (PINNs) with modified Monte Carlo Tree Search (MCTS) to enable embodied agents to perform dynamic physical tasks.
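The combination this summary describes, fast physics-informed predictions inside a Monte Carlo Tree Search loop, can be illustrated with a toy one-level search. The closed-form projectile predictor below is a hypothetical stand-in for a trained PINN, and all names and parameters are illustrative, not PhyPlan's actual interface.

```python
# Hedged sketch: MCTS (UCB1 over a one-level action set) whose rollouts
# query a fast physics-informed predictor instead of a full simulator.
import math
import random

def pinn_predict(angle_deg, speed=5.0, g=9.81):
    """Stub for a PINN: predicted landing distance of a projectile."""
    a = math.radians(angle_deg)
    return speed**2 * math.sin(2 * a) / g

def mcts_best_angle(target, angles=range(10, 50, 10), iters=200, seed=0):
    rng = random.Random(seed)
    stats = {a: [0.0, 0] for a in angles}      # angle -> [total reward, visits]
    for t in range(1, iters + 1):
        def ucb(a):                            # UCB1 selection rule
            tot, n = stats[a]
            if n == 0:
                return float("inf")
            return tot / n + math.sqrt(2 * math.log(t) / n)
        a = max(angles, key=ucb)
        # "Rollout": physics-informed prediction with small stochastic noise
        dist = pinn_predict(a) + rng.gauss(0, 0.01)
        reward = -abs(dist - target)           # closer to target is better
        stats[a][0] += reward
        stats[a][1] += 1
    return max(angles, key=lambda a: stats[a][1])  # most-visited action

# Ask for the angle whose predicted range matches a 40-degree launch.
best = mcts_best_angle(target=pinn_predict(40))
```

The design point mirrored here is that each rollout costs one cheap network (or formula) evaluation rather than a simulator step, which is what makes the tree search rapid.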
arXiv Detail & Related papers (2024-04-22T06:35:08Z)
- PhyPlan: Compositional and Adaptive Physical Task Reasoning with Physics-Informed Skill Networks for Robot Manipulators [5.680235630702706]
Existing methods for physical reasoning are data-hungry and struggle with complexity and uncertainty inherent in the real world.
This paper presents PhyPlan, a physics-informed planning framework that combines physics-informed neural networks (PINNs) with modified Monte Carlo Tree Search (MCTS) to enable embodied agents to perform dynamic physical tasks.
arXiv Detail & Related papers (2024-02-24T08:51:03Z)
- Planning and Execution using Inaccurate Models with Provable Guarantees [23.733488427663396]
We propose CMAX, an approach for interleaving planning and execution. CMAX adapts its planning strategy online during real-world execution to account for discrepancies between the dynamics assumed during planning and those encountered in the real world. We provide provable guarantees on the completeness and efficiency of the proposed planning and execution framework.
arXiv Detail & Related papers (2020-03-09T20:17:13Z)
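The CMAX summary above can be made concrete with a small grid-world sketch: plan with an inaccurate model, and whenever an executed transition disagrees with the model's prediction, penalize that state-action pair and replan. The grid, the blocked cell, and all names here are illustrative assumptions, not the paper's actual experimental setup.

```python
# Hedged sketch of the CMAX idea: the model wrongly believes every move
# succeeds; the real world has a blocked cell at (2, 0). Mispredicted
# (state, action) pairs are blacklisted online and planning continues.
import heapq

GOAL = (3, 0)
MOVES = {"R": (1, 0), "L": (-1, 0), "U": (0, 1), "D": (0, -1)}

def model_step(s, a):
    """Inaccurate model: assumes every move always succeeds."""
    dx, dy = MOVES[a]
    return (s[0] + dx, s[1] + dy)

def real_step(s, a):
    """Real world: moves into the blocked cell (2, 0) fail in place."""
    ns = model_step(s, a)
    return s if ns == (2, 0) else ns

def plan(start, incorrect):
    """Dijkstra over the model, skipping known-incorrect (s, a) pairs."""
    pq, seen = [(0, start, [])], set()
    while pq:
        cost, s, path = heapq.heappop(pq)
        if s == GOAL:
            return path
        if s in seen:
            continue
        seen.add(s)
        for a in MOVES:
            if (s, a) in incorrect:
                continue
            ns = model_step(s, a)
            if all(-1 <= c <= 4 for c in ns):        # keep the grid finite
                heapq.heappush(pq, (cost + 1, ns, path + [a]))
    return None

def cmax(start):
    s, incorrect = start, set()
    for _ in range(50):                      # execution budget
        if s == GOAL:
            return True
        a = plan(s, incorrect)[0]            # execute first planned action
        ns = real_step(s, a)
        if ns != model_step(s, a):           # model was wrong here
            incorrect.add((s, a))            # never trust this pair again
        s = ns
    return s == GOAL

reached = cmax((0, 0))
```

Note that the model itself is never repaired; the agent only learns where not to trust it, which is the mechanism that allows guarantees despite an inaccurate model.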
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.