Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach
- URL: http://arxiv.org/abs/2505.16422v2
- Date: Sat, 07 Jun 2025 02:50:58 GMT
- Title: Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach
- Authors: Xiaoran Yin, Xu Luo, Hao Wu, Lianli Gao, Jingkuan Song,
- Abstract summary: We propose a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment.<n>Our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate.
- Score: 83.21177515180564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The automatic control of mobile devices is essential for efficiently performing complex tasks that involve multiple sequential steps. However, these tasks pose significant challenges due to the limited environmental information available at each step, primarily through visual observations. As a result, current approaches, which typically rely on reactive policies, focus solely on immediate observations and often lead to suboptimal decision-making. To address this problem, we propose \textbf{Foresighted Planning with World Model-Driven Code Execution (FPWC)},a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment by developing a task-oriented, refinable \emph{world model} at the outset of the task. Foresighted actions are subsequently generated through iterative planning within this world model, executed in the form of executable code. Extensive experiments conducted in simulated environments and on real mobile devices demonstrate that our method outperforms previous approaches, particularly achieving a 44.4\% relative improvement in task success rate compared to the state-of-the-art in the simulated environment. Code and demo are provided in the supplementary material.
Related papers
- Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals.<n>We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z) - MORE: Mobile Manipulation Rearrangement Through Grounded Language Reasoning [13.535721260188694]
MORE is a novel approach for enhancing the capabilities of language models to solve zero-shot mobile manipulation planning tasks.<n>We evaluate MORE on 81 diverse rearrangement tasks from the BEHAVIOR-1K benchmark, where it becomes the first approach to successfully solve a significant share of the benchmark.
arXiv Detail & Related papers (2025-05-05T21:26:03Z) - Onto-LLM-TAMP: Knowledge-oriented Task and Motion Planning using Large Language Models [0.21990652930491858]
This work proposes a novel Onto-LLM-TAMP framework that employs knowledge-based reasoning to refine and expand user prompts with task-contextual reasoning and knowledge-based environment state descriptions.<n>The proposed framework is validated through both simulation and real-world scenarios, demonstrating significant improvements over the baseline approach in terms of adaptability to dynamic environments and the generation of semantically correct task plans.
arXiv Detail & Related papers (2024-12-10T13:18:45Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.<n>Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.<n>We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - A Meta-Engine Framework for Interleaved Task and Motion Planning using Topological Refinements [51.54559117314768]
Task And Motion Planning (TAMP) is the problem of finding a solution to an automated planning problem.
We propose a general and open-source framework for modeling and benchmarking TAMP problems.
We introduce an innovative meta-technique to solve TAMP problems involving moving agents and multiple task-state-dependent obstacles.
arXiv Detail & Related papers (2024-08-11T14:57:57Z) - LLM-SAP: Large Language Models Situational Awareness Based Planning [0.0]
We employ a multi-agent reasoning framework to develop a methodology that anticipates and actively mitigates potential risks.
Our approach diverges from traditional automata theory by incorporating the complexity of human-centric interactions into the planning process.
arXiv Detail & Related papers (2023-12-26T17:19:09Z) - Novelty Accommodating Multi-Agent Planning in High Fidelity Simulated Open World [7.821603097781892]
We address the challenge that arises when unexpected phenomena, termed textitnovelties, emerge within the environment.<n>The introduction of novelties into the environment can lead to inaccuracies within the planner's internal model, rendering previously generated plans obsolete.<n>We propose a general purpose AI agent framework designed to detect, characterize, and adapt to support concurrent actions and external scheduling.
arXiv Detail & Related papers (2023-06-22T03:44:04Z) - Continual Visual Reinforcement Learning with A Life-Long World Model [55.05017177980985]
We present a new continual learning approach for visual dynamics modeling.<n>We first introduce the life-long world model, which learns task-specific latent dynamics.<n>Then, we address the value estimation challenge for previous tasks with the exploratory-conservative behavior learning approach.
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - Temporal Predictive Coding For Model-Based Planning In Latent Space [80.99554006174093]
We present an information-theoretic approach that employs temporal predictive coding to encode elements in the environment that can be predicted across time.
We evaluate our model on a challenging modification of standard DMControl tasks where the background is replaced with natural videos that contain complex but irrelevant information to the planning task.
arXiv Detail & Related papers (2021-06-14T04:31:15Z) - Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill
Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.