Related papers: Dynamic Planning for LLM-based Graphical User Interface Automation

Dynamic Planning for LLM-based Graphical User Interface Automation

URL: http://arxiv.org/abs/2410.00467v2
Date: Tue, 22 Oct 2024 10:47:13 GMT
Title: Dynamic Planning for LLM-based Graphical User Interface Automation
Authors: Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbei Ma, Muyun Yang, Tiejun Zhao, Min Zhang,
Abstract summary: We propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT involves the dynamic adjustment of planning based on the environmental feedback and execution history. Experimental results reveal that the proposed D-PoT significantly surpassed the strong GPT-4V baseline by +12.7%.
Score: 48.31532014795368
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The advent of large language models (LLMs) has spurred considerable interest in advancing autonomous LLMs-based agents, particularly in intriguing applications within smartphone graphical user interfaces (GUIs). When presented with a task goal, these agents typically emulate human actions within a GUI environment until the task is completed. However, a key challenge lies in devising effective plans to guide action prediction in GUI tasks, though planning have been widely recognized as effective for decomposing complex tasks into a series of steps. Specifically, given the dynamic nature of environmental GUIs following action execution, it is crucial to dynamically adapt plans based on environmental feedback and action history.We show that the widely-used ReAct approach fails due to the excessively long historical dialogues. To address this challenge, we propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents.D-PoT involves the dynamic adjustment of planning based on the environmental feedback and execution history. Experimental results reveal that the proposed D-PoT significantly surpassed the strong GPT-4V baseline by +12.7% (34.66% $\rightarrow$ 47.36%) in accuracy. The analysis highlights the generality of dynamic planning in different backbone LLMs, as well as the benefits in mitigating hallucinations and adapting to unseen tasks. Code is available at https://github.com/sqzhang-lazy/D-PoT.

Related papers

Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [101.09478572153239]
We propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments.
arXiv Detail & Related papers (2025-04-22T17:52:42Z)
Plan-over-Graph: Towards Parallelable LLM Agent Schedule [53.834646147919436]
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning. This paper introduces a novel paradigm, plan-over-graph, in which the model first decomposes a real-life textual task into executable subtasks and constructs an abstract task graph. The model then understands this task graph as input and generates a plan for parallel execution.
arXiv Detail & Related papers (2025-02-20T13:47:51Z)
Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making. Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance. We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z)
VeriGraph: Scene Graphs for Execution Verifiable Robot Planning [33.8868315479384]
We propose VeriGraph, a framework that integrates vision-language models (VLMs) for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
arXiv Detail & Related papers (2024-11-15T18:59:51Z)
ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [39.606908488885125]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs) It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities. Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z)
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation [89.68433168477227]
Large Language Model (LLM) based agents have garnered significant attention and are becoming increasingly popular. This paper investigates enhancing the planning abilities of LLMs through instruction tuning. To address this limitation, this paper explores the automated synthesis of diverse environments and a gradual range of planning tasks.
arXiv Detail & Related papers (2024-08-01T17:59:46Z)
Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs) LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z)
Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver. We exploit data- and time-efficient Indu Logic Programming (ILP) to generate interpretable belief-based policy specifications. We show that learneds expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specifics within lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z)
AutoGPT+P: Affordance-based Task Planning with Large Language Models [6.848986296339031]
AutoGPT+P is a system that combines an affordance-based scene representation with a planning system. Our approach achieves a success rate of 98%, surpassing the current 81% success rate of the current state-of-the-art LLM-based planning method SayCan.
arXiv Detail & Related papers (2024-02-16T16:00:50Z)
Dynamic Planning with a LLM [15.430182858130884]
Large Language Models (LLMs) can solve many NLP tasks in zero-shot settings, but applications involving embodied agents remain problematic. Our work presents LLM Dynamic Planner (LLM-DP), a neuro-symbolic framework where an LLM works hand-in-hand with a traditional planner to solve an embodied task.
arXiv Detail & Related papers (2023-08-11T21:17:13Z)
Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint. During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations. Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.