Related papers: Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

URL: http://arxiv.org/abs/2505.14141v1
Date: Tue, 20 May 2025 09:45:55 GMT
Title: Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent
Authors: Fanglin Mo, Junzhe Chen, Haoxuan Zhu, Xuming Hu
Abstract summary: We propose SPlanner, a plug-and-play planning module to generate execution plans that guide vision language models (VLMs) in executing tasks. SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM, yielding a 28.8 percentage point improvement compared to using Qwen2.5-VL-72B without planning assistance.
Score: 13.259836345131525
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. This process often leads to difficulties in making accurate task plans, as GUI agents lack a deep understanding of how to effectively use the target applications, which can cause them to become "lost" during task execution. To address the task planning issue, we propose SPlanner, a plug-and-play planning module to generate execution plans that guide vision language models (VLMs) in executing tasks. The proposed planning module utilizes extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary functions modeled in the EFSMs, and generates the execution path by traversing the EFSMs. We further refine the execution path into a natural language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions to accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real-world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM executor, yielding a 28.8 percentage point improvement compared to using Qwen2.5-VL-72B without planning assistance.
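The abstract describes modeling an app as an extended finite state machine and producing an execution path by traversing it. As a rough illustration only, and not the authors' implementation, the sketch below models a hypothetical note-taking app as an EFSM whose transitions carry guards and variable updates, then uses breadth-first search to find a shortest action sequence to a goal state. All names here (`Transition`, `find_path`, `NOTE_APP`, the states, and the action labels) are assumptions made for this example; the LLM step that refines the path into a natural language plan is not shown.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Dict[str, object]  # EFSM variables, e.g. {"has_text": False}


@dataclass(frozen=True)
class Transition:
    source: str                      # state (screen) the action starts from
    action: str                      # human-readable GUI action label
    target: str                      # state reached after the action
    guard: Callable[[Context], bool] = lambda ctx: True     # enabling condition
    update: Callable[[Context], Context] = lambda ctx: ctx  # variable update


def find_path(transitions: List[Transition], start: str, goal: str,
              ctx: Context) -> Optional[List[str]]:
    """Breadth-first search over (state, variables) pairs.

    Returns a shortest list of action labels from start to goal,
    or None if the goal state cannot be reached.
    """
    queue = deque([(start, ctx, [])])
    seen = {(start, tuple(sorted(ctx.items())))}
    while queue:
        state, cur, path = queue.popleft()
        if state == goal:
            return path
        for t in transitions:
            if t.source != state or not t.guard(cur):
                continue
            nxt = t.update(dict(cur))  # copy so updates never alias
            key = (t.target, tuple(sorted(nxt.items())))
            if key not in seen:
                seen.add(key)
                queue.append((t.target, nxt, path + [t.action]))
    return None


# Hypothetical EFSM for a note-taking app (illustrative only).
NOTE_APP = [
    Transition("home", "tap 'New note'", "editor"),
    Transition("editor", "type note text", "editor",
               update=lambda ctx: {**ctx, "has_text": True}),
    Transition("editor", "tap 'Save'", "saved",
               guard=lambda ctx: bool(ctx["has_text"])),
]

if __name__ == "__main__":
    print(find_path(NOTE_APP, "home", "saved", {"has_text": False}))
```

The guard on the save transition is what makes this an extended FSM rather than a plain one: the search has to pass through the typing action first, because saving is only enabled once the `has_text` variable is set.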

Related papers

MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation [5.433829353194621]
MapAgent is a framework that leverages memory constructed from historical trajectories to augment current task planning. We introduce a coarse-to-fine task planning approach that retrieves relevant pages from the memory database based on similarity. Results in real-world scenarios demonstrate that MapAgent achieves superior performance to existing methods.
arXiv Detail & Related papers (2025-07-29T16:05:32Z)
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.35544552383581]
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents.
arXiv Detail & Related papers (2025-07-25T17:59:26Z)
Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models [63.765846080050906]
This paper proposes a novel parameter-efficient action planner using large language models (PEAP-LLM) to generate a single-step instruction at each location. Experiments show the superiority of our proposed model on REVERIE compared to the previous state-of-the-art.
arXiv Detail & Related papers (2025-05-12T12:38:20Z)
CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning [18.826366389246385]
We propose a new mobile assistant architecture with constrained high-frequency optimized planning (CHOP). Our approach overcomes the VLM's deficiency in GUI scenario planning by using human-planned subtasks as the basis vector. We evaluate our architecture in both English and Chinese contexts across 20 apps, demonstrating significant improvements in both effectiveness and efficiency.
arXiv Detail & Related papers (2025-03-05T18:56:16Z)
Plan-over-Graph: Towards Parallelable LLM Agent Schedule [53.834646147919436]
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning. This paper introduces a novel paradigm, plan-over-graph, in which the model first decomposes a real-life textual task into executable subtasks and constructs an abstract task graph. The model then understands this task graph as input and generates a plan for parallel execution.
arXiv Detail & Related papers (2025-02-20T13:47:51Z)
VeriGraph: Scene Graphs for Execution Verifiable Robot Planning [33.8868315479384]
We propose VeriGraph, a framework that integrates vision-language models (VLMs) for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
arXiv Detail & Related papers (2024-11-15T18:59:51Z)
Dynamic Planning for LLM-based Graphical User Interface Automation [48.31532014795368]
We propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT involves the dynamic adjustment of planning based on the environmental feedback and execution history. Experimental results reveal that the proposed D-PoT significantly surpassed the strong GPT-4V baseline by +12.7%.
arXiv Detail & Related papers (2024-10-01T07:49:24Z)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software. Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z)
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [30.693616802332745]
This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We propose an advanced Actor-Critic framework, which incorporates a sophisticated GUI driven by an AI agent and adept at handling lengthy procedural tasks.
arXiv Detail & Related papers (2023-12-20T15:28:38Z)
Learning adaptive planning representations with natural language guidance [90.24449752926866]
This paper describes Ada, a framework for automatically constructing task-specific planning representations. Ada interactively learns a library of planner-compatible high-level action abstractions and low-level controllers adapted to a particular domain of planning tasks.
arXiv Detail & Related papers (2023-12-13T23:35:31Z)
TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation. Specifically, task decomposition, tool selection, and parameter prediction are assessed. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
Long-Horizon Planning and Execution with Functional Object-Oriented Networks [79.94575713911189]
We introduce the idea of exploiting object-level knowledge as a FOON for task planning and execution. Our approach automatically transforms FOON into PDDL and leverages off-the-shelf planners, action contexts, and robot skills. We demonstrate our approach on long-horizon tasks in CoppeliaSim and show how learned action contexts can be extended to never-before-seen scenarios.
arXiv Detail & Related papers (2022-07-12T19:29:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.