Related papers: ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

URL: http://arxiv.org/abs/2602.20502v1
Date: Tue, 24 Feb 2026 03:03:18 GMT
Title: ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
Authors: Hongbin Zhong, Fazle Faisal, Luis França, Tanakorn Leesatapornwongsa, Adriana Szekeres, Kexin Rong, Suman Nath
Abstract summary: ActionEngine is a training-free framework that transitions from reactive execution to programmatic planning. Our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline.
Score: 3.279665979821265
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision language models--taking a screenshot, reasoning about the next action, executing it, then repeating on the new page--resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of previously visited pages. We propose ActionEngine, a training-free framework that transitions from reactive execution to programmatic planning through a novel two-agent architecture: a Crawling Agent that constructs an updatable state-machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks from the WebArena benchmark, our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline, while reducing cost by 11.8x and end-to-end latency by 2x. Together, these components yield scalable and reliable GUI interaction by combining global programmatic planning, crawler-validated action templates, and node-level execution with localized validation and repair.
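
The abstract's pipeline (a state-machine memory built offline, a plan derived from it, and a repair step on execution failure) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`StateMachineMemory`, `run_plan`, the selector strings, the `execute` and `reground` callbacks) is hypothetical, and a plain breadth-first search stands in for the LLM-based program synthesis that the paper describes.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (page state, action template); both hypothetical


@dataclass
class StateMachineMemory:
    """Hypothetical memory: page states are nodes, action templates are edges."""
    transitions: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_transition(self, src: str, action: str, dst: str) -> None:
        self.transitions.setdefault(src, {})[action] = dst
        self.transitions.setdefault(dst, {})

    def plan(self, start: str, goal: str) -> Optional[List[Step]]:
        """Breadth-first search for a shortest action sequence (stand-in planner)."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            for action, nxt in self.transitions.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [(state, action)]))
        return None


def run_plan(memory: StateMachineMemory, plan: List[Step],
             execute: Callable[[str, str], bool],
             reground: Callable[[str, str], str]) -> int:
    """Run each step; on failure, repair the action and update the memory."""
    repairs = 0
    for state, action in plan:
        if execute(state, action):
            continue
        fixed = reground(state, action)  # assumed vision-based fallback
        dst = memory.transitions[state].pop(action)
        memory.add_transition(state, fixed, dst)
        if not execute(state, fixed):
            raise RuntimeError(f"could not repair {action!r} in state {state!r}")
        repairs += 1
    return repairs


# Memory as an offline crawl might have recorded it (made-up selectors).
memory = StateMachineMemory()
memory.add_transition("home", "click:#forums", "forum_list")
memory.add_transition("forum_list", "click:#new-post", "post_editor")
memory.add_transition("home", "click:#profile", "profile")
plan = memory.plan("home", "post_editor")

# Simulated live UI in which one selector has since changed.
live_ui = {("home", "click:#forums"), ("forum_list", "click:#create-post")}
repairs = run_plan(
    memory, plan,
    execute=lambda state, action: (state, action) in live_ui,
    reground=lambda state, action: "click:#create-post",
)
```

In this toy run the whole plan is computed once from memory, and only the single stale step triggers the fallback, which mirrors the abstract's contrast between global planning and localized repair.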

Related papers

LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces [65.11019654023978]
LongCLI-Bench is a benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world tasks. Experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench.
arXiv Detail & Related papers (2026-02-15T23:12:57Z)
ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present a trajectory expansion framework Anchor that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z)
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration [16.593979443102754]
We introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. First, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable "memories". Third, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process.
arXiv Detail & Related papers (2025-12-22T13:42:18Z)
GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z)
CoAct-1: Computer-using Agents with Coding as Actions [94.99657662893338]
CoAct-1 is a novel multi-agent system that combines GUI-based control with direct programmatic execution. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%.
arXiv Detail & Related papers (2025-08-05T21:33:36Z)
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.35544552383581]
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents.
arXiv Detail & Related papers (2025-07-25T17:59:26Z)
Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation [6.815990151030097]
Chain-of-Memory (CoM) is a novel approach for explicitly modeling short-term and long-term memory in Graphical User Interface (GUI) agents. CoM enables GUI agents to better understand task states and retain critical historical information persistently.
arXiv Detail & Related papers (2025-06-22T20:17:46Z)
Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution. We develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test. Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z)
MAPLE: A Mobile Agent with Persistent Finite State Machines for Structured Task Reasoning [46.18718721121415]
We present MAPLE, a state-aware multi-agent framework that abstracts app interactions as a Finite State Machine (FSM). We computationally model each UI screen as a discrete state and user actions as transitions, allowing the FSM to provide a structured representation of the app execution. MAPLE consists of specialized agents responsible for four phases of task execution: planning, execution, verification, error recovery, and knowledge retention.
arXiv Detail & Related papers (2025-05-29T16:08:51Z)
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [30.693616802332745]
This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We propose an advanced Actor-Critic framework, which incorporates a sophisticated GUI driven by an AI agent and adept at handling lengthy procedural tasks.
arXiv Detail & Related papers (2023-12-20T15:28:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.