History-Aware Reasoning for GUI Agents
- URL: http://arxiv.org/abs/2511.09127v1
- Date: Thu, 13 Nov 2025 01:34:12 GMT
- Title: History-Aware Reasoning for GUI Agents
- Authors: Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li,
- Abstract summary: Current methods integrate Reinforcement Learning with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement.<n>We propose a History-Aware Reasoning framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge.<n>We develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware.
- Score: 15.519853892615272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users' concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning, interpreting the chained interactions as discrete screen understanding, i.e., unawareness of the historical interactions within the episode. This history-agnostic reasoning challenges their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.
Related papers
- EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration [16.593979443102754]
We introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory.<n>First, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model.<n>Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable ''memories''<n>Third, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process.
arXiv Detail & Related papers (2025-12-22T13:42:18Z) - GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation [25.824982644530326]
We present a reasoning-enhanced framework that integrates structured reasoning, action prediction, and history summarization.<n>This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance.
arXiv Detail & Related papers (2025-10-31T06:10:57Z) - PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents [151.86841216364294]
We propose textbfPAL-UI (textbfPlanning with textbfActive textbfLook-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required.<n> PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool.
arXiv Detail & Related papers (2025-10-01T01:48:39Z) - GUI-PRA: Process Reward Agent for GUI Tasks [25.20594694997543]
Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference.<n>PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step.<n>We introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to better provide process reward than standard PRM.
arXiv Detail & Related papers (2025-09-27T11:42:36Z) - GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning [11.909652592163896]
GUI-ReWalk is a multi-stage framework for synthesizing realistic and diverse GUI trajectories.<n>By combining randomness with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects intent-aware, adaptive nature of human-computer interaction.<n>Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent.
arXiv Detail & Related papers (2025-09-19T08:09:18Z) - MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.35544552383581]
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, iOS, Android, and Web platforms.<n>It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents.
arXiv Detail & Related papers (2025-07-25T17:59:26Z) - MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agent in online environment.<n>It synthesizes a curriculum of learnable tasks through self-exploration and filtering.<n>It adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
arXiv Detail & Related papers (2025-07-08T07:07:53Z) - Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation [6.815990151030097]
Chain-of-Memory (CoM) is a novel approach for explicitly modeling short-term and long-term memory in Graphical User Interface (GUI) agents.<n>CoM enables GUI agents to better understand task states and retain critical historical information persistently.
arXiv Detail & Related papers (2025-06-22T20:17:46Z) - Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution.<n>We develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test.<n>Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z) - AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction.<n>Our training pipeline includes grounding-aware pre-training to enhance perception.<n>AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents.<n>It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue.<n>It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP)
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.