Related papers: Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation

URL: http://arxiv.org/abs/2506.18158v1
Date: Sun, 22 Jun 2025 20:17:46 GMT
Title: Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation
Authors: Xinzge Gao, Chuanrui Hu, Bin Chen, Teng Li
Abstract summary: Chain-of-Memory (CoM) is a novel approach for explicitly modeling short-term and long-term memory in Graphical User Interface (GUI) agents. CoM enables GUI agents to better understand task states and retain critical historical information persistently.
Score: 6.815990151030097
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) are attracting growing attention in the development of Graphical User Interface (GUI) agents. Existing approaches often rely on historical screenshots or actions to implicitly represent the task state. This reliance poses challenges for GUI agents in accurately understanding task states and underscores the absence of effective mechanisms to store critical information in complex and lengthy cross-app tasks. To address these challenges, we propose Chain-of-Memory (CoM), a novel approach for explicitly modeling short-term and long-term memory in GUI agents. CoM achieves this by capturing action descriptions, integrating task-relevant screen information, and maintaining a dedicated memory module to store and manage this information. By leveraging explicit memory representations, CoM enables GUI agents to better understand task states and retain critical historical information persistently. To equip GUI agents with memory management capabilities and evaluate the effectiveness of CoM, we developed the GUI Odyssey-CoM, a dataset comprising 111k screen-action pairs annotated with Chain-of-Memory. Experimental results demonstrate that CoM significantly improves GUI agents' performance in cross-application tasks. Additionally, GUI Odyssey-CoM enables 7B models to achieve memory management capabilities comparable to 72B models. The dataset and code will be open-sourced.

Related papers

ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory [3.279665979821265]
ActionEngine is a training-free framework that transitions from reactive execution to programmatic planning. Our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline.
arXiv Detail & Related papers (2026-02-24T03:03:18Z)
Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents [57.38404718635204]
Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components. We propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy.
arXiv Detail & Related papers (2026-01-05T08:24:16Z)
History-Aware Reasoning for GUI Agents [15.519853892615272]
Current methods integrate Reinforcement Learning with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. We propose a History-Aware Reasoning framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge. We develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware.
arXiv Detail & Related papers (2025-11-12T09:06:25Z)
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction [30.45490249299358]
We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-10-28T08:19:58Z)
PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents [151.86841216364294]
We propose PAL-UI (Planning with Active Look-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool.
arXiv Detail & Related papers (2025-10-01T01:48:39Z)
GUI-PRA: Process Reward Agent for GUI Tasks [25.20594694997543]
Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step. We introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to better provide process reward than standard PRM.
arXiv Detail & Related papers (2025-09-27T11:42:36Z)
MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation [5.433829353194621]
MapAgent is a framework that leverages memory constructed from historical trajectories to augment current task planning. We introduce a coarse-to-fine task planning approach that retrieves relevant pages from the memory database based on similarity. Results in real-world scenarios demonstrate that MapAgent achieves superior performance to existing methods.
arXiv Detail & Related papers (2025-07-29T16:05:32Z)
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.35544552383581]
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents.
arXiv Detail & Related papers (2025-07-25T17:59:26Z)
Less is More: Empowering GUI Agent with Context-Aware Simplification [62.02157661751793]
We propose a context-aware framework for building an efficient and effective GUI Agent, termed SimpAgent. With the above components, SimpAgent reduces 27% FLOPs and achieves superior GUI navigation performances.
arXiv Detail & Related papers (2025-07-04T17:37:15Z)
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [84.62985963113245]
We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. We show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task.
arXiv Detail & Related papers (2025-06-18T19:44:46Z)
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents [49.89792845476579]
We introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
arXiv Detail & Related papers (2025-06-18T17:06:28Z)
MAPLE: A Mobile Agent with Persistent Finite State Machines for Structured Task Reasoning [46.18718721121415]
We present MAPLE, a state-aware multi-agent framework that abstracts app interactions as a Finite State Machine (FSM). We computationally model each UI screen as a discrete state and user actions as transitions, allowing the FSM to provide a structured representation of the app execution. MAPLE consists of specialized agents responsible for four phases of task execution: planning, execution, verification, error recovery, and knowledge retention.
arXiv Detail & Related papers (2025-05-29T16:08:51Z)
Task Memory Engine (TME): A Structured Memory Framework with Graph-Aware Extensions for Multi-Step LLM Agent Tasks [0.0]
We propose a lightweight and structured memory module that tracks task execution using a hierarchical Task Memory Tree (TMT). TME is designed to be graph-aware, supporting reusable substeps, converging task paths, and shared dependencies.
arXiv Detail & Related papers (2025-04-11T13:38:36Z)
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding [73.9254861755974]
This paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations. We evaluate the capabilities of current state-of-the-art MLLMs, including Image LLMs and Video LLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
Memory Sharing for Large Language Model based Agents [43.53494041932615]
This paper introduces Memory Sharing (MS), a framework that integrates real-time memory filtering, storage, and retrieval to enhance the In-Context Learning process. The experimental results demonstrate that the MS framework significantly improves the agents' performance in addressing open-ended questions.
arXiv Detail & Related papers (2024-04-15T17:57:30Z)
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP). With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.