MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
- URL: http://arxiv.org/abs/2510.24168v1
- Date: Tue, 28 Oct 2025 08:19:58 GMT
- Title: MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
- Authors: Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, Ding Wang
- Abstract summary: We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines.
- Score: 30.45490249299358
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: dependence on historical trajectories, which amplifies error propagation, and local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSWorld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: https://anonymous.4open.science/r/MGA-3571.
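The triad described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`StructuredMemory`, `EnvState`, `observe`, `decide`, `run`), the field types, and the toy label-matching policy that stands in for an MLLM call are assumptions made for illustration. It shows only the general shape of an observe-first loop in which each step sees the current observation plus a compact memory rather than a concatenated trajectory.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StructuredMemory:
    """Hypothetical compact memory: a summary of progress, not a raw trajectory."""
    completed: list[str] = field(default_factory=list)
    notes: dict[str, str] = field(default_factory=dict)

    def update(self, action: str, outcome: str) -> None:
        self.completed.append(action)
        self.notes["last_outcome"] = outcome


@dataclass
class EnvState:
    """The triad named in the abstract; the concrete field types are assumptions."""
    screenshot: bytes                              # current screenshot
    spatial_info: tuple[tuple[str, int, int], ...]  # (label, x, y) of visible elements
    memory: StructuredMemory                       # dynamically updated memory


def observe(screenshot: bytes, elements, memory: StructuredMemory) -> EnvState:
    """Build the per-step state from the current observation and the memory."""
    return EnvState(screenshot, tuple(elements), memory)


def decide(task: str, state: EnvState) -> str:
    """Toy stand-in for a model call: click the first visible element whose
    label appears in the task and that has not been clicked already."""
    for label, x, y in state.spatial_info:
        action = f"click({label}@{x},{y})"
        if label.lower() in task.lower() and action not in state.memory.completed:
            return action
    return "done"


def run(task: str, frames, max_steps: int = 10) -> list[str]:
    """Observe first, then decide; only the memory carries over between steps."""
    memory = StructuredMemory()
    actions: list[str] = []
    for screenshot, elements in frames[:max_steps]:
        state = observe(screenshot, elements, memory)
        action = decide(task, state)
        if action == "done":
            break
        actions.append(action)
        memory.update(action, outcome="ok")
    return actions


if __name__ == "__main__":
    frames = [
        (b"", [("Settings", 10, 20), ("Help", 30, 40)]),
        (b"", [("Settings", 10, 20), ("Privacy", 50, 60)]),
    ]
    print(run("Open Settings then Privacy", frames))
```

In this sketch the input to `decide` stays bounded by the size of one observation plus the memory, whereas a trajectory-concatenation design would grow its context with every step.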
Related papers
- RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation [71.2136732268131]
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions. Existing RGBT trackers rely solely on initial-frame visual information for target modeling. We propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking.
arXiv Detail & Related papers (2026-03-04T01:02:04Z)
- VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph [42.348770377488094]
VimRAG is a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. We propose a Graph-Guided Policy Optimization strategy to disentangle step-wise validity from trajectory-level rewards. Experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks.
arXiv Detail & Related papers (2026-02-13T09:05:09Z)
- GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
arXiv Detail & Related papers (2026-02-09T18:52:02Z)
- ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present Anchor, a trajectory expansion framework that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z)
- OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent [58.07447442040785]
We introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation. Results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales.
arXiv Detail & Related papers (2026-01-12T17:55:51Z)
- MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements [7.2364254826655925]
MEGA-GUI is a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results.
arXiv Detail & Related papers (2025-11-17T07:38:05Z)
- Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding [53.14935624161711]
GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only 2.0% and 3.7% accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of 35.7%.
arXiv Detail & Related papers (2025-09-29T00:06:31Z)
- GUI-PRA: Process Reward Agent for GUI Tasks [25.20594694997543]
Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step. We introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to provide better process rewards than a standard PRM.
arXiv Detail & Related papers (2025-09-27T11:42:36Z)
- Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation [6.815990151030097]
Chain-of-Memory (CoM) is a novel approach for explicitly modeling short-term and long-term memory in Graphical User Interface (GUI) agents. CoM enables GUI agents to better understand task states and retain critical historical information persistently.
arXiv Detail & Related papers (2025-06-22T20:17:46Z)
- MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [84.62985963113245]
We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. We show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task.
arXiv Detail & Related papers (2025-06-18T19:44:46Z)
- G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems [44.844636264484905]
Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents. We introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory. G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to 20.89% and 10.12%, respectively.
arXiv Detail & Related papers (2025-06-09T03:43:46Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation [69.01029651113386]
Embodied-RAG is a framework that enhances the model of an embodied agent with a non-parametric memory system. At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail. We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 250 explanation and navigation queries.
arXiv Detail & Related papers (2024-09-26T21:44:11Z)
- Sports-Traj: A Unified Trajectory Generation Model for Multi-Agent Movement in Sports [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs. Specifically, we introduce a Ghost Spatial Masking (GSM) module, embedded within a Transformer encoder, for spatial feature extraction. We benchmark three practical sports datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.