Related papers: MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

URL: http://arxiv.org/abs/2510.24168v1
Date: Tue, 28 Oct 2025 08:19:58 GMT
Title: MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
Authors: Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, Ding Wang,
Abstract summary: We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide.<n>MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines.
Score: 30.45490249299358
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: Dependence on historical trajectories, which amplifies error propagation. And Local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: {https://anonymous.4open.science/r/MGA-3571}.

Related papers

RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation [71.2136732268131]
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions.<n>Existing RGBT trackers rely solely on initial-frame visual information for target modeling.<n>We propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking.
arXiv Detail & Related papers (2026-03-04T01:02:04Z)
VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph [42.348770377488094]
VimRAG is a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos.<n>We propose a Graph-Guided Policy Optimization strategy to disentangle step-wise validity from trajectory-level rewards.<n>Experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks.
arXiv Detail & Related papers (2026-02-13T09:05:09Z)
GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation.<n>GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality.<n>Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
arXiv Detail & Related papers (2026-02-09T18:52:02Z)
ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data.<n>We present a trajectory expansion framework Anchor that bootstraps scalable desktop supervision from a small set of verified seed demonstrations.<n>Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z)
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent [58.07447442040785]
We introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation.<n>Results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales.
arXiv Detail & Related papers (2026-01-12T17:55:51Z)
MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements [7.2364254826655925]
MEGA-GUI is a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding.<n> MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity.<n>On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results.
arXiv Detail & Related papers (2025-11-17T07:38:05Z)
Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding [53.14935624161711]
GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance.<n>This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization.<n> Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0%$ and $3.7%$ accuracy respectively when used independently, their integration within GMS framework yields an overall accuracy of $35.7%$.
arXiv Detail & Related papers (2025-09-29T00:06:31Z)
GUI-PRA: Process Reward Agent for GUI Tasks [25.20594694997543]
Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference.<n>PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step.<n>We introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to better provide process reward than standard PRM.
arXiv Detail & Related papers (2025-09-27T11:42:36Z)
Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation [6.815990151030097]
Chain-of-Memory (CoM) is a novel approach for explicitly modeling short-term and long-term memory in Graphical User Interface (GUI) agents.<n>CoM enables GUI agents to better understand task states and retain critical historical information persistently.
arXiv Detail & Related papers (2025-06-22T20:17:46Z)
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [84.62985963113245]
We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks.<n>At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning.<n>We show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task.
arXiv Detail & Related papers (2025-06-18T19:44:46Z)
G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems [44.844636264484905]
Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents.<n>We introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory.<n>G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89%$ and $10.12%$, respectively.
arXiv Detail & Related papers (2025-06-09T03:43:46Z)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents.<n>It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue.<n>It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation [69.01029651113386]
Embodied-RAG is a framework that enhances the model of an embodied agent with a non-parametric memory system.<n>At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail.<n>We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 250 explanation and navigation queries.
arXiv Detail & Related papers (2024-09-26T21:44:11Z)
Sports-Traj: A Unified Trajectory Generation Model for Multi-Agent Movement in Sports [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.<n>Specifically, we introduce a Ghost Spatial Masking (GSM) module, embedded within a Transformer encoder, for spatial feature extraction.<n>We benchmark three practical sports datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.