Related papers: UI-Evol: Automatic Knowledge Evolving for Computer Use Agents

URL: http://arxiv.org/abs/2505.21964v2
Date: Mon, 03 Nov 2025 08:44:04 GMT
Title: UI-Evol: Automatic Knowledge Evolving for Computer Use Agents
Authors: Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu
Abstract summary: We propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents.
Score: 23.21178608410048
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows that even 90% correct knowledge yields only a 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.
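
Illustrative sketch: the two stages named in the abstract can be pictured as a small pipeline. The abstract does not say how either stage is implemented, so everything below is an assumption made for illustration: the `Step` type, the `executed` flag on logged events, and the rule-based comparison in `critique` are hypothetical stand-ins, not the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    action: str  # hypothetical action name, e.g. "click"
    target: str  # hypothetical UI element description


def retrace(trajectory: List[Dict]) -> List[Step]:
    """Retrace Stage (toy version): keep only the logged events that
    actually executed, giving the sequence the agent really performed."""
    return [
        Step(event["action"], event["target"])
        for event in trajectory
        if event.get("executed", False)
    ]


def critique(knowledge: List[str], retraced: List[Step],
             reference: List[Step]) -> List[str]:
    """Critique Stage (toy version): compare the retraced sequence with an
    external reference and append one corrective note per mismatch."""
    refined = list(knowledge)  # copy so the input knowledge is not mutated
    for i, ref in enumerate(reference):
        got = retraced[i] if i < len(retraced) else None
        if got != ref:
            done = f"{got.action} {got.target}" if got else "nothing"
            refined.append(
                f"Step {i + 1}: expected '{ref.action} {ref.target}', "
                f"agent did '{done}'."
            )
    return refined


# Hypothetical interaction log and reference sequence.
trajectory = [
    {"action": "click", "target": "File menu", "executed": True},
    {"action": "click", "target": "Export", "executed": False},
    {"action": "click", "target": "Save As", "executed": True},
]
reference = [Step("click", "File menu"), Step("click", "Export")]

retraced = retrace(trajectory)
refined = critique(["Use File > Export"], retraced, reference)
```

In this toy run, the refined knowledge keeps the original note and gains one correction for the second step, where the agent clicked "Save As" instead of "Export".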

Related papers

K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control [73.50217471850658]
K2-Agent is a hierarchical framework that models human-like cognition by knowing and co-evolving declarative (what) and procedural (how) knowledge for planning and execution. On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw and open-source backbones.
arXiv Detail & Related papers (2026-02-28T14:33:14Z)
Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation [57.65688895630163]
We introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. Our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without forgetting existing environments.
arXiv Detail & Related papers (2026-02-10T23:06:02Z)
AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z)
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration [16.593979443102754]
We introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. First, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable "memories". Third, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process.
arXiv Detail & Related papers (2025-12-22T13:42:18Z)
Real-Time Procedural Learning From Experience for AI Agents [2.543194442104227]
We propose Procedural Recall for Agents with eXperiences Indexed by State (PRAXIS). PRAXIS stores the consequences of actions and retrieves them by jointly matching environmental and internal states of past episodes to the current state. PRAXIS augments agentic action selection with retrieved state-action-result exemplars that are generated in real time.
arXiv Detail & Related papers (2025-11-27T03:51:49Z)
Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation [57.12284831164602]
Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation.
arXiv Detail & Related papers (2025-11-15T15:22:42Z)
Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing [23.554239007767276]
We introduce the first real-world, agent-oriented pentesting benchmark, TermiBench. We propose TermiAgent, a multi-agent penetration testing framework. In evaluations, our work outperforms state-of-the-art agents, exhibiting stronger penetration testing capability.
arXiv Detail & Related papers (2025-09-11T07:30:44Z)
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience [71.82719117238307]
We propose SEAgent, an agentic self-evolving framework enabling computer-use agents to evolve through interactions with unfamiliar software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach improves the success rate by 23.2 percentage points, from 11.3% to 34.5%, over a competitive open-source CUA.
arXiv Detail & Related papers (2025-08-06T17:58:46Z)
Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents [15.303188467166752]
We present CogniGUI, a cognitive framework developed to overcome limitations by enabling adaptive learning for GUI automation resembling human-like behavior. To assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark that includes multi-application navigation, dynamic state transitions, and cross-interface coherence. Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods in both the current GUI grounding benchmarks and our newly proposed benchmark.
arXiv Detail & Related papers (2025-06-22T06:30:52Z)
LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners [51.518410910148816]
Current large language model (LLM)-based agents remain stateless and unable to accumulate or transfer knowledge over time. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents.
arXiv Detail & Related papers (2025-05-17T10:09:11Z)
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents [30.253353551910404]
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices. We introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
arXiv Detail & Related papers (2025-04-01T15:40:27Z)
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users [34.70342284525283]
We propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent's task execution history. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy.
arXiv Detail & Related papers (2025-03-04T04:34:09Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization [56.674356045200696]
We propose a novel method to train AI agents to incorporate knowledge and skills for multiple tasks without the need for cumbersome note systems or prior high-quality demonstration data. Our approach employs an iterative process where the agent collects new experiences, receives corrective feedback from humans in the form of hints, and integrates this feedback into its weights. We demonstrate the efficacy of our approach by implementing it in a Llama-3-based agent that, after only a few rounds of feedback, outperforms advanced models GPT-4o and DeepSeek-V3 in tasksets.
arXiv Detail & Related papers (2025-02-03T17:45:46Z)
Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers. WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform. BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z)
Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context. We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.