Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
- URL: http://arxiv.org/abs/2504.00906v1
- Date: Tue, 01 Apr 2025 15:40:27 GMT
- Title: Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
- Authors: Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
- Abstract summary: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices. We introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
- Score: 30.253353551910404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld in relative terms. Code available at https://github.com/simular-ai/Agent-S.
Related papers
- STEVE: A Step Verification Pipeline for Computer-use Agent Training [84.24814828303163]
STEVE is a step verification pipeline for computer-use agent training. GPT-4o is used to verify the correctness of each step in the trajectories based on the screens before and after the action execution. Our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory.
arXiv Detail & Related papers (2025-03-16T14:53:43Z)
- PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
- A3: Android Agent Arena for Mobile GUI Agents [46.73085454978007]
Mobile GUI agents are designed to autonomously perform tasks on mobile devices. Android Agent Arena (A3) is a novel evaluation platform for assessing performance on real-world, in-the-wild tasks. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios.
arXiv Detail & Related papers (2025-01-02T09:03:56Z)
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL)
Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.
These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
- AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant [26.571908014508214]
AgentStore is a scalable platform designed to dynamically integrate heterogeneous agents for automating computer tasks.
We propose a novel core MetaAgent with the AgentToken strategy to efficiently manage diverse agents.
Experiments on three challenging benchmarks demonstrate that AgentStore surpasses the limitations of previous systems with narrow capabilities.
arXiv Detail & Related papers (2024-10-24T09:58:40Z)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI).
Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that facilitates LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.