Related papers: Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

URL: http://arxiv.org/abs/2504.00906v1
Date: Tue, 01 Apr 2025 15:40:27 GMT
Title: Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Authors: Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
Abstract summary: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices. We introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
Score: 30.253353551910404
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by a relative 52.8% on WindowsAgentArena and by a relative 16.52% on AndroidWorld. Code available at https://github.com/simular-ai/Agent-S.

Related papers

Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation [57.12284831164602]
Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation.
arXiv Detail & Related papers (2025-11-15T15:22:42Z)
TOM-SWE: User Mental Modeling For Software Engineering Agents [75.28749912645127]
ToM-SWE is a dual-agent architecture that pairs a primary software-engineering (SWE) agent with a lightweight theory-of-mind (ToM) partner agent. ToM-SWE infers user goals, constraints, and preferences from instructions and interaction history. In two software engineering benchmarks, ToM-SWE improves task success rates and user satisfaction.
arXiv Detail & Related papers (2025-10-24T16:09:51Z)
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents [39.3943822850841]
We introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm. We show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art.
arXiv Detail & Related papers (2025-09-09T09:46:01Z)
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.57043903478257]
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development.
arXiv Detail & Related papers (2025-08-06T14:33:45Z)
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis [59.83524388782554]
Graphical user interface (GUI) grounding remains a critical bottleneck in computer use agent development. We introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types. We synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples.
arXiv Detail & Related papers (2025-05-19T15:09:23Z)
STEVE: A Step Verification Pipeline for Computer-use Agent Training [84.24814828303163]
STEVE is a step verification pipeline for computer-use agent training. GPT-4o is used to verify the correctness of each step in the trajectories based on the screens before and after the action execution. Our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory.
arXiv Detail & Related papers (2025-03-16T14:53:43Z)
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
A3: Android Agent Arena for Mobile GUI Agents [46.73085454978007]
Mobile GUI agents are designed to autonomously perform tasks on mobile devices. Android Agent Arena (A3) is a novel evaluation platform for assessing performance on real-world, in-the-wild tasks. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios.
arXiv Detail & Related papers (2025-01-02T09:03:56Z)
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris is built on two techniques: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant [26.571908014508214]
AgentStore is a scalable platform designed to dynamically integrate heterogeneous agents for automating computer tasks. We propose a novel core MetaAgent with the AgentToken strategy to efficiently manage diverse agents. Experiments on three challenging benchmarks demonstrate that AgentStore surpasses the limitations of previous systems with narrow capabilities.
arXiv Detail & Related papers (2024-10-24T09:58:40Z)
Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation for building generally-capable agents. We take the first step towards building generally-capable LLM-based agents with self-evolution ability. We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z)
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.