OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- URL: http://arxiv.org/abs/2402.07456v2
- Date: Thu, 15 Feb 2024 09:30:48 GMT
- Title: OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- Authors: Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu,
Shunyu Yao, Tao Yu and Lingpeng Kong
- Abstract summary: We introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS)
We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks.
On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks.
- Score: 48.29860831901484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous interaction with the computer has been a longstanding challenge
with great potential, and the recent proliferation of large language models
(LLMs) has markedly accelerated progress in building digital agents. However,
most of these agents are designed to interact with a narrow domain, such as a
specific software or website. This narrow focus constrains their applicability
for general computer tasks. To this end, we introduce OS-Copilot, a framework
to build generalist agents capable of interfacing with comprehensive elements
in an operating system (OS), including the web, code terminals, files,
multimedia, and various third-party applications. We use OS-Copilot to create
FRIDAY, a self-improving embodied agent for automating general computer tasks.
On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods
by 35%, showcasing strong generalization to unseen applications via accumulated
skills from previous tasks. We also present numerical and quantitative evidence
that FRIDAY learns to control and self-improve on Excel and Powerpoint with
minimal supervision. Our OS-Copilot framework and empirical findings provide
infrastructure and insights for future research toward more capable and
general-purpose computer agents.
Related papers
- AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant [26.571908014508214]
AgentStore is a scalable platform designed to dynamically integrate heterogeneous agents for automating computer tasks.
We propose a novel core textbfMetaAgent with the textbfAgentToken strategy to efficiently manage diverse agents.
Experiments on three challenging benchmarks demonstrate that AgentStore surpasses the limitations of previous systems with narrow capabilities.
arXiv Detail & Related papers (2024-10-24T09:58:40Z) - Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI)
Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z) - OpenHands: An Open Platform for AI Software Developers as Generalist Agents [109.8507367518992]
We introduce OpenHands, a platform for the development of AI agents that interact with the world in similar ways to a human developer.
We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, and incorporation of evaluation benchmarks.
arXiv Detail & Related papers (2024-07-23T17:50:43Z) - Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence [79.5316642687565]
Existing multi-agent frameworks often struggle with integrating diverse capable third-party agents.
We propose the Internet of Agents (IoA), a novel framework that addresses these limitations.
IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control.
arXiv Detail & Related papers (2024-07-09T17:33:24Z) - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that facilitates LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z) - MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot [22.03327808115817]
We propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot) to enhance interaction ability with operating systems.
The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on their specific domain knowledge.
MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8% over existing leading systems.
arXiv Detail & Related papers (2024-04-28T05:33:15Z) - OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z) - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z) - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web [43.60736044871539]
We introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate programs.
The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet"
Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks.
arXiv Detail & Related papers (2024-02-27T14:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.