OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- URL: http://arxiv.org/abs/2402.07456v2
- Date: Thu, 15 Feb 2024 09:30:48 GMT
- Title: OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- Authors: Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu,
Shunyu Yao, Tao Yu and Lingpeng Kong
- Abstract summary: We introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS).
We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks.
On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks.
- Score: 48.29860831901484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous interaction with the computer has been a longstanding challenge
with great potential, and the recent proliferation of large language models
(LLMs) has markedly accelerated progress in building digital agents. However,
most of these agents are designed to interact with a narrow domain, such as a
specific piece of software or a website. This narrow focus constrains their applicability
for general computer tasks. To this end, we introduce OS-Copilot, a framework
to build generalist agents capable of interfacing with comprehensive elements
in an operating system (OS), including the web, code terminals, files,
multimedia, and various third-party applications. We use OS-Copilot to create
FRIDAY, a self-improving embodied agent for automating general computer tasks.
On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods
by 35%, showcasing strong generalization to unseen applications via accumulated
skills from previous tasks. We also present numerical and quantitative evidence
that FRIDAY learns to control and self-improve on Excel and Powerpoint with
minimal supervision. Our OS-Copilot framework and empirical findings provide
infrastructure and insights for future research toward more capable and
general-purpose computer agents.
Related papers
- A3: Android Agent Arena for Mobile GUI Agents [46.73085454978007]
Mobile GUI agents are designed to autonomously perform tasks on mobile devices.
Android Agent Arena (A3) is a novel evaluation platform for assessing performance on real-world, in-the-wild tasks.
A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios.
arXiv Detail & Related papers (2025-01-02T09:03:56Z)
- Fundamental Risks in the Current Deployment of General-Purpose AI Models: What Have We (Not) Learnt From Cybersecurity? [60.629883024152576]
Large Language Models (LLMs) have seen rapid deployment in a wide range of use cases.
Systems from OpenAI and Altera are just a few examples of increased autonomy, data access, and execution capabilities.
These methods come with a range of cybersecurity challenges.
arXiv Detail & Related papers (2024-12-19T14:44:41Z)
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We build a self-contained environment with data that mimics a small software company environment.
We find that with the most competitive agent, 24% of the tasks can be completed autonomously.
This paints a nuanced picture of task automation with LM agents.
arXiv Detail & Related papers (2024-12-18T18:55:40Z)
- AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant [26.571908014508214]
AgentStore is a scalable platform designed to dynamically integrate heterogeneous agents for automating computer tasks.
We propose a novel core MetaAgent with the AgentToken strategy to efficiently manage diverse agents.
Experiments on three challenging benchmarks demonstrate that AgentStore surpasses the limitations of previous systems with narrow capabilities.
arXiv Detail & Related papers (2024-10-24T09:58:40Z)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI).
Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
- Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence [79.5316642687565]
Existing multi-agent frameworks often struggle with integrating diverse, capable third-party agents.
We propose the Internet of Agents (IoA), a novel framework that addresses these limitations.
IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control.
arXiv Detail & Related papers (2024-07-09T17:33:24Z)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z) - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z) - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web [43.60736044871539]
We introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent's capability to generate programs.
The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet".
Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks.
arXiv Detail & Related papers (2024-02-27T14:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.