ScreenAgent: A Vision Language Model-driven Computer Control Agent
- URL: http://arxiv.org/abs/2402.07945v1
- Date: Fri, 9 Feb 2024 02:33:45 GMT
- Title: ScreenAgent: A Vision Language Model-driven Computer Control Agent
- Authors: Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng,
He Kong, Yi Chang, Qi Wang
- Abstract summary: We build an environment for a Vision Language Model (VLM) agent to interact with a real computer screen.
Within this environment, the agent can observe screenshots and manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard actions.
We construct the ScreenAgent dataset, which collects screenshots and action sequences when completing a variety of daily computer tasks.
- Score: 17.11085071288194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Large Language Models (LLMs) can invoke a variety of tools and APIs
to complete complex tasks. The computer, as the most powerful and universal
tool, could potentially be controlled directly by a trained LLM agent. With
control of a computer, we can hopefully build a more generalized agent to
assist humans with a variety of daily digital tasks. In this paper, we construct
an environment for a Vision Language Model (VLM) agent to interact with a real
computer screen. Within this environment, the agent can observe screenshots and
manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard
actions. We also design an automated control pipeline that includes planning,
acting, and reflecting phases, guiding the agent to continuously interact with
the environment and complete multi-step tasks. Additionally, we construct the
ScreenAgent Dataset, which collects screenshots and action sequences from a
variety of daily computer tasks. Finally, we train a model, ScreenAgent, which
achieves computer control capabilities comparable to GPT-4V and demonstrates
more precise UI positioning. Our attempts could inspire further research on
building a generalist LLM agent. The code is available at
https://github.com/niuzaisheng/ScreenAgent.
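The control pipeline described in the abstract (observe a screenshot, plan, act through mouse and keyboard events, then reflect on the result) can be pictured as a simple loop. The sketch below is only an illustration under assumed interfaces: the helpers `capture_screenshot`, `query_vlm`, and `execute_action`, the `env` object, and the JSON action format are hypothetical stand-ins, not the authors' actual code (see the linked repository for that).

```python
# Minimal sketch of a planning-acting-reflecting loop for a screenshot-driven
# GUI agent. All helper names, the environment interface, and the JSON action
# format are assumptions for illustration, not the paper's API.
import json


def capture_screenshot(env):
    """Hypothetical: grab the current screen image from the environment."""
    return env.screenshot()


def query_vlm(prompt: str, image) -> str:
    """Hypothetical: send a text prompt plus a screenshot to the VLM, return its reply."""
    raise NotImplementedError


def execute_action(env, action: dict) -> None:
    """Hypothetical: replay a mouse/keyboard action such as {"type": "click", "x": 120, "y": 340}."""
    env.send(action)


def run_task(env, task: str, max_steps: int = 20) -> bool:
    # Planning phase: break the task into sub-tasks once, up front.
    plan = query_vlm(f"Plan sub-tasks for: {task}", capture_screenshot(env))
    for _ in range(max_steps):
        # Acting phase: ask for the next low-level action given the current screen.
        screen = capture_screenshot(env)
        action = json.loads(query_vlm(
            f"Task: {task}\nPlan: {plan}\nGive the next mouse/keyboard action as JSON.",
            screen,
        ))
        execute_action(env, action)
        # Reflecting phase: judge the outcome from the new screenshot.
        verdict = query_vlm(
            f"Did the last action succeed? Is the task '{task}' finished? "
            "Answer one of: continue / retry / done.",
            capture_screenshot(env),
        )
        if "done" in verdict.lower():
            return True
    return False
```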
Related papers
- PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent.
From the perception perspective, we devise an Active Perception Module (APM) to overcome the limited ability of current MLLMs to perceive screenshot content.
From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We build a self-contained environment with data that mimics a small software company.
We find that with the most competitive agent, 24% of the tasks can be completed autonomously.
This paints a nuanced picture of task automation with LM agents.
arXiv Detail & Related papers (2024-12-18T18:55:40Z)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI).
Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
- ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents [0.0]
ClickAgent is a novel framework for building autonomous agents.
In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model identifies the relevant UI elements on the screen.
Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
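As a rough picture of the division of labor described above (the MLLM reasons about what to do while a separate UI-location model decides where on the screen), the two components might be structured as follows; the class and method names are invented for illustration and are not ClickAgent's actual interface.

```python
# Illustrative split between an MLLM planner and a separate UI-grounding model,
# in the spirit of ClickAgent. All names below are hypothetical.
from dataclasses import dataclass


@dataclass
class Click:
    x: int
    y: int


class Planner:
    """MLLM-backed reasoner: decides what to interact with next."""

    def next_target(self, task: str, screenshot) -> str:
        # e.g. returns "the blue 'Send' button in the bottom-right corner"
        raise NotImplementedError


class Grounder:
    """Separate UI-location model: maps that description to screen coordinates."""

    def locate(self, description: str, screenshot) -> Click:
        raise NotImplementedError


def step(task: str, screenshot, planner: Planner, grounder: Grounder) -> Click:
    target = planner.next_target(task, screenshot)  # reasoning / action planning
    return grounder.locate(target, screenshot)      # pixel-level UI grounding
```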
arXiv Detail & Related papers (2024-10-09T14:49:02Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned on a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images.
By leveraging the reasoning capability of Large Language Models, we eliminate the need for large-scale human demonstration data.
The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z)
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation for building autonomous agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z)
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web [43.60736044871539]
We introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent's capability to generate programs.
The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet".
Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks.
arXiv Detail & Related papers (2024-02-27T14:47:53Z)
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
- A Zero-Shot Language Agent for Computer Control with Structured Reflection [19.526676887048662]
Large language models (LLMs) have shown increasing capacity at planning and executing a high-level goal in a live computer environment.
To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting.
We approach this problem with a zero-shot agent that requires no given expert traces.
arXiv Detail & Related papers (2023-10-12T21:53:37Z)
- Agents: An Open-source Framework for Autonomous Language Agents [98.91085725608917]
We consider language agents as a promising direction towards artificial general intelligence.
We release Agents, an open-source library with the goal of opening up these advances to a wider non-specialist audience.
arXiv Detail & Related papers (2023-09-14T17:18:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.