Related papers: AgentStudio: A Toolkit for Building General Virtual Agents

AgentStudio: A Toolkit for Building General Virtual Agents

URL: http://arxiv.org/abs/2403.17918v1
Date: Tue, 26 Mar 2024 17:54:15 GMT
Title: AgentStudio: A Toolkit for Building General Virtual Agents
Authors: Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan,
Abstract summary: We introduce AgentStudio, an online, realistic, and multimodal toolkit that covers the entire lifecycle of agent development. This includes environment setups, data collection, agent evaluation, and visualization. We have open-sourced the environments, datasets, benchmarks, and interfaces to promote research towards developing general virtual agents.
Score: 57.02375267926862
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Creating autonomous virtual agents capable of using arbitrary software on any digital device remains a major challenge for artificial intelligence. Two key obstacles hinder progress: insufficient infrastructure for building virtual agents in real-world environments, and the need for in-the-wild evaluation of fundamental agent abilities. To address this, we introduce AgentStudio, an online, realistic, and multimodal toolkit that covers the entire lifecycle of agent development. This includes environment setups, data collection, agent evaluation, and visualization. The observation and action spaces are highly generic, supporting both function calling and human-computer interfaces. This versatility is further enhanced by AgentStudio's graphical user interfaces, which allow efficient development of datasets and benchmarks in real-world settings. To illustrate, we introduce a visual grounding dataset and a real-world benchmark suite, both created with our graphical interfaces. Furthermore, we present several actionable insights derived from AgentStudio, e.g., general visual grounding, open-ended tool creation, learning from videos, etc. We have open-sourced the environments, datasets, benchmarks, and interfaces to promote research towards developing general virtual agents for the future.

Related papers

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.57043903478257]
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations.<n>With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality.<n>This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development.
arXiv Detail & Related papers (2025-08-06T14:33:45Z)
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.35544552383581]
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, iOS, Android, and Web platforms.<n>It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents.
arXiv Detail & Related papers (2025-07-25T17:59:26Z)
What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity.<n>We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities.<n>Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [54.52092001110694]
ThinkGeo is a benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning.<n>Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications.<n>Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z)
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction [16.731754927372585]
We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents. Unlike online benchmarks, UI-Vision provides dense, high-quality annotations of human demonstrations. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B.
arXiv Detail & Related papers (2025-03-19T19:26:17Z)
Programming with Pixels: Computer-Use Meets Software Engineering [24.00640679767529]
General-purpose computer-use agents can approach or even surpass specialized tool-based agents on a variety of SWE tasks without the need for hand-engineered tools. Our results establish PwP as a scalable testbed for building and evaluating the next wave of software engineering agents.
arXiv Detail & Related papers (2025-02-24T18:41:33Z)
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents. Our approach leverages image-based observations, and grounding instructions in natural language to visual elements. To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI) Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems [31.113305753414913]
AUTOGEN STUDIO is a no-code developer tool for rapidly prototyping multi-agent systems. It provides an intuitive drag-and-drop UI for agent specification, interactive evaluation, and a gallery of reusable agent components.
arXiv Detail & Related papers (2024-08-09T03:27:37Z)
GTA: A Benchmark for General Tool Agents [32.443456248222695]
We design 229 real-world tasks and executable tool chains to evaluate mainstream large language models (LLMs) Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents.
arXiv Detail & Related papers (2024-07-11T17:50:09Z)
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels. AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
WebCanvas: Benchmarking Web Agents in Online Environments [29.278363444725628]
WebCanvas is an innovative online evaluation framework for web agents. We open-source an agent framework with modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set.
arXiv Detail & Related papers (2024-06-18T07:58:33Z)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks. We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z)
WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible. We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains. We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
Robust Object Detection via Instance-Level Temporal Cycle Confusion [89.1027433760578]
We study the effectiveness of auxiliary self-supervised tasks to improve the out-of-distribution generalization of object detectors. Inspired by the principle of maximum entropy, we introduce a novel self-supervised task, instance-level temporal cycle confusion (CycConf) For each object, the task is to find the most different object proposals in the adjacent frame in a video and then cycle back to itself for self-supervision.
arXiv Detail & Related papers (2021-04-16T21:35:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.