Related papers: A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

URL: http://arxiv.org/abs/2501.16150v2
Date: Wed, 04 Jun 2025 10:30:14 GMT
Title: A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
Authors: Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, Thilo Stadelmann
Abstract summary: Agents for computer use (ACUs) are an emerging class of systems capable of executing complex tasks on digital devices. Despite rapid progress, ACUs are not yet mature for everyday use.
Score: 4.904229981437243
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agents for computer use (ACUs) are an emerging class of systems capable of executing complex tasks on digital devices - such as desktops, mobile phones, and web platforms - given instructions in natural language. These agents can automate tasks by controlling software via low-level actions like mouse clicks and touchscreen gestures. However, despite rapid progress, ACUs are not yet mature for everyday use. In this survey, we investigate the state-of-the-art, trends, and research gaps in the development of practical ACUs. We provide a comprehensive review of the ACU landscape, introducing a unifying taxonomy spanning three dimensions: (I) the domain perspective, characterizing agent operating contexts; (II) the interaction perspective, describing observation modalities (e.g., screenshots, HTML) and action modalities (e.g., mouse, keyboard, code execution); and (III) the agent perspective, detailing how agents perceive, reason, and learn. We review 87 ACUs and 33 datasets across foundation model-based and classical approaches through this taxonomy. Our analysis identifies six major research gaps: insufficient generalization, inefficient learning, limited planning, low task complexity in benchmarks, non-standardized evaluation, and a disconnect between research and practical conditions. To address these gaps, we advocate for: (a) vision-based observations and low-level control to enhance generalization; (b) adaptive learning beyond static prompting; (c) effective planning and reasoning methods and models; (d) benchmarks that reflect real-world task complexity; (e) standardized evaluation based on task success; (f) aligning agent design with real-world deployment constraints. Together, our taxonomy and analysis establish a foundation for advancing ACU research toward general-purpose agents for robust and scalable computer use.

Related papers

From Perception to Action: An Interactive Benchmark for Vision Reasoning [51.11355591375073]
The Causal Hierarchy of Actions and Interactions (CHAIN) benchmark is designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions.
arXiv Detail & Related papers (2026-02-24T15:33:02Z)
Agentic Reasoning for Large Language Models [122.81018455095999]
Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. Large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, but struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction.
arXiv Detail & Related papers (2026-01-18T18:58:23Z)
Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems [9.388162021920206]
This survey presents the first comprehensive review of agentic AI in remote sensing. We introduce a unified taxonomy distinguishing between single-agent copilots and multi-agent systems. We review emerging benchmarks that move the evaluation from pixel-level accuracy to trajectory-aware reasoning correctness.
arXiv Detail & Related papers (2026-01-05T08:34:17Z)
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.57043903478257]
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development.
arXiv Detail & Related papers (2025-08-06T14:33:45Z)
Generic-to-Specific Reasoning and Learning for Scalable Ad Hoc Teamwork [10.462598319732187]
This paper advocates leveraging the complementary strengths of knowledge-based and data-driven methods for reasoning and learning for ad hoc teamwork. For any given goal, our architecture enables each ad hoc agent to determine its actions through non-monotonic logical reasoning. We experimentally evaluate our architecture's capabilities in VirtualHome, a realistic physics-based 3D simulation environment.
arXiv Detail & Related papers (2025-08-06T07:44:38Z)
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence [87.08051686357206]
Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck. This survey provides the first systematic and comprehensive review of self-evolving agents.
arXiv Detail & Related papers (2025-07-28T17:59:05Z)
OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth? [30.788287791669458]
OS-MAP is a benchmark for daily computer-using automation. It organizes its 416 realistic tasks across 15 applications along two key dimensions. It captures varying levels of required agent autonomy and generalization.
arXiv Detail & Related papers (2025-07-25T10:14:53Z)
Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories [18.129031749321058]
Large Language Model (LLM)-based agents are increasingly employed to automate complex software engineering tasks. Despite their widespread adoption, the internal decision-making processes of these agents remain largely unexplored. We present a large-scale empirical study of the thought-action-result trajectories of three state-of-the-art LLM-based agents.
arXiv Detail & Related papers (2025-06-23T16:34:52Z)
Deep Research Agents: A Systematic Examination And Roadmap [79.04813794804377]
Deep Research (DR) agents are designed to tackle complex, multi-turn informational research tasks. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute DR agents.
arXiv Detail & Related papers (2025-06-22T16:52:48Z)
Towards Pervasive Distributed Agentic Generative AI -- A State of The Art [0.0]
The rapid advancement of intelligent agents and Large Language Models (LLMs) is reshaping the pervasive computing field. This survey outlines the architectural components of LLM agents and examines their deployment and evaluation across various scenarios. It highlights state-of-the-art agent deployment strategies and applications, including local and distributed execution on resource-constrained devices.
arXiv Detail & Related papers (2025-06-16T10:15:06Z)
Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research [32.92036657863354]
Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks. However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison. We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible abstraction framework that addresses these challenges.
arXiv Detail & Related papers (2025-05-30T08:46:23Z)
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [54.52092001110694]
ThinkGeo is a benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z)
A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions [51.96890647837277]
Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence.
arXiv Detail & Related papers (2025-04-07T21:01:25Z)
A Survey on (M)LLM-Based GUI Agents [62.57899977018417]
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction. Recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. This survey identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control.
arXiv Detail & Related papers (2025-03-27T17:58:31Z)
Large Language Model Agent: A Survey on Methodology, Applications and Challenges [88.3032929492409]
Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time.
arXiv Detail & Related papers (2025-03-27T12:50:17Z)
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds [12.464941027105306]
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. Recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. We present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents.
arXiv Detail & Related papers (2025-01-12T04:17:39Z)
GUI Agents: A Survey [129.94551809688377]
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods.
arXiv Detail & Related papers (2024-12-18T04:48:28Z)
GUI Agents with Foundation Models: A Comprehensive Survey [52.991688542729385]
This survey consolidates recent research on (M)LLM-based GUI agents. We highlight key innovations in data, frameworks, and applications. We hope this paper will inspire further developments in the field of (M)LLM-based GUI agents.
arXiv Detail & Related papers (2024-11-07T17:28:10Z)
Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
AgentStudio: A Toolkit for Building General Virtual Agents [57.02375267926862]
General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces. It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos. Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation.
arXiv Detail & Related papers (2024-03-26T17:54:15Z)
Knowledge-based Reasoning and Learning under Partial Observability in Ad Hoc Teamwork [4.454557728745761]
This paper introduces an architecture that determines an ad hoc agent's behavior based on non-monotonic logical reasoning. It supports online selection, adaptation, and learning of the models that predict the other agents' behavior. We show that the performance of our architecture is comparable to or better than that of state-of-the-art data-driven baselines in both simple and complex scenarios.
arXiv Detail & Related papers (2023-06-01T15:21:27Z)
Toward a Reasoning and Learning Architecture for Ad Hoc Teamwork [4.454557728745761]
We present an architecture for ad hoc teamwork, which refers to collaboration in a team of agents without prior coordination. Our architecture combines the principles of knowledge-based and data-driven reasoning and learning. We use the benchmark simulated multiagent collaboration domain Fort Attack to demonstrate that our architecture supports adaptation to unforeseen changes.
arXiv Detail & Related papers (2022-08-24T13:57:33Z)
Realistic simulation of users for IT systems in cyber ranges [63.20765930558542]
We instrument each machine by means of an external agent to generate user activity. This agent combines both deterministic and deep-learning-based methods to adapt to different environments. We also propose conditional text generation models to facilitate the creation of conversations and documents.
arXiv Detail & Related papers (2021-11-23T10:53:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.