WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
- URL: http://arxiv.org/abs/2407.05291v1
- Date: Sun, 7 Jul 2024 07:15:49 GMT
- Title: WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
- Authors: Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin,
- Abstract summary: Large language models (LLMs) can mimic human-like intelligence.
WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
- Score: 85.95607119635102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents hold the promise of a high impact. To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges for such models to serve as useful assistants in the workplace. In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress toward capable autonomous agents. The benchmark can be found at https://github.com/ServiceNow/WorkArena/tree/workarena-plus-plus.
Related papers
- MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics? [33.573056018368504]
This study introduces the first benchmark for evaluating Multimodal LLM for Robotic (MMRo) benchmark.
We identify four essential capabilities perception, task planning, visual reasoning, and safety measurement that MLLMs must possess to qualify as the robot's central processing unit.
Our findings indicate that no single model excels in all areas, suggesting that current MLLMs are not yet trustworthy enough to serve as the cognitive core for robots.
arXiv Detail & Related papers (2024-06-28T07:09:06Z) - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z) - Large Language Model based Multi-Agents: A Survey of Progress and Challenges [44.92286030322281]
Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks.
Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation.
arXiv Detail & Related papers (2024-01-21T23:36:14Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI)
We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents.
We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z) - TPTU: Large Language Model-based AI Agents for Task Planning and Tool
Usage [28.554981886052953]
Large Language Models (LLMs) have emerged as powerful tools for various real-world applications.
Despite their prowess, intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks.
This paper proposes a structured framework tailored for LLM-based AI Agents.
arXiv Detail & Related papers (2023-08-07T09:22:03Z) - Brain in a Vat: On Missing Pieces Towards Artificial General
Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z) - Robots That Ask For Help: Uncertainty Alignment for Large Language Model
Planners [85.03486419424647]
KnowNo is a framework for measuring and aligning the uncertainty of large language models.
KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion.
arXiv Detail & Related papers (2023-07-04T21:25:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.