RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
- URL: http://arxiv.org/abs/2504.10445v1
- Date: Mon, 14 Apr 2025 17:36:46 GMT
- Title: RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
- Authors: Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu
- Abstract summary: RealWebAssist is a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements.
- Score: 8.044364097415007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
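To make the evaluation setup concrete, the sketch below illustrates one way a sequential instruction-following episode could be scored: the agent receives one instruction at a time, sees the current GUI state, and must ground the instruction to an action, with earlier turns available as context for resolving ambiguous or user-specific references. All class and method names here (Step, InteractionHistory, the agent/env interfaces) are hypothetical illustrations, not the actual RealWebAssist API.

```python
# Minimal sketch of a sequential instruction-following evaluation loop.
# The episode format, WebEnv/Agent interfaces, and scoring rule are assumptions
# for illustration only; they do not describe the benchmark's real implementation.
from dataclasses import dataclass, field


@dataclass
class Step:
    instruction: str      # raw user instruction (possibly ambiguous)
    target_element: str   # ground-truth GUI element identifier
    target_action: str    # e.g., "click", "type"


@dataclass
class InteractionHistory:
    """Running record the agent can use to track user intent and routines."""
    turns: list = field(default_factory=list)

    def add(self, instruction: str, action: dict) -> None:
        self.turns.append({"instruction": instruction, "action": action})


def evaluate_episode(agent, env, steps: list[Step]) -> float:
    """Replay one user's instruction sequence and return step-level accuracy.

    Instructions are shown one at a time; earlier turns remain available so the
    agent can resolve references like "the same site as before".
    """
    history = InteractionHistory()
    correct = 0
    for step in steps:
        screenshot = env.observe()                      # current GUI state
        action = agent.act(step.instruction, screenshot, history)
        ok = (action.get("element") == step.target_element
              and action.get("type") == step.target_action)
        correct += int(ok)
        env.execute(action)                             # advance the environment
        history.add(step.instruction, action)
    return correct / max(len(steps), 1)
```

A step counts as correct here only if both the grounded GUI element and the action type match the annotation; a real evaluation could use a looser or stricter criterion depending on how ambiguous instructions are annotated.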
Related papers
- Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence [109.32705135051486]
Embodied Web Agents is a novel paradigm for AI agents that fluidly bridges embodiment and web-scale reasoning. We release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks. Results reveal significant performance gaps between state-of-the-art AI systems and human capabilities.
arXiv Detail & Related papers (2025-06-18T17:58:17Z) - AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z) - PAFFA: Premeditated Actions For Fast Agents [19.576180667174366]
We introduce PAFFA, a method that makes LLMs faster and more accurate in completing tasks on the internet using a novel inference-time technique.
PAFFA drastically reduces inference-time tokens by 87% while maintaining robust performance.
Unravel's ability to update its action library based on explorations allows generalization and adaptation to unseen websites.
arXiv Detail & Related papers (2024-12-10T22:51:31Z) - Large Language Models Empowered Personalized Web Agents [54.944908837494374]
Web agents have evolved from traditional agents to Large Language Model (LLM)-based Web agents.
We first formulate the task of LLM-empowered personalized Web agents, which integrate personalized data and user instructions.
We propose a Personalized User Memory-enhanced Alignment (PUMA) framework to adapt LLMs to the personalized Web agent task.
arXiv Detail & Related papers (2024-10-22T17:54:45Z) - DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z) - MMInA: Benchmarking Multihop Multimodal Internet Agents [36.173995299002]
We present MMInA, a multihop and multimodal benchmark to evaluate embodied agents on compositional Internet tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks.
arXiv Detail & Related papers (2024-04-15T17:59:50Z) - User Modeling Challenges in Interactive AI Assistant Systems [3.1204913702660475]
Interactive Artificial Intelligence (AI) assistant systems are designed to offer timely guidance that helps human users complete a variety of tasks.
One of the remaining challenges is to understand users' mental states during the task for more personalized guidance.
In this work, we analyze users' mental states during task executions and investigate the capabilities and challenges for large language models to interpret user profiles for more personalized user guidance.
arXiv Detail & Related papers (2024-03-29T11:54:13Z) - On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both users and the environment.
We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z) - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z) - WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z) - Task Relation-aware Continual User Representation Learning [26.514449669395297]
Previous efforts in user modeling mainly focus on learning a task-specific user representation that is designed for a single task.
Recent studies introduce the concept of universal user representation, which is a more generalized representation of a user relevant to a variety of tasks.
Despite their effectiveness, existing approaches for learning universal user representations are impractical in real-world applications.
We propose a novel continual user representation learning method, called TERACON, whose learning capability does not degrade as the number of learned tasks increases.
arXiv Detail & Related papers (2023-06-01T08:10:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.