Related papers: WALT: Web Agents that Learn Tools

URL: http://arxiv.org/abs/2510.01524v1
Date: Wed, 01 Oct 2025 23:41:47 GMT
Title: WALT: Web Agents that Learn Tools
Authors: Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu
Abstract summary: WALT is a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning.
Score: 66.73502484310121
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites -- spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
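
Illustrative sketch (not from the paper): the abstract's contrast between step-by-step UI interaction and a single call such as search(query) can be pictured roughly as below. Everything here is an assumption made for illustration: FakePage, the CSS selectors, and the TOOLS registry are hypothetical stand-ins, and this is not WALT's actual implementation or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page; it only records low-level actions."""
    actions: List[str] = field(default_factory=list)

    def click(self, selector: str) -> None:
        self.actions.append(f"click {selector}")

    def type(self, selector: str, text: str) -> None:
        self.actions.append(f"type {selector} {text!r}")


def search(page: FakePage, query: str) -> None:
    """Hypothetical tool: one high-level call wraps three low-level UI steps."""
    page.click("#search-box")        # selectors are made up for this sketch
    page.type("#search-box", query)
    page.click("#search-submit")


# Hypothetical registry an agent could choose from by tool name.
TOOLS: Dict[str, Callable[..., None]] = {"search": search}

page = FakePage()
TOOLS["search"](page, "red bicycle")  # the agent issues one tool call
print(page.actions)                   # the low-level steps the tool carried out
```

In this sketch the agent decides only which tool to call and with what argument; the click and type sequence lives inside the tool, which is the shift in burden the abstract describes.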

Related papers

Nested Browser-Use Learning for Agentic Information Seeking [60.775556172513014]
Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching. We propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure.
arXiv Detail & Related papers (2025-12-29T17:59:14Z)
DeepAgent: A General Reasoning Agent with Scalable Toolsets [111.6384541877723]
DeepAgent is an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution. To address the challenges of long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories. We develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens.
arXiv Detail & Related papers (2025-10-24T16:24:01Z)
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions [48.194688161526756]
BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities.
arXiv Detail & Related papers (2025-10-12T15:43:37Z)
Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree [8.511846002129522]
We show that adversaries can embed universal adversarial triggers in webpage HTML to hijack agent behavior. Our system demonstrates high success rates across real websites in both targeted and general attacks.
arXiv Detail & Related papers (2025-07-20T03:10:13Z)
WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms [52.942566473658054]
We enhance web agents with an explicit rollback mechanism, enabling the agent to revert to a previous state in its navigation trajectory. This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method.
arXiv Detail & Related papers (2025-04-16T05:41:20Z)
WebNav: An Intelligent Agent for Voice-Controlled Web Navigation [0.0]
WebNav is a novel agent for multi-modal web navigation. The system combines vision-based context from screenshots with a dynamic DOM-labeling browser extension.
arXiv Detail & Related papers (2025-03-18T02:33:27Z)
Steward: Natural Language Web Automation [19.301371856154965]
Large language models (LLMs) have demonstrated exceptional capabilities in serving as the foundation for AI assistants. We introduce Steward, a novel LLM-powered web automation tool designed to serve as a cost-effective, scalable, end-to-end solution for automating web interactions. We discuss various design and implementation challenges, including state representation, action sequence selection, system responsiveness, detecting task completion, and caching implementation.
arXiv Detail & Related papers (2024-09-23T18:06:32Z)
CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images. By leveraging the reasoning capability of large language models, we eliminate the need for large-scale human demonstration data. The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z)
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website. We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.