WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
- URL: http://arxiv.org/abs/2601.02439v2
- Date: Wed, 07 Jan 2026 11:21:44 GMT
- Title: WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
- Authors: Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead,
- Abstract summary: WebGym is the largest-to-date open-source environment for training realistic visual web agents. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we first speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym improves the success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking, which achieve 27.1% and 29.8%, respectively. This improvement is substantial because, unlike many prior works on training visual web agents, our test set consists only of tasks on websites never seen during training.
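The abstract's asynchronous rollout system decouples trajectory sampling from training so that slow web interactions never block the learner. The sketch below is a minimal, hypothetical illustration of that pattern using a worker pool over task and result queues; the environment step and rubric-based reward are simulated stand-ins (`run_episode` and its internals are assumptions, not the paper's implementation).

```python
import queue
import random
import threading

def run_episode(task_id, rng):
    """Simulated rollout: a real system would drive a live website and
    score the trajectory against the task's rubric."""
    trajectory = [f"action_{i}" for i in range(rng.randint(1, 5))]
    reward = 1.0 if rng.random() < 0.5 else 0.0  # stand-in task reward
    return trajectory, reward

def rollout_worker(tasks, results, rng):
    """Pull tasks and push finished rollouts; exits on a None sentinel."""
    while True:
        task_id = tasks.get()
        if task_id is None:  # poison pill: shut this worker down
            break
        results.put((task_id, *run_episode(task_id, rng)))

def collect_rollouts(task_ids, num_workers=4, seed=0):
    """Sample rollouts for all tasks concurrently across a worker pool."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(
            target=rollout_worker,
            args=(tasks, results, random.Random(seed + w)),
            daemon=True,
        )
        for w in range(num_workers)
    ]
    for w in workers:
        w.start()
    for t in task_ids:          # enqueue all real tasks first (FIFO)
        tasks.put(t)
    for _ in workers:           # then one sentinel per worker
        tasks.put(None)
    for w in workers:
        w.join()
    return [results.get() for _ in range(len(task_ids))]

if __name__ == "__main__":
    rollouts = collect_rollouts(list(range(16)))
    print(len(rollouts))
```

Because each worker blocks only on its own episode, adding workers hides per-episode latency; an RL trainer would consume `results` as they arrive rather than after `join`, which is where the asynchrony pays off.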
Related papers
- TimeWarp: Evaluating Web Agents by Revisiting the Past [7.017865728670461]
We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. We propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions.
arXiv Detail & Related papers (2026-03-05T08:43:06Z)
- It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents [52.81924177620322]
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks.
arXiv Detail & Related papers (2025-12-29T01:09:10Z)
- WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning [36.47273215142354]
WebAgent-R1 is an end-to-end multi-turn reinforcement learning framework for training web agents. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9%. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks.
arXiv Detail & Related papers (2025-05-22T09:07:43Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space. AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Agent Workflow Memory [71.81385627556398]
We introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines.
AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate.
Online AWM robustly generalizes in cross-task, website, and domain evaluations.
arXiv Detail & Related papers (2024-09-11T17:21:00Z)
- WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
- Learning Synthetic Environments for Reinforcement Learning with Evolution Strategies [34.13101380723782]
This work explores learning agent-agnostic synthetic environments (SEs) for Reinforcement Learning.
SEs act as a proxy for target environments and allow agents to be trained more efficiently than when directly trained on the target environment.
We show that our method is capable of learning SEs for two discrete-action-space tasks that allow us to train agents more robustly and with up to 60% fewer steps.
arXiv Detail & Related papers (2021-01-24T14:16:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.