Related papers: TimeWarp: Evaluating Web Agents by Revisiting the Past

URL: http://arxiv.org/abs/2603.04949v1
Date: Thu, 05 Mar 2026 08:43:06 GMT
Title: TimeWarp: Evaluating Web Agents by Revisiting the Past
Authors: Md Farhan Ishmam, Kenneth Marino
Abstract summary: We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. We propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions.
Score: 7.017865728670461
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: 20.4% → 37.7% for Qwen-3 4B and 0% → 27.0% for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.

Related papers

OpAgent: Operator Agent for Web Navigation [23.928869500029432]
We develop an online interaction environment and fine-tune the Vision-Language Model (VLM) using a specialized RL pipeline. We introduce a Hybrid Reward Mechanism that combines a ground-truth-agnostic WebJudge for holistic outcome assessment and a Rule-based Decision Tree (RDT) for progress reward. Notably, our RL-enhanced model achieves a 38.1% success rate (pass@5) on WebArena, outperforming all existing monolithic baselines.
arXiv Detail & Related papers (2026-02-14T02:33:55Z)
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks [35.99528846296261]
WebGym is the largest-to-date open-source environment for training realistic visual web agents. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites.
arXiv Detail & Related papers (2026-01-05T09:35:11Z)
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents [52.81924177620322]
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks.
arXiv Detail & Related papers (2025-12-29T01:09:10Z)
WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance [29.57207599604568]
WebCoach is a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory. WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents.
arXiv Detail & Related papers (2025-11-17T05:38:50Z)
BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks [51.803138848305814]
We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks. We identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. Our findings surface both the diversity and brittleness of current web agents.
arXiv Detail & Related papers (2025-10-02T15:22:21Z)
WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning [51.14454312533818]
WebGen-Agent is a novel website-generation agent that leverages comprehensive and multi-level visual feedback. We introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system.
arXiv Detail & Related papers (2025-09-26T17:59:51Z)
WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback [78.55946306325914]
We identify key reasoning skills essential for effective web agents. We reconstruct the agent's reasoning algorithms into chain-of-thought rationales. Our approach yields significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2025-05-26T14:03:37Z)
WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms [52.942566473658054]
We enhance web agents with an explicit rollback mechanism, enabling the agent to revert back to a previous state in its navigation trajectory. This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method.
arXiv Detail & Related papers (2025-04-16T05:41:20Z)
R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z)
MMInA: Benchmarking Multihop Multimodal Internet Agents [36.173995299002]
We present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks.
arXiv Detail & Related papers (2024-04-15T17:59:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.