Fugu-MT Paper Translation (Abstract): WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

Paper abstract: WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

arxiv url: http://arxiv.org/abs/2603.25226v1
Date: Thu, 26 Mar 2026 09:27:29 GMT
Status: Translation complete
System update date: 2026-03-27 20:52:48.213984
Title: WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
Authors: Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai
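Abstract summary: We introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester. The results reveal a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands.
Reference score (independently computed attention score): 57.7131457251794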
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
