Fugu-MT paper translation (abstract): WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

Paper abstract: WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

arxiv url: http://arxiv.org/abs/2510.18560v1
Date: Tue, 21 Oct 2025 12:16:04 GMT
Status: translation complete
Last updated in system: 2025-10-25 03:08:13.462838
Title: WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Authors: Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Yangqiu Song, Lihui Chen, Han Hu
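Abstract summary: We introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics. In-depth analysis indicates that the gap between LLM judges and human experts stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias.
Reference score (independently computed attention score): 62.43165871914528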
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios. Code and data are available at https://github.com/lcy2723/WebDevJudge.

Related papers