Fugu-MT Paper Translation (Abstract): WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Paper overview: WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

arxiv url: http://arxiv.org/abs/2604.18224v1
Date: Mon, 20 Apr 2026 13:09:38 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.885563
Title: WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
Title (reference translation): WebCompass: Toward Multimodal Web Coding Evaluation for Code Language Models
Authors: Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu
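Abstract summary: We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair). For evaluation, it adopts a checklist-guided LLM-as-a-Judge protocol for editing and repair, and proposes a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser.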
Reference score (independently computed attention metric): 40.87133775066985
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.
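Abstract (reference translation): Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, but existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, and leave visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. This paper introduces WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, the authors curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, they adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. Evaluating representative closed-source and open-source models, they observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.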
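The abstract does not say how checklist verdicts are turned into a score. As a purely illustrative sketch, assuming each checklist item receives a binary pass/fail verdict from the judge and an optional weight, one simple aggregation is the weighted pass rate. The names `ChecklistItem` and `checklist_score`, the weighting scheme, and the example items are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    """One judged requirement (hypothetical structure, not from the paper)."""
    description: str     # what the judge was asked to verify
    passed: bool         # assumed binary verdict returned by the judge
    weight: float = 1.0  # assumed relative importance of the item


def checklist_score(items: List[ChecklistItem]) -> float:
    """Return the weighted fraction of passed items, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0  # empty or zero-weight checklist: define the score as 0
    earned = sum(item.weight for item in items if item.passed)
    return earned / total


# Made-up verdicts for a single editing task.
example = [
    ChecklistItem("Button label updated as requested", True),
    ChecklistItem("Existing navigation still works", True),
    ChecklistItem("Layout unchanged outside the edited component", False),
    ChecklistItem("No console errors on load", True),
]
print(checklist_score(example))  # 0.75
```

A real protocol might use graded rather than binary verdicts, or report separate scores per dimension such as interactivity and aesthetics; the sketch only shows the general shape of checklist-based aggregation.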

Related papers