Fugu-MT paper translation (abstract): Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Paper abstract: Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

arxiv url: http://arxiv.org/abs/2605.30000v2
Date: Sun, 31 May 2026 12:00:02 GMT
Status: translation complete
System update date: 2026-06-02 18:24:16.825988
Title: Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Title (reference translation): Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Authors: Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu
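Abstract summary: \dataname is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks. \framename is grounded in Flavell's metacognitive monitoring and separates evidence accumulation from judgment across three stages.
Reference score (independently computed attention score): 24.920344869492066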
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. \dataname is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. \framename, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On \dataname, \framename aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/
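Abstract (reference translation): Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed is costly because human-judged leaderboards such as Arena do not scale. Existing automated proxies usually rely on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis that a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. \dataname is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. \framename, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On \dataname, \framename aligns closely with expert human ratings while revealing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/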
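
The three-stage separation of evidence accumulation from judgment described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (Evidence, Verdict, static_perception, agent_driven_interaction, dynamic_scoring, toy_judge) is hypothetical, the 0..1 score scale is an assumption, and the judge is a trivial placeholder. The only property it aims to show is that scoring refuses to run until the evidence chain is marked complete.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """Evidence chain that is filled in before any verdict is issued."""
    first_impression: str = ""
    step_logs: List[str] = field(default_factory=list)
    complete: bool = False


@dataclass
class Verdict:
    functionality: float  # assumed 0..1 scale, not from the paper
    aesthetics: float     # assumed 0..1 scale, not from the paper
    failures: List[str]   # structured failure attribution


def static_perception(page_source: str, evidence: Evidence) -> None:
    """Stage 1: record a first impression from passive observation only."""
    evidence.first_impression = f"{len(page_source)} characters of markup observed"


def agent_driven_interaction(
    actions: List[Callable[[], str]], evidence: Evidence
) -> None:
    """Stage 2: run each hypothetical action and log what happened."""
    for action in actions:
        evidence.step_logs.append(action())
    evidence.complete = True


def dynamic_scoring(
    evidence: Evidence, judge: Callable[[Evidence], Verdict]
) -> Verdict:
    """Stage 3: judge only once the evidence chain is complete."""
    if not evidence.complete:
        raise RuntimeError("evidence chain is incomplete; scoring is deferred")
    return judge(evidence)


def toy_judge(evidence: Evidence) -> Verdict:
    """Placeholder judge: treats logged steps starting with 'FAIL' as failures."""
    failures = [log for log in evidence.step_logs if log.startswith("FAIL")]
    total = max(len(evidence.step_logs), 1)
    return Verdict(
        functionality=1 - len(failures) / total,
        aesthetics=0.5,  # fixed placeholder value
        failures=failures,
    )


evidence = Evidence()
static_perception("<button id='add'>Add</button>", evidence)
agent_driven_interaction(
    [lambda: "OK: clicked #add", lambda: "FAIL: counter did not update"],
    evidence,
)
verdict = dynamic_scoring(evidence, toy_judge)
print(verdict)
```

In a real evaluator the actions would drive a browser and the judge would be a model reading video, audio, and screenshots; the sketch keeps both as plain callables so that the ordering constraint stays visible.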

Related papers