Fugu-MT Paper Translation (Summary): VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

Paper Summary: VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

arxiv url: http://arxiv.org/abs/2605.26144v1
Date: Fri, 22 May 2026 20:29:12 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.195678
Title: VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents
Authors: JunJia Guo, Yuhang Yao, Jiawei Zhou, Jingdi Chen
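Abstract summary: VISTA is a benchmark for evaluating the end-to-end web application generation capabilities of LLM-based agents. It defines five prompt-information conditions that vary along two axes: visual/structural fidelity and stack constraint. Each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points.
Reference score (independently computed attention metric): 25.141059096863255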
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.
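The five prompt-information conditions listed in the abstract can be laid out as a small table in code. The following is a minimal sketch, not the benchmark's own data model: the names `Stack`, `Condition`, `CONDITIONS`, and `free_stack_conditions` are hypothetical and introduced only for illustration, and the fields simply restate what the abstract says each condition provides.

```python
from dataclasses import dataclass
from enum import Enum


class Stack(Enum):
    """Stack constraint axis, as worded in the abstract (names are hypothetical)."""
    FREE = "free stack choice"
    THREE_SPECIFIED = "three specified stacks"
    SINGLE_SPECIFIED = "single specified stack"


@dataclass(frozen=True)
class Condition:
    """One prompt-information condition (hypothetical illustration type)."""
    index: int
    text: bool          # textual specification provided
    screenshots: bool   # reference screenshots provided
    figma: bool         # pruned Figma structure provided
    stack: Stack


# The five conditions, transcribed from the abstract's enumeration.
CONDITIONS = [
    Condition(1, text=True, screenshots=False, figma=False, stack=Stack.FREE),
    Condition(2, text=True, screenshots=True, figma=False, stack=Stack.THREE_SPECIFIED),
    Condition(3, text=True, screenshots=True, figma=False, stack=Stack.FREE),
    Condition(4, text=True, screenshots=True, figma=True, stack=Stack.SINGLE_SPECIFIED),
    Condition(5, text=True, screenshots=True, figma=True, stack=Stack.FREE),
]


def free_stack_conditions():
    """Return the indices of conditions that leave the stack choice to the agent."""
    return [c.index for c in CONDITIONS if c.stack is Stack.FREE]


if __name__ == "__main__":
    for c in CONDITIONS:
        inputs = ["text"]
        if c.screenshots:
            inputs.append("screenshots")
        if c.figma:
            inputs.append("pruned Figma structure")
        print(f"({c.index}) {' + '.join(inputs)}, {c.stack.value}")
```

Laid out this way, the two axes are easy to read off: the visual/structural inputs grow from text only, to screenshots, to screenshots plus pruned Figma structure, while the stack constraint varies independently between free and specified.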
