Fugu-MT Paper Translation (Abstract): MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Paper overview: MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

arxiv url: http://arxiv.org/abs/2603.09652v1
Date: Tue, 10 Mar 2026 13:30:03 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.335277
Title: MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
Title (reference translation): MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Based Assistants
Authors: Zuhao Zhang, Chengyue Yu, Yuante Li, Chenyi Zhuang, Linjian Mo, Shuai Li
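Abstract summary: MiniApps are dynamic, interactive HTML-based applications that adhere to real-world principles. Existing benchmarks focus mainly on algorithmic correctness or static layout reconstruction. The paper introduces MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. It also proposes MiniAppEval, an agentic evaluation framework.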
Reference score (independently computed attention score): 15.81416663487443
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our code is available at github.com/MiniAppBench.
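Abstract (reference translation): With the rapid progress of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications. These applications must not only render visual interfaces but also build customized interaction logic that adheres to real-world principles. However, existing benchmarks focus mainly on algorithmic correctness or static layout reconstruction and do not capture the capabilities this new paradigm requires. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Derived from a real-world application with more than 10M generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to address the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing that systematically assesses applications across three dimensions: Intention, Static, and Dynamic. Our experiments show that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval shows high alignment with human judgment, establishing a reliable standard for future research. Our code is available at github.com/MiniAppBench.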

Related papers