Fugu-MT Paper Translation (Abstract): ScratchEval : A Multimodal Evaluation Framework for LLMs in Block-Based Programming

Paper abstract: ScratchEval : A Multimodal Evaluation Framework for LLMs in Block-Based Programming

arxiv url: http://arxiv.org/abs/2602.00757v1
Date: Sat, 31 Jan 2026 14:44:22 GMT
Status: Translation complete
Last updated in system: 2026-02-10 14:44:11.864003
Title: ScratchEval : A Multimodal Evaluation Framework for LLMs in Block-Based Programming
Authors: Yuan Si, Simeng Han, Daming Li, Hanyuan Shi, Jialu Zhang
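Abstract summary: Scratch programs exhibit deeply nested, non-linear structures, event-driven concurrency across multiple sprites, and tight coupling between code and multimedia assets. ScratchEval is the first executable benchmark designed to evaluate LLM-based repair of Scratch programs. The benchmark is constructed through a human-in-the-loop pipeline that combines automated project mining with expert validation of trigger-outcome semantics.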
Reference score (independently computed attention measure): 3.935975887408409
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: LLMs have achieved strong performance on text-based programming tasks, yet they remain unreliable for block-based languages such as Scratch. Scratch programs exhibit deeply nested, non-linear structures, event-driven concurrency across multiple sprites, and tight coupling between code and multimedia assets, properties that differ fundamentally from textual code. As a result, LLMs often misinterpret Scratch semantics and generate large, invasive edits that are syntactically valid but semantically incorrect when repairing buggy programs. We introduce ScratchEval, the first executable benchmark designed to evaluate LLM-based repair for Scratch programs, covering program understanding, debugging, analysis, and repair. The benchmark contains 100 curated Scratch projects from the public repository, selected for structural and semantic complexity. Each project is paired with executable test suites, bug descriptions with corresponding fixes, block-level edit constraints defining minimal semantically correct repairs, and required multimedia assets. The benchmark is constructed through a human-in-the-loop pipeline combining automated project mining with expert validation of trigger-outcome semantics and representative bug patterns, with emphasis on event ordering, concurrency, and state management. To enable rigorous and reproducible evaluation, we propose a three-layer executable protocol measuring functional correctness via VM-level execution, repair quality using block-level edit distance and behavioral trajectory comparisons, and explanation quality via structured rubrics assessing alignment between model reasoning and generated patches. Using ScratchEval, we study domain-specific fine-tuning, training data effectiveness, and model generalization to unseen bug types. ScratchEval provides a reproducible foundation for evaluating and post-training LLMs on block-based programming tasks.
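The abstract names block-level edit distance as one measure of repair quality but does not define it. The following is a minimal sketch of one plausible reading, not the authors' metric: it assumes each script is stored the way Scratch 3 `project.json` files store blocks (a map from block id to an entry with `opcode` and `next` fields), flattens a script into its opcode sequence, and computes Levenshtein distance over those sequences. The helper names `chain`, `linearize`, and `edit_distance` are hypothetical, and nested inputs, fields, and substacks are ignored for brevity.

```python
from typing import Dict, List, Optional

Block = Dict[str, Optional[str]]


def chain(opcodes: List[str]) -> Dict[str, Block]:
    """Hypothetical helper: build a straight-line script as an id -> block map."""
    blocks: Dict[str, Block] = {}
    for i, opcode in enumerate(opcodes):
        blocks[f"b{i}"] = {
            "opcode": opcode,
            "next": f"b{i + 1}" if i + 1 < len(opcodes) else None,
            "parent": f"b{i - 1}" if i > 0 else None,
        }
    return blocks


def linearize(blocks: Dict[str, Block], top_id: str) -> List[str]:
    """Follow `next` pointers from a top-level block and collect opcodes."""
    opcodes: List[str] = []
    current: Optional[str] = top_id
    while current is not None:
        block = blocks[current]
        opcodes.append(str(block["opcode"]))
        current = block.get("next")
    return opcodes


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two opcode sequences (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a block
                curr[j - 1] + 1,         # insert a block
                prev[j - 1] + (x != y),  # replace a block, or keep it
            ))
        prev = curr
    return prev[-1]


# Illustrative scripts (made up for this sketch, not taken from the benchmark).
buggy = chain(["event_whenflagclicked", "data_setvariableto", "motion_movesteps"])
minimal = chain(["event_whenflagclicked", "data_changevariableby", "motion_movesteps"])
invasive = chain(["event_whenflagclicked", "data_changevariableby", "looks_say", "control_wait"])

source = linearize(buggy, "b0")
print(edit_distance(source, linearize(minimal, "b0")))   # 1
print(edit_distance(source, linearize(invasive, "b0")))  # 3
```

Under this reading, a patch that swaps a single block scores lower than one that rewrites most of the script, which matches the abstract's concern about large, invasive edits. A tree edit distance over the nested block structure would be a closer fit for deeply nested scripts.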
