Fugu-MT Paper Translation (Abstract): SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications

Paper summary: SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications

arxiv url: http://arxiv.org/abs/2605.09610v1
Date: Sun, 10 May 2026 15:47:46 GMT
Status: translation complete
System update date: 2026-05-12 23:28:50.330986
Title: SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications
Authors: Abhinav Goel, Agostino Capponi, Alfio Gliozzo, Chaitya Shah,
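Abstract summary: We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs). SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset. To validate the benchmark's reliability, we conduct three independent empirical studies.
Reference score (attention metric computed by the site): 5.027278762864141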
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark's reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and external security analysis via the Slither static analyzer confirming 79.4% agreement between the LLM auditor and a non-LLM rule-based tool. Systematic analysis of 9,000 generated contracts reveals characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, and complexity-driven degradation) and quantifies a +8.29 composite-score advantage of generated contracts over ground-truth implementations, attributable to LLMs' literal specification-following behavior. SmartEval establishes a reproducible, validated foundation for empirical research on LLM smart contract synthesis quality, with all data, evaluation code, and generated contracts publicly released.


Related papers