Fugu-MT 論文翻訳(概要): From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

論文の概要: From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

arxiv url: http://arxiv.org/abs/2604.22601v1
Date: Fri, 24 Apr 2026 14:28:10 GMT
ステータス: 翻訳完了
システム内更新日: 2026-04-27 15:36:26.503888
Title: From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
Title（参考訳）: 自然言語から検証コードへ:ダニーベース形式検証によるAIによる問題解決に向けて
Authors: Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman,
Abstract要約: 大規模な言語モデルは、自動化されたソフトウェア工学における約束を示すが、その正しさの保証は、誤ったコードや幻覚的なコードによってしばしば損なわれる。 NaturalLanguage2VerifiedCodeデータセット:60の複雑なアルゴリズム問題の集合を提供する。 7個のオープンウェイト LLM でランダムに選択された11個の問題集合をタイレッドプロンプト戦略を用いて評価した。以上の結果から,コンテキストレスなプロンプトがほぼユニバーサルの失敗につながる一方で,構造的アンカーと反復的自己修復が劇的なパフォーマンスの転換を促進することが示唆された。
参考スコア（独自算出の注目度）: 0.30915521808748864
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91\% verification success rate, while GPT-OSS 120B rose from zero to 81.82\% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.
Abstract（参考訳）: 大規模言語モデル(LLM)は、自動化されたソフトウェア工学における約束を示すが、その正しさの保証は、誤ったコードや幻覚的なコードによってしばしば損なわれる。モデルの正直性を強制するためには、形式的検証では、数学的検証によって正しいことが証明された形式的な仕様と並行して実装ロジックを合成する必要がある。しかし、非公式な自然言語から正確な形式仕様への移行は依然として困難な課題である。我々の研究は、NaturalLanguage2VerifiedCode (NL2VC)-60データセットを提供することによって、この問題に対処しています。本研究では,7つのオープンウェイトLCMに対してランダムに選択された11の問題を,コンテキストレスプロンプト,構造アンカーを提供するシグネチャプロンプト,Dafny検証器からの反復的フィードバックを利用した自己修復プロンプトを用いて評価する。モデルが明快な仕様で検証を満足する空洞検証に対処するために、uDebugプラットフォームを統合し、機能検証を確実にする。以上の結果から,文脈のないプロンプトがほぼユニバーサルの失敗につながる一方で,構造的シグネチャと反復的自己修復が劇的なパフォーマンスの転換を促進することが示唆された。具体的には、Gemma 4-31Bは90.91\%の検証成功率を達成し、GPT-OSS 120Bは署名誘導フィードバックで0から81.82\%に上昇した。これらの結果は、複雑なアノテーションを合成し、高可用性ソフトウェア開発を促進する効果的な見習いとして機能するオープンウェイトLLMに対して、フォーマルな検証が実現可能であることを示唆している。

論文の概要: From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

関連論文リスト