Fugu-MT Paper Translation (Abstract): Inferring Code Correctness from Specification

Paper abstract: Inferring Code Correctness from Specification

arxiv url: http://arxiv.org/abs/2605.29822v1
Date: Thu, 28 May 2026 12:04:51 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.219463
Title: Inferring Code Correctness from Specification
Authors: Tambon Florian, Papadakis Mike
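Abstract summary: Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. The proposed TRAILS (Targeted Reasoning Agreement via Inputs and Specifications) is an approach that grounds LLM reasoning in concrete (input, output) pairs. TRAILS is evaluated on two datasets, LiveCodeBench and CoCoClaNeL, across three LLMs (Qwen3Coder-30B, Devstral-Small-24B, and Olmo3.1-Instruct), and compared against HoarePrompt and a Zero-Shot Chain-of-Thought baseline.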
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, validating the correctness of LLM-generated code remains a critical and largely unsolved challenge. Existing approaches either rely on dynamic consensus across multiple code candidates - making them costly and difficult to scale - or on static reasoning that is susceptible to dynamic bugs and order bias. In this paper, we propose TRAILS (Targeted Reasoning Agreement via Inputs and Specifications), an approach that grounds LLM reasoning with concrete (input, output) pairs. TRAILS first generates diverse test inputs via category partitioning based on the specification, then executes them against the candidate code and prompts LLMs to assess whether the resulting input-output pairs conform to the specification - without ever reasoning over the code itself. Scores are aggregated across inputs to determine whether the program is likely correct. We evaluate TRAILS on two datasets, LiveCodeBench and CoCoClaNeL, across three LLMs (Qwen3Coder-30B, Devstral-Small-24B, and Olmo3.1-Instruct), comparing against HoarePrompt and a Zero-Shot Chain-of-Thought baseline. TRAILS improves the Matthews Correlation Coefficient by up to 39% relative to Zero-Shot CoT and consistently outperforms HoarePrompt. Beyond accuracy, TRAILS demonstrates greater stability across seeded runs, reducing sensitivity to LLM non-determinism, and assigns correct labels to a larger set of unique code samples than competing approaches.
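The pipeline described in the abstract (generate inputs, execute the candidate, judge each input-output pair against the specification, aggregate) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `assess_program`, `Verdict`, `stub_judge`, `SPEC`, and `INPUTS` are hypothetical, the mean-plus-threshold aggregation is an assumption (the abstract does not say how scores are combined), and the judge is a deterministic stand-in for what would be an LLM call that sees only the specification and the pair, never the code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass(frozen=True)
class Verdict:
    test_input: str
    output: str
    score: float  # judge's score in [0, 1] that the pair conforms to the spec


def assess_program(
    spec: str,
    candidate: Callable[[str], str],
    test_inputs: Iterable[str],
    judge: Callable[[str, str, str], float],
    threshold: float = 0.5,  # assumed cut-off; not taken from the paper
) -> tuple[bool, list[Verdict]]:
    """Run the candidate on each input and let the judge score each I/O pair."""
    verdicts = []
    for test_input in test_inputs:
        try:
            output = candidate(test_input)
        except Exception as exc:  # a crash is still an observable outcome
            output = f"<raised {type(exc).__name__}>"
        verdicts.append(Verdict(test_input, output, judge(spec, test_input, output)))
    if not verdicts:
        raise ValueError("at least one test input is required")
    # Assumed aggregation: mean score compared against the threshold.
    return mean(v.score for v in verdicts) >= threshold, verdicts


# Hypothetical specification and hand-picked inputs, one per category
# (negative, zero, positive), standing in for category partitioning.
SPEC = "Given an integer n on one line, print its absolute value."
INPUTS = ["-3", "0", "7"]


def stub_judge(spec: str, test_input: str, output: str) -> float:
    """Stand-in for an LLM judge: checks the pair directly against the toy spec."""
    return 1.0 if output == str(abs(int(test_input))) else 0.0


if __name__ == "__main__":
    likely_correct, verdicts = assess_program(
        SPEC, lambda s: s, INPUTS, stub_judge, threshold=0.9
    )
    print(likely_correct, [(v.test_input, v.output, v.score) for v in verdicts])
```

In this toy run the candidate `lambda s: s` returns its input unchanged, so only the negative-input category exposes the bug. That is the motivation for drawing inputs from distinct categories rather than sampling them arbitrarily.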
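For readers unfamiliar with the metric reported in the abstract, the Matthews Correlation Coefficient for a binary correct/incorrect classifier is computed from the confusion-matrix counts, and a relative improvement is the gain divided by the baseline value. The sketch below uses the standard MCC formula; the function names are hypothetical, and returning 0.0 for an empty row or column is a common convention assumed here rather than something stated in the source.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


def relative_improvement(baseline: float, improved: float) -> float:
    """Gain of `improved` over `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (improved - baseline) / abs(baseline)
```

MCC ranges from -1 (every label wrong) through 0 (no better than chance) to 1 (every label right), and unlike plain accuracy it stays informative when correct and incorrect programs are unevenly represented.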
