Fugu-MT paper translation (abstract): Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications

Paper overview: Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications

arxiv url: http://arxiv.org/abs/2508.12358v1
Date: Sun, 17 Aug 2025 13:07:26 GMT
Status: translation complete
Last updated in system: 2025-08-19 14:49:10.703017
Title: Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
Title (reference translation): Discovering systematic failures of LLMs in verifying code against natural language specifications
Authors: Haolin Jin, Huaming Chen
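Abstract summary: Large language models (LLMs) have become essential tools in software development and are widely used for requirements engineering, code generation, and review tasks. This paper uncovers a systematic failure of LLMs in evaluating whether code aligns with natural language requirements. The results show that LLMs frequently misclassify correct code implementations as not satisfying the requirements or as containing potential defects.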
Reference score (independently calculated attention score): 0.6813925418351435
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to assess whether code implementations satisfy task requirements, thereby enhancing code robustness and accuracy. However, it remains unclear whether LLMs can reliably determine whether code complies fully with the given task descriptions, which are usually natural language specifications. In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements. Specifically, using widely used benchmarks, we employ unified prompts to judge code correctness. Our results reveal that LLMs frequently misclassify correct code implementations as either "not satisfying requirements" or containing potential defects. Surprisingly, more complex prompting, especially prompt engineering techniques involving explanations and proposed corrections, leads to a higher misjudgment rate, which highlights critical reliability issues in using LLMs as code review assistants. We further analyze the root causes of these misjudgments and propose two improved prompting strategies for mitigation. For the first time, our findings reveal unrecognized limitations of LLMs in matching code with requirements. We also offer novel insights and practical guidance for the effective use of LLMs in automated code review and task-oriented agent scenarios.
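Abstract (reference translation): Large language models (LLMs) have become indispensable tools in software development and are widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to evaluate whether a code implementation meets the task requirements, which in turn improves the robustness and accuracy of the code. It is unclear, however, whether LLMs can reliably judge that code fully complies with a given task description, which is usually a natural language specification. This paper shows that LLMs fail systematically when evaluating whether code conforms to natural language requirements. Specifically, unified prompts are used on widely used benchmarks to judge code correctness. The results show that LLMs often misclassify correct code implementations as "not satisfying requirements" or as containing potential defects. Surprisingly, more complex prompts, especially those using prompt engineering techniques that include explanations and proposed corrections, raise the misjudgment rate, which underlines serious reliability problems in using LLMs as code review assistants. The paper further analyzes the root causes of these misjudgments and proposes two improved prompting strategies for mitigation. The findings reveal, for the first time, previously unrecognized limitations of LLMs in matching code with requirements. The paper also provides new insights and practical guidance for using LLMs effectively in automated code review and task-oriented agent scenarios.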
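The misjudgment rate mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the prompt wording, the names `build_prompt`, `misjudgment_rate`, and `toy_judge`, and the two sample tasks are all assumptions made for illustration, and the judge is a deliberately over-strict stand-in for a real LLM call.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical unified prompt; the abstract does not give the real wording.
PROMPT_TEMPLATE = (
    "Requirement:\n{spec}\n\n"
    "Code:\n{code}\n\n"
    "Does the code satisfy the requirement? Answer YES or NO."
)


def build_prompt(spec: str, code: str) -> str:
    """Fill the same template for every sample, so all judgments share one prompt."""
    return PROMPT_TEMPLATE.format(spec=spec, code=code)


def misjudgment_rate(
    samples: Iterable[Tuple[str, str]],
    judge: Callable[[str], bool],
) -> float:
    """Fraction of known-correct (spec, code) pairs that the judge rejects."""
    pairs = list(samples)
    if not pairs:
        raise ValueError("need at least one sample")
    rejected = sum(1 for spec, code in pairs if not judge(build_prompt(spec, code)))
    return rejected / len(pairs)


def toy_judge(prompt: str) -> bool:
    """Stand-in for an LLM: accepts only code that visibly validates its input."""
    return "raise" in prompt


# Both implementations are correct for their specification (illustrative data).
SAMPLES = [
    ("Return the sum of two integers.",
     "def add(a, b):\n    return a + b"),
    ("Return the square root of a non-negative number.",
     "def sqrt(x):\n    if x < 0:\n        raise ValueError('negative')\n    return x ** 0.5"),
]

if __name__ == "__main__":
    print(misjudgment_rate(SAMPLES, toy_judge))  # 0.5
```

Because every sample is correct by construction, any rejection counts as a misjudgment. Replacing `toy_judge` with a function that queries an actual model and parses its YES/NO answer would give the kind of measurement the abstract describes, under the assumptions stated above.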


Related papers