Fugu-MT Paper Translation (Abstract): SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code

Paper overview: SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code

arxiv url: http://arxiv.org/abs/2509.24507v1
Date: Mon, 29 Sep 2025 09:21:32 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.894827
Title: SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code
Title (reference translation): SemGuard: A Real-Time Semantic Evaluator for Correcting LLM-Generated Code
Authors: Qinglin Wang, Zhihong Sun, Ruyun Wang, Tao Huang, Zhi Jin, Ge Li, Chen Lyu
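Abstract summary: Post-hoc repair pipelines detect semantic faults in LLM-generated code only after execution. This paper introduces SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision.
Reference score (attention score computed independently by this site): 46.20378145112059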
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) can translate natural language requirements into code, yet empirical analyses of representative models reveal that semantic errors (programs that compile but behave incorrectly) constitute the majority of observed faults (e.g., >60% on DeepSeek-Coder-6.7B and QwenCoder-7B). Post-hoc repair pipelines detect such faults only after execution, incurring latency, relying on incomplete test suites, and often mis-localizing the defect. Since semantic drift originates in the autoregressive decoding process, intervening while the code is being generated is a direct way to stop error propagation. Constrained-decoding approaches such as ROCODE attempt this, but still wait until the entire program runs to obtain feedback and use entropy heuristics that do not truly capture semantics. A more effective solution must inject semantic signals, early and precisely, into the decoding process. We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision. To train the evaluator, we build SemDiff, the first dataset with fine-grained annotations that mark the exact line where a correct and an incorrect implementation diverge. The evaluator, once embedded in the LLM's decoder, flags deviations on partial code, rolls back to the faulty line, and guides regeneration, without executing the program or requiring test cases. Across four benchmarks, SemGuard consistently outperforms state-of-the-art baselines. It lowers the semantic error rate by 19.86% on SemDiff relative to ROCODE, and lifts Pass@1 by 48.92% on the real-world LiveCodeBench with CodeLlama-7B. Similar gains hold for StarCoder2-7B on MBPP and for DeepSeekCoder-6.7B on the Java benchmark SemDiff-Java, demonstrating model- and language-agnostic effectiveness.
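Abstract (reference translation): Large language models (LLMs) can turn natural language requirements into code, but empirical analysis of representative models shows that semantic errors, meaning programs that compile but do not behave correctly, make up the majority of observed faults (for example, more than 60% on DeepSeek-Coder-6.7B and QwenCoder-7B). Post-hoc repair pipelines detect such faults only after execution, which adds latency, depends on incomplete test suites, and often localizes the defect incorrectly. Because semantic drift arises in the autoregressive decoding process, intervening while the code is being generated is a direct way to stop errors from propagating. Constrained-decoding approaches such as ROCODE attempt this, but they still wait for the whole program to run before obtaining feedback, and they rely on entropy heuristics that do not truly capture semantics. A more effective solution needs to inject semantic signals into the decoding process early and precisely. This paper presents SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision. To train the evaluator, the authors build SemDiff, the first dataset with fine-grained annotations marking the exact line at which a correct and an incorrect implementation diverge. Embedded in the LLM's decoder, the evaluator flags deviations in partial code, rolls back to the faulty line, and guides regeneration, without executing the program or requiring test cases. Across four benchmarks, SemGuard consistently outperforms state-of-the-art baselines. It lowers the semantic error rate on SemDiff by 19.86% relative to ROCODE and raises Pass@1 by 48.92% on the real-world LiveCodeBench with CodeLlama-7B. Similar gains hold for StarCoder2-7B on MBPP and for DeepSeekCoder-6.7B on the Java benchmark SemDiff-Java, showing that the approach is effective regardless of model or language.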
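
The flag, roll back, and regenerate loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`generate_with_rollback`, `propose_line`, `is_on_track`), the retry limit, and the toy generator and evaluator are all hypothetical stand-ins. In particular, the sketch checks each line as soon as it is proposed, so "rolling back" reduces to discarding that one line, and the real evaluator is a trained model rather than a string check.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions for illustration, not from the paper):
#   propose_line(prefix, attempt) -> the next line of code, or None when done
#   is_on_track(partial)          -> True if the partial program still looks correct
ProposeFn = Callable[[List[str], int], Optional[str]]
EvaluateFn = Callable[[List[str]], bool]


def generate_with_rollback(
    propose_line: ProposeFn,
    is_on_track: EvaluateFn,
    max_retries: int = 3,
    max_lines: int = 50,
) -> List[str]:
    """Generate code line by line, re-sampling any line the evaluator flags."""
    lines: List[str] = []
    while len(lines) < max_lines:
        accepted: Optional[str] = None
        candidate: Optional[str] = None
        for attempt in range(max_retries + 1):
            candidate = propose_line(lines, attempt)
            if candidate is None:  # the generator signals end of program
                return lines
            # Evaluate the partial program; nothing is executed or tested.
            if is_on_track(lines + [candidate]):
                accepted = candidate
                break
            # Flagged: drop the candidate (the rollback) and try again.
        if accepted is None:
            accepted = candidate  # retries exhausted: keep the last candidate
        lines.append(accepted)
    return lines


# Toy stand-ins so the sketch runs end to end.
CORRECT = [
    "def sum_even(xs):",
    "    total = 0",
    "    for x in xs:",
    "        if x % 2 == 0:",
    "            total += x",
    "    return total",
]
BUGGY_FIRST_TRY = {3: "        if x % 2 == 1:"}  # semantic slip on the first attempt


def toy_propose(prefix: List[str], attempt: int) -> Optional[str]:
    i = len(prefix)
    if i >= len(CORRECT):
        return None
    if attempt == 0 and i in BUGGY_FIRST_TRY:
        return BUGGY_FIRST_TRY[i]
    return CORRECT[i]


def toy_evaluator(partial: List[str]) -> bool:
    # Stand-in for a learned semantic evaluator: flags the known wrong parity test.
    return "% 2 == 1" not in partial[-1]


program = generate_with_rollback(toy_propose, toy_evaluator)
print("\n".join(program))
```

In this toy run the fourth line is flagged on its first attempt, discarded, and regenerated, so the semantic slip never reaches the later lines. No part of the program is executed during generation, which is the property the abstract contrasts with post-hoc repair.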
