Fugu-MT Paper Translation (Summary): Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

Paper summary: Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

arxiv url: http://arxiv.org/abs/2510.15079v1
Date: Thu, 16 Oct 2025 18:48:12 GMT
Status: Translation complete
Last updated in system: 2025-10-20 20:17:34.360904
Title: Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models
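Title (reference translation): Evaluating the Coherence and Consistency of Code Execution Reasoning by Large Language Models
Authors: Changshu Liu, Yang Chen, Reyhaneh Jabbarvand
Abstract summary: This paper proposes CES, a task that evaluates the ability of LLMs to simulate program execution and to use that reasoning in programming tasks. CES introduces the notion of coherence to determine whether a simulation complies with commonsense execution logic. CES also introduces a novel metric that measures reasoning consistency across tests with the same or different prime path coverage, placed on a spectrum.
Reference score (independently computed attention level): 5.692204231573854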
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper proposes CES, a task to evaluate the abilities of LLMs in simulating program execution and using that reasoning in programming tasks. Besides measuring the correctness of variable predictions during execution simulation, CES introduces the notion of coherence to determine whether the simulation complies with commonsense execution logic, even if the predicted values along the simulations are incorrect. This enables CES to rule out suspiciously correct output predictions due to reasoning shortcuts, hallucinations, or potential data leakage. CES also introduces a novel metric to measure reasoning consistency across tests with the same or different prime path coverage in a spectrum: strong, weak, and random. Evaluating 16 LLMs (including three reasoning LLMs) using CES indicates 81.42% coherent execution simulation on HumanEval, 46.92% and 53.08% of which result in correct and incorrect output predictions. Frontier LLMs such as GPT-4 and DeepSeek-R1 have the most incoherent execution reasoning, mostly due to natural language shortcuts. Despite relatively coherent execution simulation, LLMs' reasoning performance across different tests is inconsistent, mostly random (48.87%) or weak (45.37%), potentially explaining their weakness in programming tasks that require path-sensitive program analysis to succeed. We also compare CES with bug prediction/localization/repair, which intuitively requires control- and data-flow awareness. We observe that LLMs barely incorporate execution reasoning into their analysis for bug-related tasks, and their success is primarily due to inherent abilities in pattern matching or natural language shortcuts, if not data leakage. Without reasoning, there is a threat to the generalizability of LLMs in dealing with unseen bugs or patterns in different contexts. CES can be used to vet the suspicious success of LLMs in these tasks systematically.
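Abstract (reference translation): This paper proposes CES, a task that evaluates the ability of LLMs to simulate program execution and to use that reasoning in programming tasks. In addition to measuring the correctness of variable predictions during execution simulation, CES introduces the notion of coherence to determine whether the simulation complies with commonsense execution logic. This allows CES to rule out suspiciously correct output predictions that stem from shortcuts, hallucinations, or potential data leakage. CES also introduces a novel metric that measures reasoning consistency across tests with the same or different prime path coverage, placed on a spectrum. Evaluating 16 LLMs (including three reasoning LLMs) with CES shows 81.42% coherent execution simulation on HumanEval, of which 46.92% and 53.08% result in correct and incorrect output predictions, respectively. Frontier LLMs such as GPT-4 and DeepSeek-R1 have the most incoherent execution reasoning, mostly because of natural language shortcuts. Despite relatively coherent execution simulation, the reasoning performance of LLMs across different tests is inconsistent, mostly random (48.87%) or weak (45.37%), which may explain their weakness in programming tasks that require path-sensitive program analysis. We also compare CES with bug prediction/localization/repair, which intuitively requires control- and data-flow awareness. LLMs barely incorporate execution reasoning into their analysis for bug-related tasks, and their success is primarily due to inherent abilities in pattern matching or natural language shortcuts. Without reasoning, there is a threat to the generalizability of LLMs when they deal with unseen bugs or patterns in different contexts. CES can be used to systematically vet the suspicious success of LLMs in these tasks.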
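The HumanEval percentages in the abstract are nested: 46.92% and 53.08% are shares of the 81.42% coherent simulations, not of all simulations. The short Python sketch below converts them into shares of all simulations. It assumes that reading of "of which", and the function and key names are illustrative, not taken from the paper.

```python
# Figures quoted in the abstract (HumanEval, aggregated over the 16 LLMs).
COHERENT = 0.8142                  # share of simulations judged coherent
CORRECT_GIVEN_COHERENT = 0.4692    # of the coherent ones: correct output prediction
INCORRECT_GIVEN_COHERENT = 0.5308  # of the coherent ones: incorrect output prediction


def overall_shares(coherent, correct_given, incorrect_given):
    """Convert the nested percentages into shares of all simulations.

    Assumption: the two conditional figures partition the coherent
    simulations, so they are multiplied by the coherent share.
    """
    return {
        "coherent_correct": coherent * correct_given,
        "coherent_incorrect": coherent * incorrect_given,
        "incoherent": 1.0 - coherent,
    }


shares = overall_shares(COHERENT, CORRECT_GIVEN_COHERENT, INCORRECT_GIVEN_COHERENT)
for name, value in shares.items():
    print(f"{name}: {value:.2%}")
```

Under this reading, roughly 38% of all simulations are both coherent and end in a correct output prediction.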


Related papers