Fugu-MT Paper Translation (Abstract): Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Paper overview: Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

arxiv url: http://arxiv.org/abs/2511.02197v1
Date: Tue, 04 Nov 2025 02:30:30 GMT
Status: translation complete
System update date: 2025-11-05 18:47:05.778025
Title: Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Authors: Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia
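Abstract summary: This paper proposes a confidence analysis and enhancement framework for large language models (LLMs). It presents a comprehensive empirical study of the confidence reliability of mainstream LLMs across different tasks, and evaluates how well techniques such as prompt strategy optimisation and mathematical calibration improve confidence reliability.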
Reference score (attention score computed independently by this site): 16.02000925637464
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.
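
The abstract names three calibration tools: ECE, Brier Score, and Platt Scaling. The sketch below illustrates the standard textbook forms of each on a toy example. It is not the paper's implementation: the 10 equal-width bins, the choice to fit Platt Scaling directly on the raw stated confidence, the plain gradient-descent fit, and all function names are assumptions made here for illustration. The paper's Performance Score is not defined in the abstract, so it is left out.

```python
import math


def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - y) ** 2 for c, y in pairs) / len(pairs)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE with equal-width bins (n_bins=10 is an assumption, not from the paper).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of samples that fall into it.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = sum(len(members) for members in bins)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(confidences, outcomes, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * s + b) by gradient descent on the log loss.

    Simplified: Platt's original method also smooths the 0/1 targets and
    uses a more robust optimiser; neither is reproduced here.
    """
    a, b = 1.0, 0.0
    n = len(outcomes)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(confidences, outcomes):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(confidences, a, b):
    """Map raw confidences to calibrated probabilities."""
    return [_sigmoid(a * s + b) for s in confidences]


if __name__ == "__main__":
    # Toy, made-up data: a model that always says 0.9 but is right 6 times in 10.
    raw = [0.9] * 10
    correct = [1] * 6 + [0] * 4
    a, b = fit_platt(raw, correct)
    calibrated = apply_platt(raw, a, b)
    print("ECE before:", expected_calibration_error(raw, correct))
    print("ECE after: ", expected_calibration_error(calibrated, correct))
    print("Brier before:", brier_score(raw, correct))
    print("Brier after: ", brier_score(calibrated, correct))
```

In practice the Platt parameters would be fitted on a held-out calibration set and applied to separate evaluation data; fitting and evaluating on the same samples, as the toy run does, overstates the gain.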

Related papers