Fugu-MT Paper Translation (Abstract): Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

Paper overview: Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

arxiv url: http://arxiv.org/abs/2604.06723v1
Date: Wed, 08 Apr 2026 06:41:09 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.368924
Title: Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision
Authors: Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude
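Abstract summary: A canonical mitigation method is to provide calibrated confidence scores that faithfully reflect the likelihood of correctness at the instance level. This study proposes applying local Platt-scaling separately to three different fine-grained confidence scores. The authors find that fine-grained confidence scores consistently achieve lower calibration error across a broader range of probability intervals.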
Reference score (independently computed attention measure): 16.289117637700446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In today's AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. To this end, a canonical mitigation method is to provide calibrated confidence scores that faithfully reflect their likelihood of correctness at the instance level. Such information allows users to make immediate decisions regarding output acceptance, abstain from error-prone outputs, and better align their expectations with the model's capabilities. Since post-trained LLMs do not inherently produce well-calibrated confidence scores, researchers have developed post-hoc calibration methods, with global Platt-scaling of sequence-level confidence scores proving effective in many generative software engineering tasks but remaining unreliable or unexplored for automated code revision (ACR) tasks such as program repair, vulnerability repair, and code refinement. We hypothesise that the coarse-grained nature of this conventional method makes it ill-suited for ACR tasks, where correctness is often determined by local edit decisions and miscalibration can be sample-dependent, thereby motivating fine-grained confidence calibration. To address this, our study proposes local Platt-scaling applied separately to three different fine-grained confidence scores. Through experiments across 3 separate tasks and correctness metrics, as well as 14 different models of various sizes, we find that fine-grained confidence scores consistently achieve lower calibration error across a broader range of probability intervals, and this effect is further amplified when global Platt-scaling is applied. Our proposed approaches offer a practical solution to eliciting well-calibrated confidence scores, enabling more trustworthy and streamlined usage of imperfect models in ACR tasks.
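
The abstract does not define the three fine-grained confidence scores or the local variant of Platt-scaling, so the sketch below covers only the conventional baseline it describes: global Platt-scaling of a sequence-level confidence score, evaluated with a binned calibration error. All function names are illustrative and not taken from the paper. Two choices are assumptions made for this example: the sequence-level confidence is the length-normalised token probability, and the sigmoid is fitted on the logit of that confidence. The data are synthetic.

```python
import numpy as np

def sequence_confidence(token_logprobs):
    """Assumed sequence-level score: exp of the mean token log-probability."""
    return float(np.exp(np.mean(token_logprobs)))

def _logit(conf):
    # Clip to keep the logit finite at 0 and 1.
    conf = np.clip(np.asarray(conf, dtype=float), 1e-6, 1 - 1e-6)
    return np.log(conf / (1 - conf))

def apply_platt(conf, a, b):
    """Map raw confidences to sigmoid(a * logit(conf) + b)."""
    return 1 / (1 + np.exp(-(a * _logit(conf) + b)))

def fit_platt(conf, correct, lr=0.1, steps=5000):
    """Fit one global (a, b) pair by gradient descent on the log loss."""
    z = _logit(conf)
    y = np.asarray(correct, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(a * z + b)))
        a -= lr * np.mean((p - y) * z)   # d(log loss)/da
        b -= lr * np.mean(p - y)         # d(log loss)/db
    return a, b

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    conf = np.asarray(conf, dtype=float)
    y = np.asarray(correct, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - conf[mask].mean())
    return float(ece)

# Synthetic, overconfident model: true correctness is lower than the raw score.
rng = np.random.default_rng(0)
raw = rng.uniform(0.5, 0.99, size=2000)
correct = rng.random(2000) < apply_platt(raw, 0.8, -1.0)

a, b = fit_platt(raw, correct)
ece_before = expected_calibration_error(raw, correct)
ece_after = expected_calibration_error(apply_platt(raw, a, b), correct)
print(f"a={a:.2f} b={b:.2f} ECE before={ece_before:.3f} after={ece_after:.3f}")
```

A single (a, b) pair is shared by every sample here, which is the coarse-grained property the abstract argues is ill-suited to ACR tasks. For brevity the sketch fits and evaluates on the same data; a real evaluation would use a held-out split.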

Related papers