Fugu-MT Paper Translation (Abstract): Does In-IDE Calibration of Large Language Models work at Scale?

Paper overview: Does In-IDE Calibration of Large Language Models work at Scale?

arxiv url: http://arxiv.org/abs/2510.22614v1
Date: Sun, 26 Oct 2025 10:15:03 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:15.29641
Title: Does In-IDE Calibration of Large Language Models work at Scale?
Title (reference translation): Does in-IDE calibration of large language models work at scale?
Authors: Roham Koohestani, Agnia Sergeyuk, David Gros, Claudio Spiess, Sergey Titov, Prem Devanbu, Maliheh Izadi
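Abstract summary: Post-hoc calibration of internal model confidences aims to align probabilities with an acceptability measure. The authors develop a scalable and flexible calibration framework that can be used to obtain calibration weights for open-source models. A large-scale analysis of over 24 million real-world developer interactions finds that a general post-hoc calibration model based on Platt scaling does not, on average, improve the reliability of model confidence signals.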
Reference score (independently computed attention score): 4.707628898226459
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The introduction of large language models into integrated development environments (IDEs) is revolutionizing software engineering, yet it poses challenges to the usefulness and reliability of Artificial Intelligence-generated code. Post-hoc calibration of internal model confidences aims to align probabilities with an acceptability measure. Prior work suggests calibration can improve alignment, but at-scale evidence is limited. In this work, we investigate the feasibility of applying calibration of code models to an in-IDE context. We study two aspects of the problem: (1) the technical method for implementing confidence calibration and improving the reliability of code generation models, and (2) the human-centered design principles for effectively communicating reliability signal to developers. First, we develop a scalable and flexible calibration framework which can be used to obtain calibration weights for open-source models using any dataset, and evaluate whether calibrators improve the alignment between model confidence and developer acceptance behavior. Through a large-scale analysis of over 24 million real-world developer interactions across multiple programming languages, we find that a general, post-hoc calibration model based on Platt-scaling does not, on average, improve the reliability of model confidence signals. We also find that while dynamically personalizing calibration to individual users can be effective, its effectiveness is highly dependent on the volume of user interaction data. Second, we conduct a multi-phase design study with 3 expert designers and 153 professional developers, combining scenario-based design, semi-structured interviews, and survey validation, revealing a clear preference for presenting reliability signals via non-numerical, color-coded indicators within the in-editor code generation workflow.
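Abstract (reference translation): Introducing large language models into integrated development environments (IDEs) is revolutionizing software engineering, but it poses challenges to the usefulness and reliability of AI-generated code. Post-hoc calibration of internal model confidences aims to align probabilities with an acceptability measure. Prior work suggests that calibration can improve alignment, but evidence at scale is limited. This work investigates the feasibility of applying calibration of code models in an in-IDE context. We study (1) the technical method for implementing confidence calibration and improving the reliability of code generation models, and (2) the human-centered design principles for effectively communicating reliability signals to developers. First, we develop a scalable and flexible calibration framework that can be used to obtain calibration weights for open-source models using any dataset, and we evaluate whether calibrators improve the alignment between model confidence and developer acceptance behavior. A large-scale analysis of over 24 million real-world developer interactions across multiple programming languages shows that a general post-hoc calibration model based on Platt scaling does not, on average, improve the reliability of model confidence signals. We also find that dynamically personalizing calibration to individual users can be effective, but its effectiveness depends heavily on the volume of user interaction data. Second, we conduct a multi-phase design study with 3 expert designers and 153 professional developers, combining scenario-based design, semi-structured interviews, and survey validation. The study reveals a clear preference for presenting reliability signals via non-numerical, color-coded indicators within the in-editor code generation workflow.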
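
For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of Platt scaling and of one common way to measure calibration (expected calibration error). It is not the authors' framework. The function names, the synthetic data, the Newton solver, and the choice of 10 bins are all assumptions made for illustration. The synthetic confidences are distorted by a sigmoid-shaped miscalibration, which is the ideal case for Platt scaling, so the improvement seen here says nothing about the paper's real-world finding.

```python
import numpy as np


def fit_platt(scores, labels, iters=100):
    """Fit p = sigmoid(a * score + b) to binary labels (Newton's method on log loss)."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    X = np.column_stack([s, np.ones_like(s)])  # columns: score, bias
    w = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y)
        # Hessian of the log loss, with a tiny ridge term for numerical stability.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-8 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        w -= step
        if np.max(np.abs(step)) < 1e-10:
            break
    return w[0], w[1]


def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities with fitted parameters a, b."""
    return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores, dtype=float) + b)))


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average confidence and accuracy over equal-width bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece


def demo(seed=0, n=20000):
    """Synthetic example: hypothetical model confidences vs. simulated acceptance."""
    rng = np.random.default_rng(seed)
    conf = rng.uniform(0.05, 0.95, size=n)      # raw model confidence (made up)
    logit = np.log(conf / (1.0 - conf))         # score fed to the calibrator
    true_p = 1.0 / (1.0 + np.exp(-(2.0 * logit - 1.0)))  # assumed distortion
    accepted = (rng.uniform(size=n) < true_p).astype(float)

    half = n // 2                               # fit on one half, evaluate on the other
    a, b = fit_platt(logit[:half], accepted[:half])
    ece_before = expected_calibration_error(conf[half:], accepted[half:])
    ece_after = expected_calibration_error(
        apply_platt(logit[half:], a, b), accepted[half:]
    )
    return a, b, ece_before, ece_after


if __name__ == "__main__":
    a, b, before, after = demo()
    print(f"a={a:.3f} b={b:.3f} ECE before={before:.4f} after={after:.4f}")
```

Because Platt scaling fits only two parameters, it can correct a global shift or stretch of the confidence scale but not arbitrary distortions, which is one reason a single general calibrator may fail to help on heterogeneous data.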

Related papers