Fugu-MT Paper Translation (Abstract): The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion

Paper overview: The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion

arxiv url: http://arxiv.org/abs/2508.16131v1
Date: Fri, 22 Aug 2025 06:51:13 GMT
Status: Translation complete
System update date: 2025-08-25 16:42:36.273449
Title: The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
Authors: Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas
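Abstract summary: The confidence of Large Language Models (LLMs) when generating code is evaluated by measuring code perplexity. Strongly typed languages exhibit lower perplexity than dynamically typed languages. Perl appears universally high in perplexity, whereas Java appears low.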
Reference score (independently computed attention score): 4.215010577170175
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when generating code by measuring code perplexity across programming languages, models, and datasets using various LLMs, and a sample of 1008 files from 657 GitHub projects. We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Perl appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM, but not on the code dataset. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence. LLM researchers, developers, and users can employ our findings to assess the benefits and suitability of LLM-based code completion in specific software projects based on how language, model choice, and code characteristics impact model confidence.
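As a rough illustration of the intrinsic metric the abstract relies on, perplexity is commonly computed as the exponential of the mean negative log-likelihood of the tokens. The sketch below assumes that per-token natural-log probabilities have already been obtained from some model; the function name `perplexity` and the sample probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_logprobs holds natural-log probabilities, one per token
    (an assumed input format, not the paper's pipeline).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical values: a model assigning probability 0.5 to each token ...
confident = [math.log(0.5)] * 4
# ... versus a model assigning probability 0.1 to each token.
uncertain = [math.log(0.1)] * 4

if __name__ == "__main__":
    print(perplexity(confident))
    print(perplexity(uncertain))
```

Lower perplexity corresponds to higher model confidence, so in this toy case the first sequence scores lower than the second.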


Related papers