Fugu-MT paper translation (abstract): The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models

Paper overview: The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models

arxiv url: http://arxiv.org/abs/2606.22792v1
Date: Mon, 22 Jun 2026 03:05:04 GMT
Status: Translation complete
Last updated in system: 2026-06-25 04:36:17.950596
Title: The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models
Title (reference translation): The Origins of Stochasticity: A Comprehensive Study of Uncertainty Quantification in Large Language Models
Authors: Xiang-Jun Ou, Shuang Liang, Xin-Yu Hu, Rong-Hao Huang, Jing Wang, Shao-Qun Zhang
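Abstract summary: This paper proposes a fine-grained uncertainty taxonomy that classifies uncertainty into input-level, parameter-level, token-level, and decoding-process sources. It introduces a comprehensive evaluation framework covering diverse generation settings and metrics. The experiments show that (i) the effectiveness of UQ methods is sensitive to task type and generation settings, (ii) consensus-based methods consistently outperform other UQ methods, and (iii) larger model scale correlates with lower uncertainty estimates.
Reference score (independently computed attention level): 12.213066436465601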
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in Large Language Models (LLMs) have enabled sophisticated reasoning and content generation, yet their inherent stochasticity poses significant challenges for ensuring predictive credibility. While traditional uncertainty taxonomy paradigms, such as the dichotomy of aleatoric and epistemic uncertainties, provide conceptual foundations, they often fail to capture the multi-component and multi-stage nature of LLM generation and struggle to evaluate the effectiveness of various Uncertainty Quantification (UQ) methods. In this paper, we propose a granular uncertainty taxonomy that systematically attributes LLM uncertainty to input-level, parameter-level, token-level, and decoding-process sources. Correspondingly, we categorize existing UQ methods into Bayesian, ensemble, consensus-based, and single-pass approaches. Furthermore, we introduce a comprehensive evaluation framework covering diverse generation settings and metrics. We empirically evaluate 21 typical UQ methods across three prominent LLM families, including Qwen3, Llama 3.2, and DeepSeek-V3, on benchmarks such as TriviaQA, GSM8K, and HumanEval. Our experimental results demonstrate that (i) the effectiveness of UQ methods is sensitive to task types and generation settings; (ii) consensus-based methods, typified by Deg and EigV, consistently outperform other UQ approaches; and (iii) larger model scales correlate with lower uncertainty estimates, suggesting an empirical scaling law for LLM uncertainty. This work bridges the gap between theoretical origins and practical deployment, providing a versatile diagnostic tool for systematically quantifying uncertainty in LLM applications.
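Abstract (reference translation): Recent progress in Large Language Models (LLMs) has made sophisticated reasoning and content generation possible, but their inherent stochasticity is a major obstacle to ensuring predictive credibility. Traditional uncertainty taxonomy paradigms, such as the dichotomy of aleatoric and epistemic uncertainty, provide a conceptual foundation, but they cannot capture the multi-component, multi-stage nature of LLM generation and struggle to evaluate the effectiveness of the various Uncertainty Quantification (UQ) methods. This paper proposes a fine-grained uncertainty taxonomy that systematically attributes LLM uncertainty to input-level, parameter-level, token-level, and decoding-process sources. Correspondingly, existing UQ methods are categorized into Bayesian, ensemble, consensus-based, and single-pass approaches. The paper also introduces a comprehensive evaluation framework covering diverse generation settings and metrics. It empirically evaluates 21 typical UQ methods across three LLM families, including Qwen3, Llama 3.2, and DeepSeek-V3, on benchmarks such as TriviaQA, GSM8K, and HumanEval. The experimental results show that (i) the effectiveness of UQ methods is sensitive to task type and generation settings; (ii) consensus-based methods, typified by Deg and EigV, consistently outperform other UQ approaches; and (iii) larger model scale correlates with lower uncertainty estimates, suggesting an empirical scaling law for LLM uncertainty. This work bridges the gap between theoretical origins and practical deployment and provides a versatile diagnostic tool for systematically quantifying uncertainty in LLM applications.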
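
The abstract names consensus-based methods such as Deg without defining them. As a rough illustration only, the sketch below assumes Deg refers to a degree-matrix score of the kind used in black-box UQ work: sample m responses to one prompt, build a pairwise similarity matrix W, and report trace(mI - D) / m^2, where D is the degree matrix of W. The Jaccard token-overlap similarity and the function names `jaccard` and `degree_uncertainty` are assumptions made for this example, not details taken from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (a stand-in similarity, assumed here)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def degree_uncertainty(responses: list[str]) -> float:
    """Degree-style consensus score: 0 when all samples agree, near 1 when none do."""
    m = len(responses)
    if m == 0:
        raise ValueError("need at least one sampled response")
    # Degree of response i = sum of its similarities to every sample (itself included).
    degrees = [sum(jaccard(r, s) for s in responses) for r in responses]
    # trace(m*I - D) / m^2
    return sum(m - d for d in degrees) / (m * m)


if __name__ == "__main__":
    samples = ["Paris", "paris", "Lyon"]  # hypothetical sampled answers to one question
    print(degree_uncertainty(samples))
```

With full agreement among the samples the score is 0, and it grows toward (m - 1) / m as the samples stop overlapping. A real setup would likely swap the Jaccard overlap for a semantic similarity such as an entailment model.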


Related papers