Fugu-MT Paper Translation (Summary): Confidence Calibration under Ambiguous Ground Truth

Paper overview: Confidence Calibration under Ambiguous Ground Truth

arxiv url: http://arxiv.org/abs/2603.22879v1
Date: Tue, 24 Mar 2026 07:24:09 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.355867
Title: Confidence Calibration under Ambiguous Ground Truth
Title (reference translation): Confidence Calibration under Ambiguous Ground Truth
Authors: Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu
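Abstract summary: Post-hoc calibrators fitted on majority-voted labels can appear well-calibrated under conventional evaluation, yet they remain substantially miscalibrated against the underlying annotator distribution. The authors develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution.
Reference score (independently computed attention score): 43.71398545904091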
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model's own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.
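Abstract (reference translation): Confidence calibration assumes a unique ground-truth label for each input, but this assumption fails when annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets, appear well-calibrated under conventional evaluation but remain substantially miscalibrated against the underlying annotator distribution. Under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, and true-label miscalibration increases monotonically with annotation entropy. This work therefore develops a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling (MCTS S=1) matches full-distribution calibration on all benchmarks, showing that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates on voted labels alone by constructing data-driven pseudo-soft targets from the model's own confidence. In experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically informed synthetic annotations (ISIC 2019, DermaMNIST), Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.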
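
The abstract's central claim, that Temperature Scaling fitted on voted labels favours temperatures that understate annotator uncertainty, can be illustrated with a toy sketch. This is not the authors' implementation. The synthetic Dirichlet annotator distributions, the hypothetical model whose logits equal 2·log p (so the ideal temperature is exactly 2), the helper name `fit_temperature`, and the grid search are all assumptions made for illustration. The single-sample fit is only one plausible reading of "MCTS S=1"; the paper's actual procedure may differ.

```python
import numpy as np

GRID = np.linspace(0.05, 10.0, 200)  # candidate temperatures (assumed range)

def log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def fit_temperature(logits, targets, grid=GRID):
    """Grid-search the temperature minimising cross-entropy against `targets`.

    `targets` is an (N, K) array of label distributions; one-hot rows
    recover ordinary Temperature Scaling on hard labels.
    """
    losses = [-(targets * log_softmax(logits / t)).sum(axis=1).mean() for t in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(0)
N, K = 5000, 3
p = rng.dirichlet(np.full(K, 2.0), size=N)  # synthetic annotator distributions
logits = 2.0 * np.log(p)                    # toy overconfident model, ideal T = 2

voted = np.eye(K)[p.argmax(axis=1)]                      # majority-vote one-hot targets
single = np.eye(K)[[rng.choice(K, p=row) for row in p]]  # one sampled annotation each

t_hard = fit_temperature(logits, voted)   # standard TS on voted labels
t_soft = fit_temperature(logits, p)       # TS against the full distribution
t_mc = fit_temperature(logits, single)    # TS against a single annotation per example

print(f"voted: {t_hard:.2f}  soft: {t_soft:.2f}  single-sample: {t_mc:.2f}")
```

In this idealised setup the model's top class always agrees with the vote, so the voted-label fit is pushed to the lower end of the grid, while the soft-target fit recovers the temperature that reproduces the annotator distribution. Real models disagree with the vote on some examples, so the gap would be smaller in practice.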

Related papers