Fugu-MT Paper Translation (Abstract): CalArena: A Large-Scale Post-Hoc Calibration Benchmark

Paper overview: CalArena: A Large-Scale Post-Hoc Calibration Benchmark

arxiv url: http://arxiv.org/abs/2605.30188v1
Date: Thu, 28 May 2026 16:31:36 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:56.536546
Title: CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Title (reference translation): CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Authors: Eugène Berta, David Holzmüller, Francis Bach, Michael I. Jordan
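Abstract summary: We introduce a large-scale, standardized benchmark for post-hoc calibration. The benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models. We argue that Post-Hoc Improvement (PHI) in proper scoring rules is a principled alternative to traditional calibration error estimators.
Reference score (independently computed attention score): 48.0798861811642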
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
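Abstract (reference translation): Reliable probability estimates are important in many machine learning applications, but modern classifiers are often poorly calibrated. Post-hoc calibration is a simple and widely used remedy, yet the large number of proposed methods, together with small-scale and inconsistent evaluations, makes it hard to judge which methods actually work in practice. This paper introduces a large-scale, standardized benchmark for post-hoc calibration that covers nearly 2000 experiments on tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. The benchmark aggregates predictions from a variety of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. The authors argue that Post-Hoc Improvement (PHI) in proper scoring rules is a principled alternative to traditional calibration error estimators for comparing post-hoc methods, because it captures both calibration quality and any degradation of the model's predictive performance. Using this framework, they carry out the most comprehensive empirical study of post-hoc calibration to date. The results show consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To support future research, all data, code, and evaluation tools are released as a plug-and-play benchmark for developing and comparing calibration methods.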
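
The abstract proposes comparing post-hoc calibrators by their Post-Hoc Improvement (PHI) in a proper scoring rule. The sketch below is one possible reading of that idea, not the paper's definition or code: it assumes PHI is the held-out score before calibration minus the score after calibration, uses binary log loss and the Brier score as the proper scoring rules, and uses grid-searched temperature scaling as a stand-in calibrator. All function names and the toy data are hypothetical.

```python
import math


def log_loss(probs, labels):
    """Mean negative log-likelihood for binary labels (a proper scoring rule)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p if y == 1 else 1.0 - p, eps))
    return total / len(labels)


def brier_score(probs, labels):
    """Mean squared error between probabilities and labels (also proper)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def temperature_scale(probs, temperature):
    """Divide each probability's logit by `temperature`, then map back."""
    scaled = []
    for p in probs:
        logit = math.log(p / (1.0 - p))
        scaled.append(1.0 / (1.0 + math.exp(-logit / temperature)))
    return scaled


def fit_temperature(probs, labels):
    """Pick the temperature on a coarse grid that minimizes log loss."""
    grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5, 0.55, ..., 5.0
    return min(grid, key=lambda t: log_loss(temperature_scale(probs, t), labels))


def post_hoc_improvement(score_fn, raw, calibrated, labels):
    """Assumed PHI: score before calibration minus score after (higher is better)."""
    return score_fn(raw, labels) - score_fn(calibrated, labels)


# Toy overconfident classifier: predicts 0.95 / 0.05 but is right ~80% of the time.
cal_probs = [0.95, 0.95, 0.95, 0.05, 0.05, 0.95, 0.05, 0.95, 0.05, 0.05]
cal_labels = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
test_probs = [0.95, 0.05, 0.95, 0.95, 0.05, 0.05, 0.95, 0.05]
test_labels = [1, 0, 0, 1, 0, 1, 1, 0]

# Fit the calibrator on the calibration split, evaluate PHI on the test split.
temperature = fit_temperature(cal_probs, cal_labels)
test_calibrated = temperature_scale(test_probs, temperature)

phi_log = post_hoc_improvement(log_loss, test_probs, test_calibrated, test_labels)
phi_brier = post_hoc_improvement(brier_score, test_probs, test_calibrated, test_labels)
print(f"T={temperature:.2f}  PHI(log loss)={phi_log:.3f}  PHI(Brier)={phi_brier:.3f}")
```

On this toy data both PHI values come out positive, because softening the overconfident predictions lowers both losses. Under this reading, a calibrator that hurt predictive performance would produce a negative PHI, which a calibration error estimator alone would not reveal.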
