Fugu-MT Paper Translation (Abstract): ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

Paper overview: ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

arxiv url: http://arxiv.org/abs/2603.29928v1
Date: Tue, 31 Mar 2026 16:01:56 GMT
Status: translation complete
Last updated in system: 2026-04-01 15:25:03.782134
Title: ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules
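Title (reference translation): ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules
Authors: Jonas Landsgesell, Pascal Knoll
Abstract summary: Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet regression benchmarks evaluate them almost exclusively via RMSE and R². ScoringBench is an open benchmark that computes a comprehensive suite of proper scoring rules, such as CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score, alongside standard point metrics. The results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal.
Reference score (independently computed attention score): 0.7009487789080344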
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R²). These aggregate measures often obscure model performance in the tails of the distribution, a critical deficit for high-stakes decision-making in domains like finance and clinical research, where asymmetric risk profiles are the norm. We introduce ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules like CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score alongside standard point metrics, providing a richer picture of probabilistic forecast quality. We evaluate realTabPFNv2.5 fine-tuned with different scoring rule objectives, and TabICL, relative to untuned realTabPFNv2.5 across a suite of regression benchmarks. Our results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal. This demonstrates that for applications sensitive to extreme events, the choice of evaluation metric is as much a domain-specific requirement as the data itself. ScoringBench is available at https://github.com/jonaslandsgesell/ScoringBench. A live preview of the current leaderboard is available at https://scoringbench.bolt.host. The leaderboard is maintained via git pull requests to ensure transparency, traceability, agility, and reproducibility.
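Abstract (reference translation): Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R²). These aggregate measures often obscure model performance in the tails of the distribution, a critical deficit for high-stakes decision-making in domains such as finance and clinical research, where asymmetric risk profiles are the norm. We introduce ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules such as CRPS and CRLS.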
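
As a rough illustration of why a point metric and a proper scoring rule can rank forecasts differently, the sketch below compares two predictive sample sets that share the same mean but differ in spread. It is not taken from ScoringBench; the function names `crps_from_samples` and `interval_score` are hypothetical, and the code assumes the standard sample-based CRPS estimator (E|X - y| - 0.5 E|X - X'|) and the standard interval score for a central (1 - alpha) prediction interval.

```python
import statistics


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    term_obs = sum(abs(x - y) for x in samples) / n
    # For sorted values, the sum of |x_i - x_j| over all ordered pairs equals
    # 2 * sum_i (2i - n + 1) * x_i with 0-based index i.
    xs = sorted(samples)
    pair_sum = 2 * sum((2 * i - n + 1) * x for i, x in enumerate(xs))
    term_spread = pair_sum / (n * n)
    return term_obs - 0.5 * term_spread


def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) interval (lower is better)."""
    width = upper - lower
    penalty_below = (2 / alpha) * max(lower - y, 0.0)
    penalty_above = (2 / alpha) * max(y - upper, 0.0)
    return width + penalty_below + penalty_above


# Two illustrative forecasts with the same mean but different spread.
y_true = 1.0
narrow = [0.9, 1.0, 1.1]
wide = [-1.0, 1.0, 3.0]

# The squared error of the predictive mean cannot tell them apart ...
se_narrow = (statistics.mean(narrow) - y_true) ** 2
se_wide = (statistics.mean(wide) - y_true) ** 2

# ... while CRPS rewards the sharper forecast that is still centered on y_true.
crps_narrow = crps_from_samples(narrow, y_true)
crps_wide = crps_from_samples(wide, y_true)

print(se_narrow, se_wide)
print(crps_narrow, crps_wide)
print(interval_score(0.0, 2.0, 3.0, alpha=0.1))
```

In this toy case both forecasts have essentially the same squared error of the mean, yet the CRPS of the narrow forecast is smaller, which is the kind of disagreement between metrics that the abstract describes.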


Related papers