Fugu-MT Paper Translation (Abstract): QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

Paper overview: QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

arxiv url: http://arxiv.org/abs/2604.15859v1
Date: Fri, 17 Apr 2026 09:06:24 GMT
Status: translation complete
System update date: 2026-04-20 22:00:19.847597
Title: QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
Authors: Jeremy Qin, Maksym Andriushchenko
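Abstract summary: The authors propose prediction intervals as a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, they introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings. None of the 11 evaluated frontier and open-weight models achieves the 90% coverage target; the top performers, Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%), all fall at least 10 percentage points short.
Reference score (attention score computed independently by this site): 18.055205233907248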
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
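The two quantities the abstract says are assessed, empirical coverage and interval sharpness, can be illustrated with a minimal sketch. The function names, the choice of mean interval width as the sharpness measure, and the sample numbers are assumptions made for illustration; they are not the benchmark's actual implementation or data.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def _check_inputs(intervals: Sequence[Interval], outcomes: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing any metric."""
    if len(intervals) == 0:
        raise ValueError("at least one forecast is required")
    if len(intervals) != len(outcomes):
        raise ValueError("intervals and outcomes must have the same length")


def empirical_coverage(intervals: Sequence[Interval], outcomes: Sequence[float]) -> float:
    """Fraction of realized outcomes that fall inside their predicted interval."""
    _check_inputs(intervals, outcomes)
    hits = sum(1 for (lo, hi), y in zip(intervals, outcomes) if lo <= y <= hi)
    return hits / len(outcomes)


def mean_width(intervals: Sequence[Interval]) -> float:
    """Average interval width; narrower intervals are sharper.

    Assumption: raw width is used here. Quantities on very different
    scales would likely need a normalized width instead.
    """
    if len(intervals) == 0:
        raise ValueError("at least one forecast is required")
    return sum(hi - lo for lo, hi in intervals) / len(intervals)


if __name__ == "__main__":
    # Hypothetical 90% prediction intervals and realized values.
    intervals = [(0.0, 10.0), (5.0, 15.0), (100.0, 200.0), (1.0, 2.0)]
    outcomes = [5.0, 20.0, 150.0, 1.5]
    print("coverage:", empirical_coverage(intervals, outcomes))
    print("mean width:", mean_width(intervals))
```

Under this reading, a forecaster asked for 90% intervals is well calibrated when the coverage is close to 0.90. Coverage well below that, as the abstract reports for all evaluated models, indicates intervals that are too narrow, which is to say overconfidence.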

