Fugu-MT Paper Translation (Abstract): Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents

Paper overview: Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents

arxiv url: http://arxiv.org/abs/2605.00420v2
Date: Mon, 04 May 2026 07:21:37 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.432403
Title: Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents
Title (reference translation): Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents
Authors: Maksym Nechepurenko, Pavel Shuvalov
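Abstract summary: We introduce Foresight Arena, the first permissionless, on-chain benchmark for evaluating AI forecasting agents on real-world prediction markets. Performance is measured by the Brier Score and the Alpha Score. Detecting a true edge of $α^* = 0.02$ at 80% power requires approximately 350 resolved binary predictions.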
Reference score (independently computed attention metric): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the true forecasting ability of AI agents requires environments that are resistant to overfitting, free from centralized trust, and grounded in incentive-compatible scoring. Existing benchmarks either rely on static datasets vulnerable to training-data contamination, or measure trading PnL -- a metric conflating predictive accuracy with timing, sizing, and risk appetite. We introduce Foresight Arena, the first permissionless, on-chain benchmark for evaluating AI forecasting agents on real-world prediction markets. Agents submit probabilistic forecasts on binary Polymarket markets via a commit-reveal protocol enforced by Solidity smart contracts on Polygon PoS; outcomes are resolved trustlessly through the Gnosis Conditional Token Framework. Performance is measured by the Brier Score and a novel Alpha Score -- proper scoring rules that incentivize honest probability reporting and isolate predictive edge over market consensus. We provide a formal analysis: closed-form variance for per-market Alpha, the connection to Murphy's classical Brier decomposition, and a power analysis characterizing the number of rounds required to reliably distinguish agents of different skill levels. We show that detecting a true edge of $α^* = 0.02$ at 80% power requires approximately 350 resolved binary predictions (50 rounds of 7 markets), while $α^* = 0.01$ requires four times more. We complement these analytical results with a deterministic, seed-controlled simulation study calibrated to literature-reported Brier-score ranges, illustrating how Murphy decomposition distinguishes well-calibrated agents from market-tracking agents that fail through reduced resolution. Live results from the deployed benchmark will be reported in a future revision. All smart contracts and evaluation infrastructure are open-source.
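Abstract (reference translation): Evaluating the true forecasting ability of AI agents requires environments that resist overfitting, are free from centralized trust, and are grounded in incentive-compatible scoring. Existing benchmarks either rely on static datasets vulnerable to training-data contamination or measure trading PnL, a metric that mixes predictive accuracy with timing, sizing, and risk appetite. We introduce Foresight Arena, the first permissionless, on-chain benchmark for evaluating AI forecasting agents on real-world prediction markets. Agents submit probabilistic forecasts on binary Polymarket markets via a commit-reveal protocol enforced by Solidity smart contracts on Polygon PoS. Performance is measured by the Brier Score and the Alpha Score, proper scoring rules that incentivize honest probability reporting and isolate predictive edge over market consensus. We provide the closed-form variance of per-market Alpha, the connection to Murphy's classical Brier decomposition, and a power analysis giving the number of rounds needed to reliably distinguish agents of different skill levels. Detecting a true edge of $α^* = 0.02$ at 80% power requires approximately 350 resolved binary predictions (50 rounds of 7 markets), while $α^* = 0.01$ requires four times as many. We complement these analytical results with a deterministic, seed-controlled simulation study calibrated to Brier-score ranges reported in the literature, and examine how Murphy decomposition distinguishes well-calibrated agents from market-tracking agents that fail through reduced resolution. Live results from the deployed benchmark will be reported in a future revision. All smart contracts and evaluation infrastructure are open-source.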
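
The abstract's claim that halving the edge from $α^* = 0.02$ to $α^* = 0.01$ needs four times as many predictions matches the usual inverse-square scaling of sample size with effect size. The sketch below is illustrative only and is not the paper's method: it assumes a one-sided z-test on i.i.d. per-prediction edges, and the function names, the significance level of 0.05, and the standard deviation `sigma` are hypothetical choices that do not come from the abstract. It also includes the standard Brier score for binary outcomes.

```python
from statistics import NormalDist


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty sequences")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


def required_predictions(edge, sigma, alpha=0.05, power=0.80):
    """Approximate sample size for a one-sided z-test that the mean edge is > 0.

    ASSUMPTION (not from the paper): per-prediction edges are i.i.d. with
    standard deviation `sigma`, and the test is run at significance `alpha`.
    Returns an unrounded float; round up to get a whole number of predictions.
    """
    std_normal = NormalDist()
    z_total = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    return (z_total * sigma / edge) ** 2


if __name__ == "__main__":
    sigma = 0.15  # hypothetical placeholder, not a value reported in the abstract
    n_large_edge = required_predictions(0.02, sigma)
    n_small_edge = required_predictions(0.01, sigma)
    # The ratio does not depend on sigma, alpha, or power: n scales as 1 / edge**2.
    print(n_small_edge / n_large_edge)
```

Because `edge` enters only through the squared denominator, the fourfold ratio holds for any `sigma`, `alpha`, or `power`; the absolute count of about 350 depends on the paper's own variance formula, which the abstract does not state.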

Related papers