Fugu-MT Paper Translation (Abstract): IPO Finance Agent: Evaluation of LLM Financial Analysts beyond Finance Agent v2, with Automated Rubric Generation -- the Case of the SpaceX (SPCX) IPO

Paper overview: IPO Finance Agent: Evaluation of LLM Financial Analysts beyond Finance Agent v2, with Automated Rubric Generation -- the Case of the SpaceX (SPCX) IPO

arxiv url: http://arxiv.org/abs/2606.23032v1
Date: Mon, 22 Jun 2026 08:42:19 GMT
Status: Translation complete
System update date: 2026-06-24 16:10:15.173598
Title: IPO Finance Agent: Evaluation of LLM Financial Analysts beyond Finance Agent v2, with Automated Rubric Generation -- the Case of the SpaceX (SPCX) IPO
Title (reference translation): IPO Finance Agent: LLM Financial Analysts beyond Finance Agent v2, with Automated Rubric Generation -- the SpaceX (SPCX) IPO
Authors: Mostapha Benhenda
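Abstract summary: Finance Agent v2 (by Vals AI) has emerged as the reference benchmark for evaluating both Anthropic Claude and OpenAI ChatGPT frontier language models. We introduce IPO Finance Agent, which extends the Finance Agent framework along two directions: task domain and retrieval architecture. The best-performing evaluated model, Alibaba Qwen 3.7 Max, reaches 79.4% accuracy at $0.30 per query, and the most cost-efficient model on the resulting Pareto frontier, Xiaomi MiMo-2.5 Pro, reaches slightly lower accuracy (76.8%) at $0.05 per query.
Reference score (independently computed attention level): 0.0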
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Finance Agent v2 (by Vals AI) has emerged as the reference benchmark for evaluating both Anthropic Claude and OpenAI ChatGPT frontier language models on financial tasks. However, it narrowly deals with periodic reporting from publicly traded companies (SEC 10-K and 10-Q filings), and its agentic harness relies on naive, unenriched chunk retrieval. Neither the task design nor the retrieval approach addresses the distinct challenges of IPO due diligence. SEC S-1 filings combine historical financial statements, governance structures, pro forma and common-control accounting treatments, capital-formation narratives, and underwriting-sensitive risk disclosures within substantially longer documents than typical periodic filings. That is why we introduce IPO Finance Agent, which extends the Finance Agent v2 framework along two directions: task domain and retrieval architecture. During our experiments, the original Finance Agent v2 harness basically failed to deliver any output related to the SpaceX S-1 filing, due to document length. We therefore had to improve the agentic harness with contextual retrieval, a more realistic and industry-standard approach for long documents. We also built a dataset of 1,000 IPO-diligence questions, and publicly release 70 questions on the SpaceX (SPCX) S-1 filing to support reproducibility, while the remainder are held private to guard against benchmark contamination. In addition, we introduce an evaluator-optimizer pipeline to automatically generate evaluation rubrics for the benchmark: candidate facts are extracted from an ensemble of independently-generated model answers to each question, consolidated into draft criteria, then automatically audited for omissions, hallucinations, mistiered items, and redundancy, with LLM feedback driving iterative repair, targeted enrichment, and deduplication. Human experts only review final rubrics before deployment. 
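The evaluator-optimizer loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name here is hypothetical, the LLM calls are replaced by deterministic stand-ins (sentence splitting for fact extraction, ensemble support counts for the hallucination check), and the audit for mistiered items is omitted.

```python
from collections import Counter


def extract_facts(answer):
    # Hypothetical stand-in for LLM fact extraction:
    # one normalized "fact" per sentence of the answer.
    return {s.strip().lower() for s in answer.split(".") if s.strip()}


def audit(rubric, counts, min_support):
    # Evaluator step: report omissions, unsupported criteria, and duplicates.
    omissions = [f for f, c in counts.items() if c >= min_support and f not in rubric]
    unsupported = [f for f in rubric if counts.get(f, 0) < min_support]
    seen, duplicates = set(), []
    for f in rubric:
        if f in seen:
            duplicates.append(f)
        seen.add(f)
    return omissions, unsupported, duplicates


def build_rubric(answers, draft, min_support=2, max_rounds=5):
    # Count how many ensemble answers support each candidate fact.
    counts = Counter(f for a in answers for f in extract_facts(a))
    rubric = [c.strip().lower() for c in draft]
    for _ in range(max_rounds):
        omissions, unsupported, duplicates = audit(rubric, counts, min_support)
        if not (omissions or unsupported or duplicates):
            break  # audit is clean, stop iterating
        # Optimizer step: drop unsupported criteria, deduplicate, then enrich.
        rubric = list(dict.fromkeys(f for f in rubric if f not in unsupported))
        rubric += sorted(omissions)
    return rubric


if __name__ == "__main__":
    # Made-up placeholder answers, not taken from the benchmark.
    answers = ["Fact A. Fact B.", "Fact A. Fact B. Fact C.", "Fact A."]
    draft = ["Fact A", "Fact A", "Fact Z"]
    print(build_rubric(answers, draft))
```

In this toy run the duplicated criterion is merged, the criterion with no support in the ensemble is dropped, and the omitted but well-supported fact is added. In the pipeline the abstract describes, those audit and repair decisions come from LLM feedback, and human experts review the final rubric.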
Results show that the best-performing evaluated model, Alibaba Qwen 3.7 Max, reaches 79.4% accuracy at $0.30 per query, and the most cost-efficient model on the resulting Pareto frontier, Xiaomi MiMo-2.5 Pro, reaches slightly lower accuracy (76.8%) at $0.05 per query. Both exceed the current Finance Agent v2 leaderboard ceiling (Google Gemini 3.5 Flash at 57.9% for $2.51 per query) while undercutting even FABv2's cheapest entry (MiniMax M3: 48.3% at $0.32) on cost-efficiency. Code and data are released on GitHub: https://github.com/benstaf/ipoagent
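The accuracy and cost comparison above can be checked with a short Pareto-frontier computation. The sketch below uses only the four (accuracy, cost per query) pairs quoted in the abstract. Placing the IPO Finance Agent results and the Finance Agent v2 leaderboard entries on a single frontier follows the abstract's own comparison and is an assumption of this example, since the two come from different benchmarks. The names `Entry`, `dominates`, and `pareto_frontier` are illustrative.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    accuracy: float  # percent, as quoted in the abstract
    cost: float      # USD per query, as quoted in the abstract


ENTRIES = [
    Entry("Alibaba Qwen 3.7 Max", 79.4, 0.30),
    Entry("Xiaomi MiMo-2.5 Pro", 76.8, 0.05),
    Entry("Google Gemini 3.5 Flash", 57.9, 2.51),
    Entry("MiniMax M3", 48.3, 0.32),
]


def dominates(a, b):
    # a dominates b if it is no worse on both axes and strictly better on one.
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(entries):
    # Keep entries that no other entry dominates, ordered by increasing cost.
    frontier = [e for e in entries if not any(dominates(o, e) for o in entries)]
    return sorted(frontier, key=lambda e: e.cost)


if __name__ == "__main__":
    for e in pareto_frontier(ENTRIES):
        print(f"{e.model}: {e.accuracy}% at ${e.cost:.2f} per query")
```

With these four points, the frontier contains only Xiaomi MiMo-2.5 Pro and Alibaba Qwen 3.7 Max: both of the quoted Finance Agent v2 leaderboard entries are dominated by Qwen 3.7 Max, which is more accurate and cheaper per query.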
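Abstract (reference translation): Finance Agent v2 (by Vals AI) has emerged as the reference benchmark for evaluating both Anthropic Claude and OpenAI ChatGPT frontier language models on financial tasks. However, it deals narrowly with periodic reporting from publicly traded companies (SEC 10-K and 10-Q filings), and its agentic harness relies on naive, unenriched chunk retrieval. Neither the task design nor the retrieval approach addresses the distinct challenges of IPO due diligence. SEC S-1 filings combine historical financial statements, governance structures, pro forma and common-control accounting treatments, capital-formation narratives, and underwriting-sensitive risk disclosures within documents substantially longer than typical periodic filings. IPO Finance Agent extends the Finance Agent v2 framework along two directions: task domain and retrieval architecture. In the experiments, the original Finance Agent v2 harness essentially failed to deliver any output related to the SpaceX S-1 filing, due to document length. The agentic harness therefore had to be improved with contextual retrieval, a more realistic and industry-standard approach for long documents. A dataset of 1,000 IPO-diligence questions was also built, and 70 questions on the SpaceX (SPCX) S-1 filing are publicly released to support reproducibility. In addition, candidate facts are extracted from an ensemble of independently generated model answers to each question, consolidated into draft criteria, and automatically audited for omissions, hallucinations, mistiered items, and redundancy, with LLM feedback driving iterative repair, targeted enrichment, and deduplication. Human experts only review final rubrics before deployment. Results show that the best-performing evaluated model, Alibaba Qwen 3.7 Max, reaches 79.4% accuracy at $0.30 per query, and the most cost-efficient model on the resulting Pareto frontier, Xiaomi MiMo-2.5 Pro, reaches slightly lower accuracy (76.8%) at $0.05 per query. Both exceed the current Finance Agent v2 leaderboard ceiling (Google Gemini 3.5 Flash at 57.9% for $2.51 per query) while undercutting even FABv2's cheapest entry (MiniMax M3: 48.3% at $0.32) on cost-efficiency. Code and data are released on GitHub.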
