Fugu-MT Paper Translation (Abstract): From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting

Paper overview: From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting

arxiv url: http://arxiv.org/abs/2603.19254v1
Date: Wed, 25 Feb 2026 13:44:58 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:12.782304
Title: From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting
Title (reference translation): From Comprehension to Reasoning: A Hierarchical Benchmark for Financial Research Reports
Authors: Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng, Jinru Ding, Yejie Zheng, Jie Xu
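Abstract summary: FinReasoning is a benchmark that decomposes Chinese research-report generation into three stages. Its evaluation results show that most models exhibit a gap between understanding and execution.
Reference score (attention score computed independently by this site): 19.0993436440595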
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures (factual errors, numerical inconsistencies, fabricated references, and shallow analysis) that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. Existing financial benchmarks, however, focus on comprehension of completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit an understanding-execution gap: they can identify errors but struggle to generate accurate corrections, and they can retrieve data but have difficulty returning it in the correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at https://github.com/TongjiFinLab/FinReasoning.
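Abstract (reference translation): Large language models (LLMs) are increasingly being used to generate financial research reports, moving from auxiliary analysis tools to primary content producers. However, recent real-world deployments have revealed persistent failures such as factual errors, numerical inconsistencies, fabricated references, and shallow analysis. Existing financial benchmarks focus on comprehension of completed reports rather than on evaluating whether a model can produce reliable analysis. In addition, current evaluation frameworks only flag hallucinations and lack structured measures for deeper analytical skills, so key analytical bottlenecks remain undiscovered. To address these gaps, the paper introduces FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows and assesses semantic consistency, data alignment, and deep insight. It further proposes a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning shows that most models have a gap between understanding and execution. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three overall. The evaluation resource is available at https://github.com/TongjiFinLab/FinReasoning.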
