Fugu-MT Paper Translation (Abstract): FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

Paper overview: FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

arxiv url: http://arxiv.org/abs/2603.20252v1
Date: Wed, 11 Mar 2026 04:37:53 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:12.936937
Title: FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
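Title (reference translation): FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
Authors: Mahesh Kumar, Bhaskarjit Sarmah, Stefano Pasquali
Abstract summary: Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. This work highlights vulnerabilities in current KG-augmented systems and provides insights for building reliable financial information systems.
Reference score (independently computed attention measure): 0.0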
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations, i.e., factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches, spanning LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods, under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran's Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
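Abstract (reference translation): With a growing number of organizations integrating AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, guaranteeing the factual accuracy of AI-generated output has become an important engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms for detecting hallucinations. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples drawn from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol that requires support from both text chunks and extracted relational triplets. We evaluate six detection approaches, spanning LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods, under two conditions: with and without KG triplets. The results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. When noisy triplets are introduced, however, most methods degrade markedly: the Matthews Correlation Coefficient (MCC) falls by 44-84 percent, whereas embedding methods remain relatively robust, degrading by only 9 percent. Statistical tests (Cochran's Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also provides a framework for integrating AI reliability evaluation into information system design in other high-stakes domains such as healthcare, legal, and government.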
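
The abstract reports F1, MCC, a percentage drop in MCC, and a McNemar test. The following is a minimal sketch of how these quantities are conventionally computed. It is not taken from the paper: the function names, the toy labels, the convention that 1 means "hallucinated", and the choice of the exact (binomial) form of McNemar's test are all assumptions made for illustration.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels; 1 = hallucinated (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; defined as 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def relative_drop(clean, noisy):
    """Percent decrease going from the clean score to the noisy score."""
    return 100.0 * (clean - noisy) / clean

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy data (hypothetical, not from the benchmark).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
pred_clean = [1, 1, 1, 0, 0, 0, 0, 1]  # detector output without KG triplets
pred_noisy = [1, 1, 0, 0, 0, 0, 1, 1]  # same detector with noisy triplets added

mcc_clean = mcc(*confusion_counts(y_true, pred_clean))
mcc_noisy = mcc(*confusion_counts(y_true, pred_noisy))

# Discordant pairs: examples where exactly one of the two runs is correct.
b = sum(1 for t, p, q in zip(y_true, pred_clean, pred_noisy) if p == t and q != t)
c = sum(1 for t, p, q in zip(y_true, pred_clean, pred_noisy) if p != t and q == t)

print("MCC clean:", mcc_clean)
print("MCC noisy:", mcc_noisy)
print("Relative MCC drop (%):", relative_drop(mcc_clean, mcc_noisy))
print("McNemar exact p-value:", mcnemar_exact(b, c))
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated by class imbalance than F1, which ignores true negatives. McNemar's test compares two detectors on the same examples using only the cases where they disagree; Cochran's Q generalizes the same idea to more than two detectors.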

Related papers