FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
- URL: http://arxiv.org/abs/2510.08886v1
- Date: Fri, 10 Oct 2025 00:41:55 GMT
- Title: FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
- Authors: Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie
- Abstract summary: FinAuditing is the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks: FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.
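The FinMR subtask described above targets numerical consistency: checking that reported figures satisfy the arithmetic relationships declared in a filing's taxonomy, in the style of an XBRL calculation linkbase (parent value = weighted sum of child values). The following is a minimal illustrative sketch of such a check, not the benchmark's actual data format; the concept names, values, and tolerance are hypothetical.

```python
# Hypothetical mini-filing: concept -> reported value (USD millions).
facts = {
    "Assets": 150.0,
    "AssetsCurrent": 60.0,
    "AssetsNoncurrent": 90.0,
    "Liabilities": 100.0,
    "StockholdersEquity": 50.0,
}

# Calculation arcs in the style of an XBRL calculation linkbase:
# each parent should equal the sum of (weight * child value).
calc_arcs = {
    "Assets": [("AssetsCurrent", 1.0), ("AssetsNoncurrent", 1.0)],
    "LiabilitiesAndStockholdersEquity": [("Liabilities", 1.0),
                                         ("StockholdersEquity", 1.0)],
}

def check_numerical_consistency(facts, calc_arcs, tol=0.01):
    """Return a list of (parent, reported, computed) inconsistencies."""
    errors = []
    for parent, children in calc_arcs.items():
        if parent not in facts:
            continue  # parent concept not reported in this filing
        computed = sum(weight * facts[child]
                       for child, weight in children if child in facts)
        if abs(facts[parent] - computed) > tol:
            errors.append((parent, facts[parent], computed))
    return errors

print(check_numerical_consistency(facts, calc_arcs))  # [] -> consistent
```

Real filings make this far harder than the sketch suggests: the relevant facts and arcs are spread across multiple interlinked documents, which is the hierarchical multi-document reasoning on which the abstract reports 60-90% accuracy drops.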
Related papers
- Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents [0.21485350418225244]
Large Language Models (LLMs) have shown promising results in multiple Question-Answering (Q-A) systems. Structured data augmentations, such as Knowledge Graphs (KGs), have notably improved the predictions of LLMs. This paper proposes a framework that incorporates structured information using KGs alongside LLM predictions for numerical reasoning tasks.
arXiv Detail & Related papers (2026-01-12T17:39:08Z) - FinSight: Towards Real-World Financial Deep Research [68.31086471310773]
FinSight is a novel framework for producing high-quality, multimodal financial reports. To ensure professional-grade visualization, the authors propose an Iterative Vision-Enhanced Mechanism. A two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports.
arXiv Detail & Related papers (2025-10-19T14:05:35Z) - FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. The authors provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z) - Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study [1.6770212301915661]
This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of the CFA program. It compares models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized.
arXiv Detail & Related papers (2025-08-29T06:13:21Z) - FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance. The benchmark consists of 26K expert-annotated examples on S&P 500-listed firms. The authors evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z) - FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in Finance [3.565466729914703]
Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. The authors develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs. This work serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
arXiv Detail & Related papers (2025-08-07T09:37:14Z) - FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information [47.37027539828975]
FinTagging is the first comprehensive benchmark for structure-aware, full-scope financial tagging. FinNI, for numeric identification, extracts numerical entities and their types from financial reports. FinCL, for concept linking, maps each extracted entity to the corresponding concept in the full U.S. taxonomy.
arXiv Detail & Related papers (2025-05-27T02:55:53Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLMs) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)