Automating Financial Statement Audits with Large Language Models
- URL: http://arxiv.org/abs/2506.17282v1
- Date: Sat, 14 Jun 2025 17:07:06 GMT
- Title: Automating Financial Statement Audits with Large Language Models
- Authors: Rushi Wang, Jiateng Liu, Weijie Zhao, Shenglan Li, Denghui Zhang
- Abstract summary: We harness large language models (LLMs) to automate financial statement auditing. Our work introduces a benchmark using a curated dataset combining real-world financial tables with synthesized transaction data. Our testing reveals that current state-of-the-art LLMs successfully identify financial statement errors when given historical transaction data.
- Score: 8.568971444669868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Financial statement auditing is essential for stakeholders to understand a company's financial health, yet current manual processes are inefficient and error-prone. Even with extensive verification procedures, auditors frequently miss errors, leading to inaccurate financial statements that fail to meet stakeholder expectations for transparency and reliability. To this end, we harness large language models (LLMs) to automate financial statement auditing and rigorously assess their capabilities, providing insights into their performance boundaries in the automated auditing scenario. Our work introduces a comprehensive benchmark using a curated dataset combining real-world financial tables with synthesized transaction data. Within the benchmark, we develop a rigorous five-stage evaluation framework to assess LLMs' auditing capabilities. The benchmark also challenges models to map specific financial statement errors to corresponding violations of accounting standards, simulating real-world auditing scenarios through test cases. Our testing reveals that current state-of-the-art LLMs successfully identify financial statement errors when given historical transaction data. However, these models demonstrate significant limitations in explaining detected errors and citing relevant accounting standards. Furthermore, LLMs struggle to execute complete audits and make necessary financial statement revisions. These findings highlight a critical gap in LLMs' domain-specific accounting knowledge. Future research must focus on enhancing LLMs' understanding of auditing principles and procedures. Our benchmark and evaluation framework establish a foundation for developing more effective automated auditing tools that will substantially improve the accuracy and efficiency of real-world financial statement auditing.
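To make the evaluated task concrete: the benchmark asks a model both to detect misstatements from historical transaction data and to tie each detected error to the accounting standard it violates. Below is a minimal sketch of what one detection-and-citation pass plus its scoring could look like; the prompt format, the `Finding` schema, and the `llm` callable are illustrative assumptions, not the paper's actual five-stage framework.

```python
# Hypothetical sketch of the detection stage: given transactions and a
# statement, ask an LLM to flag errors and cite the violated standard.
import json
from dataclasses import dataclass

@dataclass
class Finding:
    line_item: str        # statement line the model flags
    is_error: bool        # whether the model believes it is misstated
    cited_standard: str   # e.g. an IFRS/GAAP reference, "" if none

PROMPT = """You are auditing a financial statement.
Transactions (JSON): {transactions}
Statement line items (JSON): {statement}
Return a JSON list of objects with keys: line_item, is_error, cited_standard."""

def audit_statement(llm, transactions, statement):
    """One detection pass: ask the model for findings and parse them."""
    reply = llm(PROMPT.format(transactions=json.dumps(transactions),
                              statement=json.dumps(statement)))
    return [Finding(**f) for f in json.loads(reply)]

def score(findings, gold):
    """Detection accuracy, plus how often a flagged error also carries
    the correct standards citation -- the step the paper reports
    current models struggle with."""
    detected = sum(f.is_error == gold[f.line_item]["is_error"] for f in findings)
    flagged = [f for f in findings if f.is_error]
    cited = sum(f.cited_standard == gold[f.line_item]["standard"] for f in flagged)
    return detected / len(findings), cited / max(1, len(flagged))
```

Scoring detection and citation separately mirrors the paper's central finding: models identify errors far more reliably than they explain them or ground them in accounting standards.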
Related papers
- FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance [0.06597195879147556]
Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. We develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs. Our work serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
arXiv Detail & Related papers (2025-08-07T09:37:14Z) - Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation [3.077814260904367]
We propose FinAR-Bench, a benchmark dataset focusing on financial statement analysis. We break this task into three measurable steps: extracting key information, calculating financial indicators, and applying logical reasoning (a minimal sketch of the indicator-calculation step appears after this list). Our findings offer a clear understanding of LLMs' current strengths and limitations in fundamental analysis.
arXiv Detail & Related papers (2025-05-22T07:06:20Z) - FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation [63.55583665003167]
We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance. FinDER provides search-relevant evidence annotated by domain experts, comprising 5,703 query-evidence-answer triplets. By challenging models to retrieve relevant information from large corpora, FinDER offers a more realistic benchmark for evaluating RAG systems (see the retrieval-scoring sketch after this list).
arXiv Detail & Related papers (2025-04-22T11:30:13Z) - FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced fact-checking research. Existing automated fact-checking evaluation methods rely on static datasets and classification metrics. We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z) - FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models [0.0]
FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work. Current LLMs fail to meet the strict accuracy requirements of financial institutions, with models missing approximately 60% of realistic tasks. Results show that higher-quality training data is needed to support such tasks, which we explore using OpenAI's fine-tuning API.
arXiv Detail & Related papers (2025-01-30T00:06:55Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an LLM-driven automated penetration testing agent based on the PSM principle.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings (a sketch of this embed-then-detect pipeline appears after this list).
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z) - AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework [48.3060010653088]
We release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data.
We then use the AlphaFin datasets to benchmark Stock-Chain, a state-of-the-art method for effectively tackling the financial analysis task.
arXiv Detail & Related papers (2024-03-19T09:45:33Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework that includes the first financial large language model (LLM), built by fine-tuning LLaMA with instruction data.
We propose FinMA, obtained by fine-tuning LLaMA on the constructed dataset so that it can follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z) - Federated Continual Learning to Detect Accounting Anomalies in Financial Auditing [1.2205797997133396]
We propose a Federated Continual Learning framework that enables auditors to continuously learn audit models from decentralized clients.
We evaluate the framework's ability to detect accounting anomalies in common scenarios of organizational activity.
arXiv Detail & Related papers (2022-10-26T21:33:08Z) - Multi-view Contrastive Self-Supervised Learning of Accounting Data Representations for Downstream Audit Tasks [1.9659095632676094]
International audit standards require the direct assessment of a financial statement's underlying accounting transactions, referred to as journal entries.
Deep-learning-inspired audit techniques have emerged for examining vast quantities of journal entry data.
We propose a contrastive self-supervised learning framework designed to learn audit-task-invariant representations of accounting data (a sketch of such a contrastive objective appears after this list).
arXiv Detail & Related papers (2021-09-23T08:16:31Z)
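For the FinAR-Bench entry above, the middle step of its three-step breakdown, calculating financial indicators, is easy to make concrete. A minimal sketch, assuming a flat dict of extracted line items and a relative-tolerance grader (both assumptions, not the benchmark's actual schema):

```python
# Sketch of the "calculating financial indicators" step from FinAR-Bench.
# The ratio formulas are standard; the item names and grading tolerance
# are illustrative assumptions.
def financial_indicators(items: dict) -> dict:
    """Compute common ratios from extracted statement line items."""
    return {
        "current_ratio": items["current_assets"] / items["current_liabilities"],
        "debt_to_equity": items["total_liabilities"] / items["total_equity"],
        "net_margin": items["net_income"] / items["revenue"],
    }

def grade(model_value: float, true_value: float, rel_tol: float = 1e-3) -> bool:
    """Mark a model-computed indicator correct within a relative tolerance."""
    return abs(model_value - true_value) <= rel_tol * abs(true_value)

# Usage: indicators from ground-truth extractions become the reference
# against which an LLM's own calculations are graded.
truth = financial_indicators({
    "current_assets": 120.0, "current_liabilities": 80.0,
    "total_liabilities": 300.0, "total_equity": 200.0,
    "net_income": 15.0, "revenue": 250.0,
})
assert grade(1.5, truth["current_ratio"])
```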
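For the FinDER entry, query-evidence-answer triplets let a retriever be scored directly: did the annotated evidence appear among the top-k retrieved passages? A sketch of recall@k under an assumed triplet schema (`query`, `evidence_id`) and retriever interface, neither taken from the dataset's actual format:

```python
# Sketch of retrieval evaluation over FinDER-style triplets.
from typing import Callable, List

def recall_at_k(triplets: List[dict],
                retrieve: Callable[[str, int], List[str]],
                k: int = 5) -> float:
    """Fraction of queries whose annotated evidence id is retrieved."""
    hits = 0
    for t in triplets:
        top_ids = retrieve(t["query"], k)   # ids of the top-k retrieved passages
        hits += t["evidence_id"] in top_ids
    return hits / len(triplets)
```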
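For the LLM-embedding anomaly detection entry, one plausible reading of the approach is: serialize each non-semantic financial record as text, embed it with an LLM embedding model, and hand the vectors to a standard outlier detector. `embed_fn` is a placeholder for any embedding API, and the serialization and detector choice are assumptions, not the paper's method:

```python
# Sketch: LLM embeddings of serialized financial records fed to a
# standard outlier detector (scikit-learn's IsolationForest).
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_anomalies(records, embed_fn, contamination=0.01):
    """Return -1 for records flagged as anomalous, 1 otherwise."""
    texts = [" ".join(f"{k}={v}" for k, v in sorted(r.items())) for r in records]
    X = np.array([embed_fn(t) for t in texts])  # one vector per record
    detector = IsolationForest(contamination=contamination, random_state=0)
    return detector.fit_predict(X)
```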
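Finally, for the multi-view contrastive entry, the standard NT-Xent (InfoNCE-style) objective captures the core idea: two augmented views of the same journal entry should embed close together while views of different entries are pushed apart. A PyTorch sketch of that loss; the encoder and the augmentations that produce the two views are omitted and are not the paper's specifics:

```python
# Sketch of an NT-Xent contrastive loss over two views of N entries.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, d) embeddings of two views of the same N entries."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2N, d) unit vectors
    sim = z @ z.T / tau                           # pairwise similarity logits
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    n = z1.size(0)
    # positives: view i pairs with view i+n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Minimizing such a loss yields representations insensitive to the augmentation noise, which is what "audit-task-invariant" representations of accounting data require.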