FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance
- URL: http://arxiv.org/abs/2508.05201v1
- Date: Thu, 07 Aug 2025 09:37:14 GMT
- Title: FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance
- Authors: Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang
- Abstract summary: Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. We develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs. Our work serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. Accurate extraction and precise calculation from tabular data are essential for reliable financial analysis, since even minor numerical errors can undermine decision-making and regulatory compliance. Financial applications have unique requirements, often relying on context-dependent, numerical, and proprietary tabular data that existing hallucination benchmarks rarely capture. In this study, we develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs, conceptualized as a context-aware masked span prediction task over real-world financial documents. Our main contributions are: (1) a novel, automated dataset creation paradigm using a masking strategy; (2) a new hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a comprehensive evaluation of intrinsic hallucination patterns in state-of-the-art LLMs on financial tabular data. Our work provides a robust methodology for in-house LLM evaluation and serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
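The abstract frames intrinsic-hallucination evaluation as a context-aware masked span prediction task over real financial tables. A minimal sketch of that idea follows; the table serialization, masking strategy, and exact-match scoring rule here are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a masked span prediction probe over a financial table:
# hide one numeric cell, ask a model to restore it from context, and
# score any deviation from the source value as an intrinsic hallucination.
# All names and the scoring rule below are assumptions for illustration.

def serialize_table(header, rows):
    """Render a table as markdown, one common way to give tabular context to an LLM."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

def mask_cell(rows, r, c, token="[MASK]"):
    """Return (masked_rows, gold_value) with one cell hidden for the model to predict."""
    masked = [list(row) for row in rows]
    gold = masked[r][c]
    masked[r][c] = token
    return masked, gold

def is_hallucination(prediction, gold, tol=0.0):
    """Exact numeric match: an answer that contradicts the given context
    (beyond tolerance) counts as an intrinsic hallucination."""
    def to_num(s):
        return float(s.replace(",", "").replace("$", ""))
    try:
        return abs(to_num(prediction) - to_num(gold)) > tol
    except ValueError:
        return prediction.strip() != gold.strip()

header = ["Year", "Revenue ($M)", "Net income ($M)"]
rows = [["2022", "1,250", "210"], ["2023", "1,400", "265"]]
masked_rows, gold = mask_cell(rows, 1, 1)
prompt = serialize_table(header, masked_rows) + "\nFill in [MASK] using only the table."
```

Because the gold value comes directly from the masked document itself, this setup isolates intrinsic hallucination (contradicting the provided context) from missing world knowledge, and new evaluation instances can be generated automatically from any table.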
Related papers
- EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements [7.259647868714988]
We introduce EDINET-Bench, an open-source Japanese financial benchmark to evaluate the performance of large language models (LLMs). Our experiments reveal that even state-of-the-art LLMs struggle, performing only slightly better than logistic regression in binary classification for fraud detection and earnings forecasting. Our dataset, benchmark construction code, and evaluation code are publicly available to facilitate future research in finance with LLMs.
arXiv Detail & Related papers (2025-06-10T13:03:36Z)
- QuantMCP: Grounding Large Language Models in Verifiable Financial Reality [0.43512163406552007]
Large Language Models (LLMs) hold immense promise for revolutionizing financial analysis and decision-making. However, their direct application is often hampered by data hallucination and a lack of access to real-time, verifiable financial information. This paper introduces QuantMCP, a novel framework designed to rigorously ground LLMs in financial reality.
arXiv Detail & Related papers (2025-06-07T01:52:39Z)
- FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z)
- ZiGong 1.0: A Large Language Model for Financial Credit [8.49779245416985]
Large Language Models (LLMs) have demonstrated strong performance across various general Natural Language Processing (NLP) tasks. However, their effectiveness in financial credit assessment applications remains suboptimal. We propose ZiGong, a Mistral-based model enhanced through multi-task supervised fine-tuning.
arXiv Detail & Related papers (2025-02-22T09:27:56Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework [48.3060010653088]
We release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data.
We then use AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task.
arXiv Detail & Related papers (2024-03-19T09:45:33Z)
- FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM), based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)
- FinQA: A Dataset of Numerical Reasoning over Financial Data [52.7249610894623]
We focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents.
We propose a new large-scale dataset, FinQA, with question-answering pairs over financial reports, written by financial experts.
The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge.
arXiv Detail & Related papers (2021-09-01T00:08:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.