Related papers: RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?

URL: http://arxiv.org/abs/2602.07096v1
Date: Fri, 06 Feb 2026 13:47:54 GMT
Title: RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?
Authors: Yuyang Dai, Yan Lin, Zhuohan Xie, Yuxia Wang,
Abstract summary: We introduce REALFIN, a benchmark that evaluates financial reasoning by systematically removing essential premises from exam-style questions. General-purpose models tend to over-commit and guess, while most finance-specialized models fail to clearly identify missing premises. These results highlight a critical gap in current evaluations and show that reliable financial models must know when a question should not be answered.
Score: 15.081940501866844
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Reliable financial reasoning requires knowing not only how to answer, but also when an answer cannot be justified. In real financial practice, problems often rely on implicit assumptions that are taken for granted rather than stated explicitly, causing problems to appear solvable while lacking enough information for a definite answer. We introduce REALFIN, a bilingual benchmark that evaluates financial reasoning by systematically removing essential premises from exam-style questions while keeping them linguistically plausible. Based on this, we evaluate models under three formulations that test answering, recognizing missing information, and rejecting unjustified options, and find consistent performance drops when key conditions are absent. General-purpose models tend to over-commit and guess, while most finance-specialized models fail to clearly identify missing premises. These results highlight a critical gap in current evaluations and show that reliable financial models must know when a question should not be answered.

Related papers

Evaluating LLMs in Finance Requires Explicit Bias Consideration [88.38155218924999]
Finance-specific biases can inflate performance, contaminate backtests, and make reported results useless for deployment claims. No single bias is discussed in more than 28 percent of studies. We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design.
arXiv Detail & Related papers (2026-02-15T17:02:01Z)
Knowing What's Missing: Assessing Information Sufficiency in Question Answering [3.8786514101828167]
We propose a structured Identify-then-Verify framework for robust sufficiency modeling. We evaluate our method against established baselines across diverse multi-hop and factual QA datasets.
arXiv Detail & Related papers (2025-12-06T15:58:22Z)
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z)
XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning [28.967959142733903]
We introduce XFinBench, a novel benchmark to evaluate large language models' ability in solving financial problems. O1 is the best-performing text-only model with an overall accuracy of 67.3%, but still lags significantly behind human experts, by 12.5%. We construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that relevant knowledge only brings consistent accuracy improvements to small open-source models.
arXiv Detail & Related papers (2025-08-20T15:23:35Z)
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions [32.871820908561936]
AbstentionBench is a benchmark for holistically evaluating abstention across 20 diverse datasets. We find that reasoning fine-tuning degrades abstention even for math and science domains.
arXiv Detail & Related papers (2025-06-10T17:57:30Z)
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning [82.7292329605713]
FinChain is the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. It spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces. FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
arXiv Detail & Related papers (2025-06-03T06:44:42Z)
FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation [65.04104723843264]
We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance. FinDER focuses on annotating search-relevant evidence by domain experts, offering 5,703 query-evidence-answer triplets. By challenging models to retrieve relevant information from large corpora, FinDER offers a more realistic benchmark for evaluating RAG systems.
arXiv Detail & Related papers (2025-04-22T11:30:13Z)
Understanding Financial Reasoning in AI: A Multimodal Benchmark and Error Learning Approach [6.911426601915051]
This paper introduces a new benchmark designed to evaluate how well AI models, especially large language and multimodal models, reason in finance-specific contexts. We propose an error-aware learning framework that leverages historical model mistakes and feedback to guide inference, without requiring fine-tuning. The results highlight persistent challenges in visual understanding and mathematical logic, while also demonstrating the promise of self-reflective reasoning in financial AI systems.
arXiv Detail & Related papers (2025-04-22T07:25:03Z)
Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination". We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.