BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
- URL: http://arxiv.org/abs/2602.17072v2
- Date: Thu, 26 Feb 2026 06:36:24 GMT
- Title: BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
- Authors: Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo
- Abstract summary: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain. These models still exhibit low accuracy in core banking computations. BankMathBench is a domain-specific dataset that reflects realistic banking tasks.
- Score: 45.48548225665319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Such errors have rarely been captured by existing benchmarks: mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized into three levels of difficulty (basic, intermediate, and advanced) corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced) over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
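To make concrete the kind of computation the abstract describes, the following is a minimal sketch (not taken from the paper or its dataset) of a single-product payout calculation and a multi-product comparison: a lump-sum deposit versus a monthly installment savings plan, both with monthly compounding. The installment total is a geometric series, the kind of calculation the abstract notes LLMs often get wrong. All rates, amounts, and function names here are illustrative assumptions.

```python
# Hypothetical example of a banking computation in the style the
# benchmark targets; not the authors' code or data.

def deposit_payout(principal: float, annual_rate: float, months: int) -> float:
    """Lump-sum deposit compounded monthly: P * (1 + r/12)^n."""
    monthly = annual_rate / 12
    return principal * (1 + monthly) ** months

def installment_payout(monthly_deposit: float, annual_rate: float, months: int) -> float:
    """Monthly installment savings: the deposit made at the start of
    month k compounds for (n - k + 1) periods, so the total payout is
    the geometric series m * sum_{j=1..n} (1 + r/12)^j."""
    monthly = annual_rate / 12
    growth = 1 + monthly
    # Closed form: m * growth * (growth^n - 1) / (growth - 1)
    return monthly_deposit * growth * (growth ** months - 1) / monthly

if __name__ == "__main__":
    # Compare 12,000 up front vs. 1,000 per month for a year at 3% APR:
    # same total principal, but the lump sum earns more interest.
    print(round(deposit_payout(12_000, 0.03, 12), 2))
    print(round(installment_payout(1_000, 0.03, 12), 2))
```

A correct model answer must recognize the product type (lump sum vs. installment) before choosing the formula, which is exactly the product-misinterpretation failure mode the abstract describes.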
Related papers
- Error-Driven Prompt Optimization for Arithmetic Reasoning [0.0]
We introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA). We find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions, dramatically improves performance. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization.
arXiv Detail & Related papers (2025-12-15T13:39:14Z) - Accept or Deny? Evaluating LLM Fairness and Performance in Loan Approval across Table-to-Text Serialization Approaches [57.5863675268117]
Large Language Models (LLMs) are increasingly employed in high-stakes decision-making tasks, such as loan approvals. We assess the performance and fairness of LLMs on serialized loan approval datasets from Ghana, Germany, and the United States.
arXiv Detail & Related papers (2025-08-29T10:51:41Z) - FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance. The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms. We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z) - FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z) - LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data [7.317765812144531]
We present a benchmark designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB). Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times.
arXiv Detail & Related papers (2025-02-13T10:56:58Z) - Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges [0.0]
We introduce GSM-Ranges, a dataset generator that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. We also propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy.
arXiv Detail & Related papers (2025-02-12T09:53:10Z) - Advanced User Credit Risk Prediction Model using LightGBM, XGBoost and Tabnet with SMOTEENN [8.225603728650478]
We use a dataset of over 40,000 records provided by a commercial bank as the research object.
Experiments demonstrated that LightGBM combined with PCA and SMOTEENN techniques can assist banks in accurately predicting potential high-quality customers.
arXiv Detail & Related papers (2024-08-07T01:37:10Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - Bank transactions embeddings help to uncover current macroeconomics [0.8029971974118232]
We use clients' financial transaction data from a large Russian bank to derive macroeconomic indexes.
We develop an efficient approach that allows fast and accurate estimation of macroeconomic indexes based on a stream of transactions.
arXiv Detail & Related papers (2021-10-14T14:53:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.