FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
- URL: http://arxiv.org/abs/2506.02515v1
- Date: Tue, 03 Jun 2025 06:44:42 GMT
- Title: FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
- Authors: Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, Salem Lahlou, Veselin Stoyanov, Preslav Nakov
- Abstract summary: FinChain is the first symbolic benchmark for verifiable Chain-of-Thought (CoT) financial reasoning. FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement.
- Score: 43.74670894224625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
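To make the template-plus-executable-trace idea concrete, below is a minimal, hypothetical sketch of how a FinChain-style parameterized template and a ChainEval-like step check might look. The function names (compound_interest_template, chaineval_steps) and the scoring rule are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch: a parameterized financial-reasoning template that emits
# an executable step-by-step trace, plus a toy step-level evaluation.
# Names and scoring are assumptions for illustration, not FinChain's real code.
import random


def compound_interest_template(seed: int = 0):
    """Sample parameters and return (question, gold reasoning steps, final answer)."""
    rng = random.Random(seed)
    principal = rng.randrange(1_000, 50_000, 500)   # initial deposit in USD
    rate = rng.choice([0.03, 0.04, 0.05, 0.06])     # annual interest rate
    years = rng.randint(2, 10)

    question = (f"An investor deposits ${principal} at an annual rate of "
                f"{rate:.0%}, compounded yearly. What is the balance after {years} years?")

    # Executable trace: each step records a named intermediate value.
    steps = []
    growth_factor = (1 + rate) ** years
    steps.append(("growth_factor", round(growth_factor, 6)))
    balance = principal * growth_factor
    steps.append(("final_balance", round(balance, 2)))
    return question, steps, round(balance, 2)


def chaineval_steps(gold_steps, predicted_values, rel_tol=1e-3):
    """Toy step-level score: fraction of gold intermediate values matched
    (within a relative tolerance) by any number the model produced."""
    matched = 0
    for _, gold in gold_steps:
        if any(abs(pred - gold) <= rel_tol * max(abs(gold), 1e-9) for pred in predicted_values):
            matched += 1
    return matched / len(gold_steps)


if __name__ == "__main__":
    q, gold_steps, answer = compound_interest_template(seed=42)
    print(q)
    # Suppose numbers extracted from a model's chain of thought (one is wrong).
    model_numbers = [gold_steps[0][1], answer * 1.1]
    print("step-level score:", chaineval_steps(gold_steps, model_numbers))
```

Because every intermediate value in such a template is produced by executable code, gold reasoning steps can be regenerated for arbitrarily many parameter settings and compared automatically against numbers extracted from a model's chain of thought, which is the property the abstract highlights.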
Related papers
- FinS-Pilot: A Benchmark for Online Financial System [17.65500174763836]
FinS-Pilot is a novel benchmark for evaluating retrieval-augmented generation (RAG) systems built on large language models in online financial applications. Our benchmark incorporates both real-time API data and structured text sources, organized through an intent classification framework. Our work contributes both a practical evaluation framework and a curated dataset to advance research in financial NLP systems.
arXiv Detail & Related papers (2025-05-31T03:50:19Z) - DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain [6.275468311396066]
Large Language Models (LLMs) have achieved impressive performance in diverse natural language processing tasks. We introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields. We evaluate 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen.
arXiv Detail & Related papers (2025-04-18T16:40:39Z) - FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z) - Demystifying Domain-adaptive Post-training for Financial LLMs [79.581577578952]
FINDAP is a systematic and fine-grained investigation into domain-adaptive post-training of large language models (LLMs). Our approach consists of four key components: FinCap, FinRec, FinTrain, and FinEval. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks.
arXiv Detail & Related papers (2025-01-09T04:26:15Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models [18.280762424107408]
FinTral is a suite of state-of-the-art multimodal large language models (LLMs) built upon the Mistral-7b model.
We enhance FinTral with domain-specific pretraining, instruction fine-tuning, and RLAIF training.
Our FinTral model, trained with direct preference optimization and employing advanced tool use and retrieval methods (dubbed FinTral-DPO-T&R), demonstrates exceptional zero-shot performance.
arXiv Detail & Related papers (2024-02-16T05:05:12Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - FinTree: Financial Dataset Pretrain Transformer Encoder for Relation Extraction [0.0]
We pretrain FinTree on a financial dataset, adapting the model to financial tasks.
FinTree stands out with its novel structure that predicts a masked token instead of the conventional [CLS] token.
Our experiments demonstrate that FinTree outperforms existing models on REFinD, a large-scale financial relation extraction dataset.
arXiv Detail & Related papers (2023-07-26T01:48:52Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z) - WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain [42.093876880881886]
We propose a novel domain-specific Financial LANGuage model (FLANG).
It uses financial keywords and phrases for better masking, together with span boundary and in-filling objectives.
Our models, code and benchmark data are publicly available on Github and Huggingface.
arXiv Detail & Related papers (2022-10-31T18:35:18Z) - FinQA: A Dataset of Numerical Reasoning over Financial Data [52.7249610894623]
We focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents.
We propose a new large-scale dataset, FinQA, with question-answering pairs over financial reports, written by financial experts.
The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge.
arXiv Detail & Related papers (2021-09-01T00:08:14Z)