FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains
- URL: http://arxiv.org/abs/2311.09797v2
- Date: Thu, 8 Aug 2024 15:45:11 GMT
- Title: FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains
- Authors: Yilun Zhao, Hongjun Liu, Yitao Long, Rui Zhang, Chen Zhao, Arman Cohan,
- Abstract summary: We introduce FinanceMath, a novel benchmark designed to evaluate LLMs' capabilities in solving math reasoning problems.
FinanceMath includes 1,200 problems with a hybrid of textual and tabular content.
We construct a finance-domain knowledge bank and investigate various knowledge integration strategies.
- Score: 31.71662323881496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce FinanceMath, a novel benchmark designed to evaluate LLMs' capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 44 LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. Our experimental results reveal that the current best-performing system (i.e., GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that FinanceMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving reasoning-intensive tasks.
Related papers
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents [38.51865513988743]
This paper introduces DocMath-Eval, a benchmark to evaluate the numerical reasoning capabilities of LLMs.
We conduct an evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods.
We find that even the best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts.
arXiv Detail & Related papers (2023-11-16T11:30:53Z) - FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for
Large Language Models [25.137098233579255]
FinEval is a benchmark for the financial domain knowledge in the large language models (LLMs)
FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts.
The results show that only GPT-4 achieved an accuracy close to 70% in different prompt settings.
arXiv Detail & Related papers (2023-08-19T10:38:00Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce an expansive benchmark suite SciBench for Large Language Model (LLM)
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark
for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLMs) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z) - Systematic Review on Reinforcement Learning in the Field of Fintech [0.36832029288386137]
The objective of this systematic survey is to perform an exploratory study on a correlation between reinforcement learning and complexity.
The use of RL-based strategies in fields proves to perform considerably better than other state-of-the-art algorithms.
The organizations dealing with finance can benefit greatly from smart order channelling, market making, hedging and options, pricing, portfolio optimization, and optimal execution.
arXiv Detail & Related papers (2023-04-29T07:48:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.