Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering
- URL: http://arxiv.org/abs/2402.11194v2
- Date: Thu, 29 Feb 2024 09:13:58 GMT
- Title: Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering
- Authors: Pragya Srivastava, Manuj Malik, Vivek Gupta, Tanuja Ganu, Dan Roth
- Abstract summary: This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets.
We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps.
We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
- Score: 53.56653281752486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) excel in natural language understanding, but their capability for complex mathematical reasoning over an amalgamation of structured tables and unstructured text is uncertain. This study explores LLMs' mathematical reasoning on four financial tabular question-answering datasets: TATQA, FinQA, ConvFinQA, and Multihiertt. Through extensive experiments with various models and prompting techniques, we assess how LLMs adapt to complex tables and mathematical tasks. We focus on sensitivity to table complexity and on performance variations as the number of arithmetic reasoning steps increases. The results provide insights into LLMs' capabilities and limitations in handling complex mathematical scenarios involving semi-structured tables. Ultimately, we introduce a novel prompting technique tailored to semi-structured documents that matches or outperforms other baselines while providing a nuanced understanding of LLMs' abilities for such a task.
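The abstract does not spell out the proposed prompting technique, so the following is only a minimal sketch of the general setup it describes: a semi-structured financial table and its surrounding text are serialized into a single prompt that asks the model to work through the arithmetic step by step. All function names and the example figures are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: serialize a semi-structured financial table plus surrounding
# text into one prompt that elicits step-by-step arithmetic reasoning.
# All names and example data are hypothetical, not taken from the paper.

def serialize_table(headers, rows):
    """Flatten a table into pipe-delimited lines an LLM can read."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def build_prompt(pre_text, headers, rows, post_text, question):
    """Combine unstructured text and a serialized table, then ask the model
    to show every arithmetic step before giving the final answer."""
    return (
        f"{pre_text}\n\n"
        f"Table:\n{serialize_table(headers, rows)}\n\n"
        f"{post_text}\n\n"
        f"Question: {question}\n"
        "Work through the required arithmetic step by step, "
        "then state the final numeric answer."
    )

if __name__ == "__main__":
    prompt = build_prompt(
        pre_text="The company reported the following revenue figures.",
        headers=["Year", "Revenue ($M)"],
        rows=[[2021, 120.0], [2022, 150.0]],
        post_text="All figures are in millions of US dollars.",
        question="What is the percentage change in revenue from 2021 to 2022?",
    )
    print(prompt)  # the resulting string can be sent to any LLM API
```

A multi-step question like the one above (difference, then division, then conversion to a percentage) is the kind of case where performance is reported to degrade as the number of arithmetic reasoning steps grows.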
Related papers
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the complex instruction-following ability of large language models (LLMs) has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks.
We evaluate the performance of various SOTA LLMs on the MathChat benchmark and observe that, while these models excel at single-turn question answering, they significantly underperform in more complex scenarios.
We develop MathChat sync, a synthetic, dialogue-based math dataset for LLM finetuning, focused on improving models' interaction and instruction-following capabilities in conversations.
arXiv Detail & Related papers (2024-05-29T18:45:55Z)
- Investigating Symbolic Capabilities of Large Language Models [16.88906206735967]
This study aims to bridge the gap by rigorously evaluating Large Language Models (LLMs) on a series of symbolic tasks.
Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks.
The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases.
arXiv Detail & Related papers (2024-05-21T21:24:34Z)
- Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation [33.41556606816004]
Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task.
There is still no consensus on the optimal prompt templates and design frameworks.
Existing benchmarks inadequately explore the performance of LLMs across the various sub-tasks of the Text-to-SQL process.
arXiv Detail & Related papers (2024-03-05T13:23:48Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
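As a rough illustration of the three TAP4LLM stages named in the summary above (table sampling, table augmentation, and packing & serialization), here is a toy sketch. The keyword-overlap sampling heuristic, the markdown output format, and all function names are assumptions for illustration, not the actual TAP4LLM implementation.

```python
# Toy sketch of a sampling -> augmentation -> packing pipeline for table-based
# LLM prompting. Heuristics and names are assumptions, not TAP4LLM's code.

def sample_rows(rows, query, max_rows=5):
    """Keep the rows sharing the most tokens with the query, as a crude proxy
    for decomposing a large table into a query-relevant sub-table."""
    terms = set(query.lower().split())
    scored = sorted(
        rows,
        key=lambda row: -len(terms & set(" ".join(map(str, row)).lower().split())),
    )
    return scored[:max_rows]

def augment(headers, rows, notes):
    """Attach extra knowledge (e.g., metric definitions) as added context."""
    return {"headers": headers, "rows": rows, "notes": notes}

def pack_markdown(table):
    """Serialize the augmented sub-table as a markdown table for the LLM."""
    head = "| " + " | ".join(table["headers"]) + " |"
    sep = "| " + " | ".join("---" for _ in table["headers"]) + " |"
    body = ["| " + " | ".join(map(str, row)) + " |" for row in table["rows"]]
    return "\n".join([head, sep] + body + ["", "Notes: " + table["notes"]])

if __name__ == "__main__":
    headers = ["Segment", "2022 Revenue ($M)", "2023 Revenue ($M)"]
    rows = [["Cloud", 40, 55], ["Hardware", 80, 75], ["Ads", 120, 130]]
    sub = sample_rows(rows, "How did Cloud revenue change in 2023?", max_rows=2)
    packed = pack_markdown(augment(headers, sub, "Revenue is net of returns."))
    print(packed)  # markdown-serialized sub-table ready to prepend to a prompt
```

The markdown format shown here is only one of the serialization targets such a packer could emit; JSON or delimiter-separated text would follow the same pattern.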