FinanceBench: A New Benchmark for Financial Question Answering
- URL: http://arxiv.org/abs/2311.11944v1
- Date: Mon, 20 Nov 2023 17:28:02 GMT
- Title: FinanceBench: A New Benchmark for Financial Question Answering
- Authors: Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino
Scherrer, Bertie Vidgen
- Abstract summary: FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA)
It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings.
We test 16 state-of-the-art model configurations on a sample of 150 cases from FinanceBench, and manually review their answers.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: FinanceBench is a first-of-its-kind test suite for evaluating the performance
of LLMs on open book financial question answering (QA). It comprises 10,231
questions about publicly traded companies, with corresponding answers and
evidence strings. The questions in FinanceBench are ecologically valid and
cover a diverse set of scenarios. They are intended to be clear-cut and
straightforward to answer to serve as a minimum performance standard. We test
16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and
Claude2, with vector stores and long context prompts) on a sample of 150 cases
from FinanceBench, and manually review their answers (n=2,400). The cases are
available open-source. We show that existing LLMs have clear limitations for
financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly
answered or refused to answer 81% of questions. While augmentation techniques
such as using a longer context window to feed in relevant evidence improve
performance, they are unrealistic for enterprise settings due to increased
latency and cannot support larger financial documents. We find that all models
examined exhibit weaknesses, such as hallucinations, that limit their
suitability for use by enterprises.
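The open-book evaluation loop the abstract describes (retrieve evidence for each question, generate an answer, then grade it) can be sketched as follows. This is an illustrative toy, not the FinanceBench harness: the keyword-overlap retriever, the exact-match grader, and all function names are assumptions made for clarity (the paper itself uses vector stores and manual review).

```python
# Toy sketch of a retrieval-augmented QA evaluation loop.
# All names and scoring choices here are illustrative assumptions,
# not the actual FinanceBench pipeline.

def retrieve(question: str, passages: list[str]) -> str:
    """Return the passage sharing the most words with the question.

    Stand-in for the vector-store retrieval step in the paper."""
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))

def grade(model_answer: str, gold_answer: str) -> bool:
    """Toy exact-match grader; the paper used manual human review instead."""
    return model_answer.strip().lower() == gold_answer.strip().lower()

def evaluate(cases: list[dict], answer_fn, passages: list[str]) -> float:
    """Run each QA case through retrieval + the model and return accuracy."""
    correct = 0
    for case in cases:
        evidence = retrieve(case["question"], passages)
        answer = answer_fn(case["question"], evidence)
        correct += grade(answer, case["answer"])
    return correct / len(cases)
```

In the paper's setup, `answer_fn` would be an LLM call prompted with the question plus the retrieved evidence; the headline finding is that even then, GPT-4-Turbo answered incorrectly or refused on 81% of questions.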
Related papers
- MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning [42.80085792749683]
We propose MME-Finance, an open-ended and practical usage-oriented Visual Question Answering (VQA) benchmark.
The characteristics of our benchmark are finance and expertise, which include constructing charts that reflect the actual usage needs of users.
In addition, we propose a Chinese version, which helps compare performance of MLLMs under a Chinese context.
arXiv Detail & Related papers (2024-11-05T18:59:51Z)
- FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering [22.245216871611678]
FAMMA is an open-source benchmark for financial multilingual multimodal question answering.
It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams.
We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models.
arXiv Detail & Related papers (2024-10-06T15:41:26Z)
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [90.67346776473241]
Large language models (LLMs) have advanced financial applications, yet they often lack sufficient financial knowledge and struggle with tasks involving multi-modal inputs like tables and time series data.
We introduce Open-FinLLMs, a series of Financial LLMs that embed comprehensive financial knowledge into text, tables, and time-series data.
We also present FinLLaVA, a multimodal LLM trained with 1.43M image-text instructions to handle complex financial data types.
arXiv Detail & Related papers (2024-08-20T16:15:28Z)
- CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) under Chinese context.
It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment.
The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z)
- MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations [105.10376440302076]
This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions.
It is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens.
Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models.
arXiv Detail & Related papers (2024-07-01T17:59:26Z)
- SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation [50.061029816288936]
We present SciFIBench, a scientific figure interpretation benchmark.
Our main benchmark consists of a 1000-question gold set of multiple-choice questions split between two tasks across 12 categories.
The questions are curated from CS arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control.
We evaluate 26 LMMs on SciFIBench, finding it to be a challenging benchmark.
arXiv Detail & Related papers (2024-05-14T17:54:17Z)
- FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
- InvestLM: A Large Language Model for Investment using Financial Domain Instruction Tuning [19.22852919096857]
We present a new financial domain large language model, InvestLM, tuned on LLaMA-65B (Touvron et al., 2023)
Inspired by less-is-more-for-alignment, we manually curate a small yet diverse instruction dataset, covering a wide range of financial related topics.
InvestLM shows strong capabilities in understanding financial text and provides helpful responses to investment related questions.
arXiv Detail & Related papers (2023-09-15T02:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.