Related papers: FinBen: A Holistic Financial Benchmark for Large Language Models

URL: http://arxiv.org/abs/2402.12659v2
Date: Wed, 19 Jun 2024 03:38:56 GMT
Title: FinBen: A Holistic Financial Benchmark for Large Language Models
Authors: Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang
Abstract summary: FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
Score: 75.09474986283394
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks, covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and codes are released for the research community: https://github.com/The-FinAI/PIXIU.

Related papers

FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain [54.06289302468199]
FinTrust is a benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Proprietary models like o4-mini outperform others in most tasks, such as safety. Open-source models like DeepSeek-V3 have an advantage in specific areas like industry-level fairness.
arXiv Detail & Related papers (2025-10-17T01:45:49Z)
XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning [28.967959142733903]
We introduce XFinBench, a novel benchmark to evaluate large language models' ability in solving financial problems. O1 is the best-performing text-only model with an overall accuracy of 67.3%, but still lags significantly behind human experts, by 12.5%. We construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that relevant knowledge only brings consistent accuracy improvements to small open-source models.
arXiv Detail & Related papers (2025-08-20T15:23:35Z)
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application [118.63802040274999]
MultiFinBen is the first expert-annotated multilingual (five languages) and multimodal benchmark for evaluating LLMs in realistic financial contexts. Financial reasoning tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings.
arXiv Detail & Related papers (2025-06-16T22:01:49Z)
FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs [15.230256296815565]
FinMaster is a benchmark designed to assess the capabilities of large language models (LLMs) in financial literacy, accounting, auditing, and consulting. FinMaster comprises three main modules: FinSim, FinSuite, and FinEval. Experiments reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 37% on complex scenarios.
arXiv Detail & Related papers (2025-05-18T11:47:55Z)
FLAME: Financial Large-Language Model Assessment and Metrics Evaluation [2.6420673380196824]
We introduce FLAME, a comprehensive financial LLMs evaluation system in Chinese. FLAME-Cer covers 14 types of authoritative financial certifications, with a total of approximately 16,000 carefully selected questions. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks.
arXiv Detail & Related papers (2025-01-03T09:17:23Z)
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [90.67346776473241]
Large language models (LLMs) have advanced financial applications, yet they often lack sufficient financial knowledge and struggle with tasks involving multi-modal inputs like tables and time series data. We introduce Open-FinLLMs, a series of Financial LLMs that embed comprehensive financial knowledge into text, tables, and time-series data. We also present FinLLaVA, a multimodal LLM trained with 1.43M image-text instructions to handle complex financial data types.
arXiv Detail & Related papers (2024-08-20T16:15:28Z)
CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) in a Chinese context. It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z)
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework [48.3060010653088]
We release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data. We then use AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task.
arXiv Detail & Related papers (2024-03-19T09:45:33Z)
A Survey of Large Language Models in Finance (FinLLMs) [10.195778659105626]
Large Language Models (LLMs) have shown remarkable capabilities across a wide variety of Natural Language Processing (NLP) tasks. This survey provides a comprehensive overview of FinLLMs, including their history, techniques, performance, and opportunities and challenges. To support AI research in finance, we compile a collection of accessible datasets and evaluation benchmarks on GitHub.
arXiv Detail & Related papers (2024-02-04T02:06:57Z)
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models [25.137098233579255]
FinEval is a benchmark for financial domain knowledge in large language models (LLMs). FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts. The results show that only GPT-4 achieved an accuracy close to 70% in different prompt settings.
arXiv Detail & Related papers (2023-08-19T10:38:00Z)
FinGPT: Democratizing Internet-scale Data for Financial Large Language Models [35.83244096535722]
Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like texts. Financial Generative Pre-trained Transformer (FinGPT) automates the collection and curation of real-time financial data from 34 diverse sources on the Internet. FinGPT aims to democratize FinLLMs, stimulate innovation, and unlock new opportunities in open finance.
arXiv Detail & Related papers (2023-07-19T22:43:57Z)
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data. We propose FinMA by fine-tuning LLaMA with the constructed dataset so that it can follow instructions for various financial tasks. We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)
FinQA: A Dataset of Numerical Reasoning over Financial Data [52.7249610894623]
We focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. We propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge.
arXiv Detail & Related papers (2021-09-01T00:08:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.