CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2407.02301v1
- Date: Tue, 2 Jul 2024 14:34:36 GMT
- Title: CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models
- Authors: Ying Nie, Binwei Yan, Tianyu Guo, Hao Liu, Haoyu Wang, Wei He, Binfan Zheng, Weihao Wang, Qiang Li, Weijian Sun, Yunhe Wang, Dacheng Tao
- Abstract summary: CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) in a Chinese context.
It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment.
The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
- Score: 61.324062412648075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging and domain-specific tasks, such as finance, has not been fully explored. In this paper, we present CFinBench: a meticulously crafted benchmark, the most comprehensive to date, for assessing the financial knowledge of LLMs in a Chinese context. In practice, to better align with the career trajectory of Chinese financial practitioners, we build a systematic evaluation from 4 first-level categories: (1) Financial Subject: whether LLMs can memorize the necessary basic knowledge of financial subjects, such as economics, statistics and auditing. (2) Financial Qualification: whether LLMs can obtain the required financial certifications, such as certified public accountant, securities qualification and banking qualification. (3) Financial Practice: whether LLMs can perform practical financial jobs, such as tax consultant, junior accountant and securities analyst. (4) Financial Law: whether LLMs can meet the requirements of financial laws and regulations, such as tax law, insurance law and economic law. CFinBench comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. We conduct extensive experiments on 50 representative LLMs of various model sizes on CFinBench. The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%, highlighting the challenge presented by CFinBench. The dataset and evaluation code are available at https://cfinbench.github.io/.
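The abstract reports an average accuracy over three question types. As a rough illustration only, the sketch below shows one way such a score could be computed. The record layout, the exact-match rule for multiple-choice answers, and the unweighted average across question types are all assumptions made for this example; the abstract does not specify the scoring procedure, and the official evaluation code may differ.

```python
from collections import defaultdict


def is_correct(question_type: str, gold: str, predicted: str) -> bool:
    """Assumed scoring rule (hypothetical, not taken from the paper)."""
    if question_type == "multiple-choice":
        # Compare the selected option letters as unordered sets (exact match).
        gold_set = {c for c in gold.upper() if c.isalpha()}
        pred_set = {c for c in predicted.upper() if c.isalpha()}
        return gold_set == pred_set
    # Single-choice and judgment: case-insensitive string match.
    return gold.strip().upper() == predicted.strip().upper()


def accuracy_by_type(records):
    """records: iterable of (question_type, gold, predicted) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_type, gold, predicted in records:
        total[question_type] += 1
        if is_correct(question_type, gold, predicted):
            correct[question_type] += 1
    return {t: correct[t] / total[t] for t in total}


def macro_average(per_type):
    """Unweighted mean over question types (an assumed aggregation)."""
    return sum(per_type.values()) / len(per_type)


# Toy records invented for illustration; they are not CFinBench data.
records = [
    ("single-choice", "B", "B"),
    ("single-choice", "A", "C"),
    ("multiple-choice", "ACD", "DCA"),
    ("multiple-choice", "AB", "A"),
    ("judgment", "True", "true"),
    ("judgment", "False", "True"),
    ("judgment", "False", "false"),
]
per_type = accuracy_by_type(records)
average = macro_average(per_type)
```

Under this assumed rule a partially correct multiple-choice answer (such as "A" for gold "AB") scores zero; a benchmark could equally award partial credit, which would raise the reported numbers.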
Related papers
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [90.67346776473241]
Large language models (LLMs) have advanced financial applications, yet they often lack sufficient financial knowledge and struggle with tasks involving multi-modal inputs like tables and time series data.
We introduce Open-FinLLMs, a series of Financial LLMs that embed comprehensive financial knowledge into text, tables, and time-series data.
We also present FinLLaVA, a multimodal LLM trained with 1.43M image-text instructions to handle complex financial data types.
arXiv Detail & Related papers (2024-08-20T16:15:28Z) - MTFinEval: A Multi-domain Chinese Financial Benchmark with Eurypalynous questions [19.755793171557123]
We have compiled a new benchmark, MTFinEval, focusing on the LLMs' basic knowledge of economics.
MTFinEval comprises 360 questions refined from six major disciplines of economics, and reflects capabilities more comprehensively.
Experimental results show that all LLMs perform poorly on MTFinEval, which proves that our benchmark built on basic knowledge is very successful.
arXiv Detail & Related papers (2024-08-20T15:04:38Z) - Large Language Model in Financial Regulatory Interpretation [0.276240219662896]
This study explores the innovative use of Large Language Models (LLMs) as analytical tools for interpreting complex financial regulations.
The primary objective is to design effective prompts that guide LLMs in distilling verbose and intricate regulatory texts.
This novel approach aims to streamline the implementation of regulatory mandates within the financial reporting and risk management systems of global banking institutions.
arXiv Detail & Related papers (2024-05-10T20:45:40Z) - SuperCLUE-Fin: Graded Fine-Grained Analysis of Chinese LLMs on Diverse Financial Tasks and Applications [17.34850312139675]
SC-Fin is a pioneering evaluation framework tailored for Chinese-native financial large language models (FLMs).
It assesses FLMs across six financial application domains and twenty-five specialized tasks.
Using multi-turn, open-ended conversations that mimic real-life scenarios, SC-Fin measures models on a range of criteria.
arXiv Detail & Related papers (2024-04-29T19:04:35Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - Revolutionizing Finance with LLMs: An Overview of Applications and Insights [47.11391223936608]
Large Language Models (LLMs) like ChatGPT have seen considerable advancements and have been applied in diverse fields.
These models are being utilized for automating financial report generation, forecasting market trends, analyzing investor sentiment, and offering personalized financial advice.
arXiv Detail & Related papers (2024-01-22T01:06:17Z) - CFBenchmark: Chinese Financial Assistant Benchmark for Large Language Model [22.127509074325324]
Large language models (LLMs) have demonstrated great potential in the financial domain.
In this work, we introduce CFBenchmark to evaluate the performance of LLMs as Chinese financial assistants.
arXiv Detail & Related papers (2023-11-10T01:12:03Z) - Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models [53.620827459684094]
Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks.
We propose the first open-source comprehensive framework for exploring LLMs for credit scoring.
We then propose the first Credit and Risk Assessment Large Language Model (CALM) by instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks.
arXiv Detail & Related papers (2023-10-01T03:50:34Z) - FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models [25.137098233579255]
FinEval is a benchmark for financial domain knowledge in large language models (LLMs).
FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts.
The results show that only GPT-4 achieved an accuracy close to 70% in different prompt settings.
arXiv Detail & Related papers (2023-08-19T10:38:00Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)