Related papers: FLAME: Financial Large-Language Model Assessment and Metrics Evaluation

URL: http://arxiv.org/abs/2501.06211v1
Date: Fri, 03 Jan 2025 09:17:23 GMT
Title: FLAME: Financial Large-Language Model Assessment and Metrics Evaluation
Authors: Jiayu Guo, Yu Guo, Martha Li, Songtao Tan
Abstract summary: We introduce FLAME, a comprehensive financial LLMs evaluation system in Chinese. FLAME-Cer covers 14 types of authoritative financial certifications, with a total of approximately 16,000 carefully selected questions. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks.
Score: 2.6420673380196824
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: LLMs have revolutionized NLP and demonstrated potential across diverse domains. A growing number of financial LLMs have been introduced for finance-specific tasks, yet comprehensively assessing their value remains challenging. In this paper, we introduce FLAME, a comprehensive evaluation system for financial LLMs in Chinese, which includes two core evaluation benchmarks: FLAME-Cer and FLAME-Sce. FLAME-Cer covers 14 types of authoritative financial certifications, including CPA, CFA, and FRM, with a total of approximately 16,000 carefully selected questions. All questions have been manually reviewed to ensure accuracy and representativeness. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks. We evaluate 6 representative LLMs, including GPT-4o, GLM-4, ERNIE-4.0, Qwen2.5, XuanYuan3, and the latest Baichuan4-Finance, and find that Baichuan4-Finance outperforms the other LLMs in most tasks. By establishing a comprehensive and professional evaluation system, FLAME facilitates the advancement of financial LLMs in Chinese contexts. Instructions for participating in the evaluation are available on GitHub: https://github.com/FLAME-ruc/FLAME.

Related papers

KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding [6.3604109210772934]
KFinEval-Pilot is a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. It comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity.
arXiv Detail & Related papers (2025-04-17T00:12:58Z)
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [90.67346776473241]
Large language models (LLMs) have advanced financial applications, yet they often lack sufficient financial knowledge and struggle with tasks involving multi-modal inputs like tables and time series data. We introduce Open-FinLLMs, a series of Financial LLMs that embed comprehensive financial knowledge into text, tables, and time-series data. We also present FinLLaVA, a multimodal LLM trained with 1.43M image-text instructions to handle complex financial data types.
arXiv Detail & Related papers (2024-08-20T16:15:28Z)
CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) in the Chinese context. It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z)
SuperCLUE-Fin: Graded Fine-Grained Analysis of Chinese LLMs on Diverse Financial Tasks and Applications [17.34850312139675]
SC-Fin is a pioneering evaluation framework tailored for Chinese-native financial large language models (FLMs). It assesses FLMs across six financial application domains and twenty-five specialized tasks. Using multi-turn, open-ended conversations that mimic real-life scenarios, SC-Fin measures models on a range of criteria.
arXiv Detail & Related papers (2024-04-29T19:04:35Z)
FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
Dólares or Dollars? Unraveling the Bilingual Prowess of Financial LLMs Between Spanish and English [67.48541936784501]
Toisón de Oro is the first framework that establishes instruction datasets, finetuned LLMs, and an evaluation benchmark for financial LLMs in Spanish jointly with English. We construct a rigorously curated bilingual instruction dataset including over 144K Spanish and English samples from 15 datasets covering 7 tasks. We evaluate our model and existing LLMs using FLARE-ES, the first comprehensive bilingual evaluation benchmark with 21 datasets covering 9 tasks.
arXiv Detail & Related papers (2024-02-12T04:50:31Z)
Revolutionizing Finance with LLMs: An Overview of Applications and Insights [45.660896719456886]
Large Language Models (LLMs) like ChatGPT have seen considerable advancements and have been applied in diverse fields. These models are being utilized for automating financial report generation, forecasting market trends, analyzing investor sentiment, and offering personalized financial advice.
arXiv Detail & Related papers (2024-01-22T01:06:17Z)
DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning [74.99318727786337]
We propose a Multiple Experts Fine-tuning Framework to build a financial large language model (LLM). We build a financial instruction-tuning dataset named DISC-FIN-SFT, including instruction samples of four categories (consulting, NLP tasks, computing and retrieval-augmented generation). Evaluations conducted on multiple benchmarks demonstrate that our model performs better than baseline models in various financial scenarios.
arXiv Detail & Related papers (2023-10-23T11:33:41Z)
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models [31.961563103990432]
This paper presents FinEval, a benchmark designed to evaluate LLMs' financial domain knowledge and practical abilities. The dataset contains 8,351 questions categorized into four different key areas: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. Our results show that Claude 3.5-Sonnet achieves the highest weighted average score of 72.9 across all financial domain categories under zero-shot setting.
arXiv Detail & Related papers (2023-08-19T10:38:00Z)
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data. We propose FinMA by fine-tuning LLaMA with the constructed dataset so that it can follow instructions for various financial tasks. We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.