KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding
- URL: http://arxiv.org/abs/2504.13216v1
- Date: Thu, 17 Apr 2025 00:12:58 GMT
- Title: KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding
- Authors: Bokwang Hwang, Seonkyu Lim, Taewoong Kim, Yongjae Geun, Sunghyun Bang, Sohyun Park, Jihyun Park, Myeonggyu Lee, Jinwoo Lee, Yerin Kim, Jinsun Yoo, Jingyeong Hong, Jina Park, Yongchan Kim, Suhyun Kim, Younggyun Hahm, Yiseul Lee, Yejee Kang, Chanhyuk Yoon, Chansu Lee, Heeyewon Jeong, Jiyeon Lee, Seonhye Gu, Hyebin Kang, Yousang Cho, Hangyeol Yoo, KyungTae Lim
- Abstract summary: KFinEval-Pilot is a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. It comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity.
- Score: 6.3604109210772934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences across models, with trade-offs between task accuracy and output safety across different model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.
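The construction recipe described above (GPT-4-generated prompts filtered by expert validation across the three question areas) can be pictured with a short sketch. Everything below is an illustrative assumption rather than the authors' released code: the prompt wording, the helper names generate_candidates and expert_validate, and the use of the OpenAI chat-completions client are placeholders for the paper's actual semi-automated pipeline.

```python
# Minimal sketch (not the authors' code) of a semi-automated benchmark
# construction loop: GPT-4 drafts candidate questions, human experts
# validate them, and only approved items enter the benchmark.
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The three areas named in the abstract.
AREAS = ["financial knowledge", "legal reasoning", "financial toxicity"]


def generate_candidates(area: str, n: int = 5) -> list[dict]:
    """Ask GPT-4 for n candidate questions in one area (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You write evaluation questions for the Korean financial domain."},
            {"role": "user",
             "content": (f"Write {n} {area} questions grounded in Korean regulation "
                         "and practice. Return a JSON list of objects with "
                         "'question' and 'reference_answer' fields.")},
        ],
    )
    # A production pipeline would validate and repair the JSON; this sketch
    # assumes the model returned a well-formed list.
    return json.loads(response.choices[0].message.content)


def expert_validate(item: dict) -> bool:
    """Placeholder for the human step: domain experts check factual accuracy,
    regulatory relevance, and safety before an item is accepted."""
    raise NotImplementedError("Replace with an expert review workflow.")


def build_benchmark() -> list[dict]:
    benchmark = []
    for area in AREAS:
        for item in generate_candidates(area):
            if expert_validate(item):  # items enter only after expert approval
                benchmark.append({"area": area, **item})
    return benchmark
```

The key design point, per the abstract, is the order of the gates: model-generated items never enter the benchmark until an expert approves them, which is how generation scale is traded for domain relevance and factual accuracy.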
Related papers
- FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z) - FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models [0.0]
FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work. Current LLMs fail to meet the strict accuracy requirements of financial institutions, with models failing approximately 60% of realistic tasks. Results show that higher-quality training data is needed to support such tasks, which we experiment with using OpenAI's fine-tuning API.
arXiv Detail & Related papers (2025-01-30T00:06:55Z) - FLAME: Financial Large-Language Model Assessment and Metrics Evaluation [2.6420673380196824]
We introduce FLAME, a comprehensive evaluation system for financial LLMs in Chinese.
FLAME-Cer covers 14 types of authoritative financial certifications, with a total of approximately 16,000 carefully selected questions.
FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks.
arXiv Detail & Related papers (2025-01-03T09:17:23Z) - Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models [22.594428755214356]
"Golden Touchstone" is the first comprehensive bilingual benchmark for financial LLMs.
The benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities.
We open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning.
arXiv Detail & Related papers (2024-11-09T20:09:11Z) - Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [88.96861155804935]
We introduce Open-FinLLMs, the first open-source multimodal financial LLMs. FinLLaMA is pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct, fine-tuned with 573K financial instructions; and FinLLaVA, enhanced with 1.43M multimodal tuning pairs. We evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings.
arXiv Detail & Related papers (2024-08-20T16:15:28Z) - CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) in a Chinese context.
It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment.
The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z) - Financial Knowledge Large Language Model [4.599537455808687]
We introduce IDEA-FinBench, an evaluation benchmark for assessing financial knowledge in large language models (LLMs).
We propose IDEA-FinKER, a framework designed to facilitate the rapid adaptation of general LLMs to the financial domain.
Finally, we present IDEA-FinQA, a financial question-answering system powered by LLMs.
arXiv Detail & Related papers (2024-06-29T08:26:49Z) - SuperCLUE-Fin: Graded Fine-Grained Analysis of Chinese LLMs on Diverse Financial Tasks and Applications [17.34850312139675]
SC-Fin is a pioneering evaluation framework tailored for Chinese-native financial large language models (FLMs).
It assesses FLMs across six financial application domains and twenty-five specialized tasks.
Using multi-turn, open-ended conversations that mimic real-life scenarios, SC-Fin measures models on a range of criteria.
arXiv Detail & Related papers (2024-04-29T19:04:35Z) - No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks [75.29561463156635]
ICE-PIXIU uniquely integrates a spectrum of Chinese tasks, alongside translated and original English datasets.
It provides unrestricted access to diverse model variants, a compilation of diverse cross-lingual and multi-modal instruction data, and an evaluation benchmark with expert annotations.
arXiv Detail & Related papers (2024-03-10T16:22:20Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models [31.961563103990432]
This paper presents FinEval, a benchmark designed to evaluate LLMs' financial domain knowledge and practical abilities. The dataset contains 8,351 questions categorized into four different key areas: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. Our results show that Claude 3.5 Sonnet achieves the highest weighted average score of 72.9 across all financial domain categories under the zero-shot setting.
arXiv Detail & Related papers (2023-08-19T10:38:00Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA, built by fine-tuning LLaMA on the constructed dataset so that it can follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)