SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities
- URL: http://arxiv.org/abs/2504.04596v1
- Date: Sun, 06 Apr 2025 19:59:41 GMT
- Title: SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities
- Authors: Noga Ben Yoash, Meni Brief, Oded Ovadia, Gil Shenderovitz, Moshik Mishaeli, Rachel Lemberg, Eitam Sheetrit,
- Abstract summary: SECQUE comprises 565 expert-written questions covering SEC filings analysis across four key categories. To assess model performance, we develop SECQUE-Judge, an evaluation mechanism leveraging multiple LLM-based judges.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SECQUE, a comprehensive benchmark for evaluating large language models (LLMs) in financial analysis tasks. SECQUE comprises 565 expert-written questions covering SEC filings analysis across four key categories: comparison analysis, ratio calculation, risk assessment, and financial insight generation. To assess model performance, we develop SECQUE-Judge, an evaluation mechanism leveraging multiple LLM-based judges, which demonstrates strong alignment with human evaluations. Additionally, we provide an extensive analysis of various models' performance on our benchmark. By making SECQUE publicly available, we aim to facilitate further research and advancements in financial AI.
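The abstract describes SECQUE-Judge as leveraging multiple LLM-based judges. A minimal sketch of such an aggregation is a majority vote over independent judge verdicts; the `judge_answer` interface and verdict labels below are illustrative assumptions, not SECQUE-Judge's actual implementation.

```python
from collections import Counter

def majority_verdict(verdicts: list[str]) -> str:
    """Return the most common verdict among the judges (ties break by first seen)."""
    return Counter(verdicts).most_common(1)[0][0]

def judge_answer(question: str, answer: str, judges: list) -> str:
    """Collect one verdict per judge callable, then aggregate by majority vote.
    Each judge is a hypothetical callable wrapping an LLM-based grader."""
    return majority_verdict([judge(question, answer) for judge in judges])
```

With three judges, for example, a single disagreeing verdict is outvoted by the other two.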
Related papers
- FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation [16.096968833930152]
We introduce FIRE, a benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions drawn from widely recognized financial qualification exams. To assess the practical value of LLMs in real-world financial tasks, we propose a systematic evaluation matrix that categorizes complex financial domains.
arXiv Detail & Related papers (2026-02-25T08:53:56Z) - FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis [110.5695516127813]
HisRubric is a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric. FinDeepResearch is a benchmark that comprises 64 listed companies from 8 financial markets across 4 languages. We conduct extensive experiments on FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only.
arXiv Detail & Related papers (2025-10-15T17:21:56Z) - FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance. The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms. We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z) - Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation [3.077814260904367]
We propose FinAR-Bench, a benchmark dataset focusing on financial statement analysis. We break this task into three measurable steps: extracting key information, calculating financial indicators, and applying logical reasoning. Our findings offer a clear understanding of LLMs' current strengths and limitations in fundamental analysis.
arXiv Detail & Related papers (2025-05-22T07:06:20Z) - Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z) - FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making.
FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z) - FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models [0.0]
FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work. Current LLMs fail to meet the strict accuracy requirements of financial institutions, failing approximately 60% of realistic tasks. Results show that higher-quality training data is needed to support such tasks, which we experiment with using OpenAI's fine-tuning API.
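The strict accuracy requirement mentioned above amounts to exact-match grading of numeric answers. A hedged sketch of such a grader, with an optional relative tolerance as an assumption (the paper does not specify its grading function):

```python
def numeric_match(predicted: float, reference: float, rel_tol: float = 0.0) -> bool:
    """Grade a numeric financial answer. The default rel_tol=0.0 enforces
    exact match; a small relative tolerance can relax it, e.g. to allow
    rounding in ratio answers. This interface is illustrative only."""
    if reference == 0.0:
        return predicted == 0.0
    return abs(predicted - reference) <= rel_tol * abs(reference)
```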
arXiv Detail & Related papers (2025-01-30T00:06:55Z) - FinSphere: A Conversational Stock Analysis Agent Equipped with Quantitative Tools based on Real-Time Database [7.268553732731626]
FinSphere is a conversational stock analysis agent. An integrated framework combines real-time data feeds, quantitative tools, and an instruction-tuned LLM.
arXiv Detail & Related papers (2025-01-08T07:50:50Z) - OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z) - CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) under Chinese context.
It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment.
The results show that GPT-4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z) - SuperCLUE-Fin: Graded Fine-Grained Analysis of Chinese LLMs on Diverse Financial Tasks and Applications [17.34850312139675]
SC-Fin is a pioneering evaluation framework tailored for Chinese-native financial large language models (FLMs).
It assesses FLMs across six financial application domains and twenty-five specialized tasks.
Using multi-turn, open-ended conversations that mimic real-life scenarios, SC-Fin measures models on a range of criteria.
arXiv Detail & Related papers (2024-04-29T19:04:35Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - InFoBench: Evaluating Instruction Following Ability in Large Language Models [57.27152890085759]
Decomposed Requirements Following Ratio (DRFR) is a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions.
We present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories.
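The DRFR metric described above scores a response as the fraction of its decomposed requirement questions that are satisfied. A minimal computation, assuming boolean per-requirement scoring (the exact scoring format is an assumption):

```python
def drfr(requirement_results: list[bool]) -> float:
    """Decomposed Requirements Following Ratio: the fraction of decomposed
    requirement checks a model's response satisfies.
    Boolean per-requirement results are an assumed input format."""
    if not requirement_results:
        raise ValueError("need at least one decomposed requirement")
    return sum(requirement_results) / len(requirement_results)
```

For instance, a response satisfying 3 of 4 decomposed requirements scores 0.75.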
arXiv Detail & Related papers (2024-01-07T23:01:56Z) - FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models [26.99936434072108]
FinDABench is a benchmark designed to evaluate the financial data analysis capabilities of Large Language Models. FinDABench aims to provide a measure for in-depth analysis of LLM abilities.
arXiv Detail & Related papers (2024-01-01T15:26:23Z) - Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams [26.318005637849915]
This study aims to assess the financial reasoning capabilities of Large Language Models (LLMs).
We leverage mock exam questions of the Chartered Financial Analyst (CFA) Program to conduct a comprehensive evaluation of ChatGPT and GPT-4.
We present an in-depth analysis of the models' performance and limitations, and estimate whether they would have a chance at passing the CFA exams.
arXiv Detail & Related papers (2023-10-12T19:28:57Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.