FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
- URL: http://arxiv.org/abs/2509.13160v1
- Date: Tue, 16 Sep 2025 15:13:13 GMT
- Title: FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
- Authors: Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang
- Abstract summary: FinSearchComp is the first fully open-source benchmark for realistic, open-domain financial search and reasoning. The benchmark includes 635 questions spanning global and Greater China markets. Grok 4 (web) tops the global subset, approaching expert-level accuracy.
- Score: 30.252244345152956
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and because time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly impacts performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
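The abstract does not specify the released data schema or grading protocol, so the following is only a minimal, hypothetical Python sketch of how an end-to-end evaluation over FinSearchComp-style questions might be organized. The `Question` fields, the `evaluate` harness, and the exact-match scoring are all assumptions for illustration, not the paper's actual pipeline.

```python
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable

# Hypothetical record layout for a FinSearchComp-style question;
# the released schema may use different field names.
@dataclass
class Question:
    qid: str
    task: str     # e.g. "Time-Sensitive Data Fetching", "Simple Historical Lookup",
                  # or "Complex Historical Investigation"
    market: str   # "global" or "greater_china"
    prompt: str
    answer: str   # expert-annotated reference answer

def evaluate(agent: Callable[[str], str], questions: list[Question]) -> dict[str, float]:
    """Score an agent per market subset with simple exact-match accuracy.
    The paper's grading (expert annotation plus multi-stage QA) is more
    involved; this is only a minimal harness sketch."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q in questions:
        prediction = agent(q.prompt)  # e.g. an LLM agent with web search / financial plugins
        totals[q.market] += 1
        if prediction.strip().lower() == q.answer.strip().lower():
            hits[q.market] += 1
    return {market: hits[market] / totals[market] for market in totals}

if __name__ == "__main__":
    sample = [Question("q1", "Simple Historical Lookup", "global",
                       "What was company X's FY2023 revenue?", "42.0B USD")]
    print(evaluate(lambda prompt: "42.0B USD", sample))  # {'global': 1.0}
```

Note that time-sensitive questions would in practice need a live ground-truth source rather than a fixed reference string; that complication is omitted from this sketch.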
Related papers
- BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment [12.163992099059461]
We introduce BizFinBench.v2, the first large-scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks, totaling 29,578 expert-level Q&A pairs. ChatGPT-5 achieves a prominent 61.5% accuracy in main tasks, though a substantial gap relative to financial experts persists. In online tasks, DeepSeek-R1 outperforms all other commercial LLMs.
arXiv Detail & Related papers (2026-01-10T02:51:53Z) - FinSight: Towards Real-World Financial Deep Research [68.31086471310773]
FinSight is a novel framework for producing high-quality, multimodal financial reports. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism. A two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports.
arXiv Detail & Related papers (2025-10-19T14:05:35Z) - FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis [110.5695516127813]
HisRubric is a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric. FinDeepResearch is a benchmark that comprises 64 listed companies from 8 financial markets across 4 languages. We conduct extensive experiments on FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only.
arXiv Detail & Related papers (2025-10-15T17:21:56Z) - FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning [10.985136487771364]
FinMR is a knowledge-intensive multimodal dataset designed to evaluate expert-level financial reasoning capabilities at a professional analyst's standard. It comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics. FinMR establishes itself as an essential benchmark tool for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.
arXiv Detail & Related papers (2025-10-09T06:49:55Z) - FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms. We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z) - FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs [15.230256296815565]
FinMaster is a benchmark designed to assess the capabilities of large language models (LLMs) in financial literacy, accounting, auditing, and consulting. FinMaster comprises three main modules: FinSim, FinSuite, and FinEval. Experiments reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 37% on complex scenarios.
arXiv Detail & Related papers (2025-05-18T11:47:55Z) - FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation [65.04104723843264]
We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance. FinDER focuses on annotating search-relevant evidence by domain experts, offering 5,703 query-evidence-answer triplets. By challenging models to retrieve relevant information from large corpora, FinDER offers a more realistic benchmark for evaluating RAG systems.
arXiv Detail & Related papers (2025-04-22T11:30:13Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - FinQA: A Dataset of Numerical Reasoning over Financial Data [52.7249610894623]
We focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents.
We propose a new large-scale dataset, FinQA, with question-answering pairs over financial reports, written by financial experts.
The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge.
arXiv Detail & Related papers (2021-09-01T00:08:14Z)