FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents
- URL: http://arxiv.org/abs/2411.05764v1
- Date: Fri, 08 Nov 2024 18:26:17 GMT
- Title: FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents
- Authors: Yilun Zhao, Yitao Long, Yuru Jiang, Chengye Wang, Weiyuan Chen, Hongjun Liu, Yiming Zhang, Xiangru Tang, Chen Zhao, Arman Cohan
- Abstract summary: FinDVer contains 2,400 expert-annotated examples, divided into three subsets: information extraction, numerical reasoning, and knowledge-intensive reasoning.
Our results show that even the current best-performing system, GPT-4o, still lags behind human experts.
- Score: 30.402524394183857
- Abstract: We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs in the context of understanding and analyzing long, hybrid-content financial documents. FinDVer contains 2,400 expert-annotated examples, divided into three subsets: information extraction, numerical reasoning, and knowledge-intensive reasoning, each addressing common scenarios encountered in real-world financial contexts. We assess a broad spectrum of LLMs under long-context and RAG settings. Our results show that even the current best-performing system, GPT-4o, still lags behind human experts. We further provide in-depth analysis of the long-context and RAG settings, Chain-of-Thought reasoning, and model reasoning errors, offering insights to drive future advancements. We believe that FinDVer can serve as a valuable benchmark for evaluating LLMs in claim verification over complex, expert-domain documents.
Related papers
- Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge [5.279257531335345]
This paper details the development of a high-performance, finance-specific Retrieval-Augmented Generation (RAG) system for the ACM-ICAIF '24 FinanceRAG competition.
We optimized performance through ablation studies on query expansion and corpus refinement during the pre-retrieval phase.
Notably, we introduced an efficient method for managing long context sizes during the generation phase, significantly improving response quality without sacrificing performance.
arXiv Detail & Related papers (2024-11-23T09:56:21Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency [13.561104321425045]
Large Language Models (LLMs) have demonstrated remarkable performance in data annotation tasks on general domain datasets.
We investigate the potential of LLMs as efficient data annotators for extracting relations in financial documents.
We demonstrate that the current state-of-the-art LLMs can be sufficient alternatives to non-expert crowdworkers.
arXiv Detail & Related papers (2024-03-26T23:32:52Z)
- FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
- FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models [26.99936434072108]
FinDABench is a benchmark designed to evaluate the financial data analysis capabilities of Large Language Models.
FinDABench aims to provide a measure for in-depth analysis of LLM abilities.
arXiv Detail & Related papers (2024-01-01T15:26:23Z)
- DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning [74.99318727786337]
We propose a Multiple Experts Fine-tuning Framework to build a financial large language model (LLM).
We build a financial instruction-tuning dataset named DISC-FIN-SFT, including instruction samples of four categories (consulting, NLP tasks, computing, and retrieval-augmented generation).
Evaluations conducted on multiple benchmarks demonstrate that our model performs better than baseline models in various financial scenarios.
arXiv Detail & Related papers (2023-10-23T11:33:41Z)
- Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
- Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs [48.87627426640621]
This research focuses on harnessing the potential of Large Language Models to comprehend critical information from financial reports.
We propose an Automated Financial Information Extraction framework that enhances LLMs' ability to comprehend and extract information from financial reports.
Our framework is effectively validated on GPT-3.5 and GPT-4, yielding average accuracy increases of 53.94% and 33.77%, respectively.
arXiv Detail & Related papers (2023-05-24T10:35:58Z)
- FinQA: A Dataset of Numerical Reasoning over Financial Data [52.7249610894623]
We focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents.
We propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts.
The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge.
arXiv Detail & Related papers (2021-09-01T00:08:14Z)