FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
- URL: http://arxiv.org/abs/2504.16188v1
- Date: Tue, 22 Apr 2025 18:25:17 GMT
- Title: FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
- Authors: Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Charese H. Smiley,
- Abstract summary: FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. The highest Macro F1 scores for pre-trained language model (PLM) and large language model (LLM) baselines are 74.57% and 78.62%, respectively.
- Score: 7.091661099911842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference (FinNLI) across diverse financial texts such as SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for pre-trained language model (PLM) and large language model (LLM) baselines are 74.57% and 78.62%, respectively, highlighting the dataset's difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.
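The abstract reports results as Macro F1, which is the unweighted mean of the per-class F1 scores. The following is a minimal sketch of that metric, not the authors' evaluation code. The three-way label set (entailment, neutral, contradiction) is an assumption based on the usual NLI setup, and the function name `macro_f1` and the toy data are hypothetical.

```python
from collections import Counter

# Assumed three-way NLI label set; the abstract does not list the labels.
LABELS = ("entailment", "neutral", "contradiction")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy example with made-up labels (not FinNLI data).
gold = ["entailment", "entailment", "neutral", "contradiction"]
pred = ["entailment", "neutral", "neutral", "contradiction"]
print(round(macro_f1(gold, pred), 4))
```

Because every class counts equally, a model that does well only on the most frequent label is penalized more heavily than it would be under plain accuracy.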
Related papers
- MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation [89.73542209537148]
MultiFinBen is the first multilingual and multimodal benchmark tailored to the global financial domain.
We introduce two novel tasks, including EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks.
We propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark.
arXiv Detail & Related papers (2025-06-16T22:01:49Z)
- M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset [18.752133381125564]
M$^3$FinMeeting is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding.
First, it supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts.
Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS).
Third, it includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding.
arXiv Detail & Related papers (2025-06-03T06:41:09Z)
- FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models [24.430050834440998]
FinBERT2 is a specialized bidirectional encoder pretrained on a high-quality, financial-specific corpus of 32b tokens.
Discriminative fine-tuned models (Fin-Labelers) outperform other (Fin)BERT variants by 0.4%-3.3% and leading LLMs by 9.7%-12.3% on average across five financial classification tasks.
Fin-TopicModel enables superior clustering and topic representation for financial titles.
arXiv Detail & Related papers (2025-05-31T13:59:44Z)
- FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs [15.230256296815565]
FinMaster is a benchmark designed to assess the capabilities of large language models (LLMs) in financial literacy, accounting, auditing, and consulting.
FinMaster comprises three main modules: FinSim, FinSuite, and FinEval.
Experiments reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 37% on complex scenarios.
arXiv Detail & Related papers (2025-05-18T11:47:55Z)
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications [88.96861155804935]
We introduce Open-FinLLMs, the first open-source multimodal financial LLMs.
FinLLaMA is pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct, fine-tuned with 573K financial instructions; and FinLLaVA, enhanced with 1.43M multimodal tuning pairs.
We evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings.
arXiv Detail & Related papers (2024-08-20T16:15:28Z)
- SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models [6.639972934967109]
Large language models (LLMs) have become powerful tools for advancing natural language processing applications in the financial industry.
We propose a novel large language model specifically designed for the Chinese financial domain, named SNFinLLM.
SNFinLLM excels in domain-specific tasks such as answering questions, summarizing financial research reports, analyzing sentiment, and executing financial calculations.
arXiv Detail & Related papers (2024-08-05T08:24:24Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- NumLLM: Numeric-Sensitive Large Language Model for Chinese Finance [15.662858834827444]
FinLLMs exhibit unsatisfactory performance in understanding financial text when numeric variables are involved in questions.
We propose a numeric-sensitive large language model (NumLLM) for Chinese finance.
Experiments on a financial question-answering benchmark show that NumLLM can boost the performance of the foundation model.
arXiv Detail & Related papers (2024-05-01T15:17:27Z)
- AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework [48.3060010653088]
We release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data.
We then use AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task.
arXiv Detail & Related papers (2024-03-19T09:45:33Z)
- FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
- Dólares or Dollars? Unraveling the Bilingual Prowess of Financial LLMs Between Spanish and English [67.48541936784501]
Toisón de Oro is the first framework that establishes instruction datasets, finetuned LLMs, and an evaluation benchmark for financial LLMs in Spanish jointly with English.
We construct a rigorously curated bilingual instruction dataset including over 144K Spanish and English samples from 15 datasets covering 7 tasks.
We evaluate our model and existing LLMs using FLARE-ES, the first comprehensive bilingual evaluation benchmark with 21 datasets covering 9 tasks.
arXiv Detail & Related papers (2024-02-12T04:50:31Z)
- Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing [22.754757518792395]
FinLMEval is a framework for Financial Language Model Evaluation.
This study compares the performance of encoder-only and decoder-only language models.
arXiv Detail & Related papers (2023-10-19T11:43:15Z)
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)