Related papers: Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts

Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts

URL: http://arxiv.org/abs/2510.23464v1
Date: Mon, 27 Oct 2025 16:03:20 GMT
Title: Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts
Authors: Nikesh Gyawali, Doina Caragea, Alex Vasenkov, Cornelia Caragea,
Abstract summary: We introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales.<n>The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance using the advanced ChatGPT-o3-pro model.<n>Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies.
Score: 45.13099538394587
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain required a large, expensive labeled dataset, making the sentence-level stance towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot with CoT prompting performs best compared to supervised baselines, and LLMs' performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance in the financial domain without requiring extensive labeled data.

Related papers

Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection [64.75447949495307]
Large language models (LLMs) have been widely applied across various domains of finance.<n> behavioral biases can lead to instability and uncertainty in decision-making.<n>mfmdscen is a benchmark for evaluating behavioral biases in mfmd across diverse economic scenarios.
arXiv Detail & Related papers (2026-01-08T22:00:32Z)
Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection [17.04809129025246]
FinFRE-RAG is a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language.<n>LLMs can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts.
arXiv Detail & Related papers (2025-12-15T07:09:11Z)
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions.<n>We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z)
FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance.<n>The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms.<n>We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z)
M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset [18.752133381125564]
$texttM$3$FinMeeting$ is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding.<n>First, it supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts.<n>Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS)<n>Third, it includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding.
arXiv Detail & Related papers (2025-06-03T06:41:09Z)
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making.<n>FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z)
Large Language Model Adaptation for Financial Sentiment Analysis [2.0499240875882]
Generalist language models tend to fall short in tasks specifically tailored for finance. Two foundation models with less than 1.5B parameters have been adapted using a wide range of strategies. We show that small LLMs have comparable performance to larger scale models, while being more efficient in terms of parameters and data.
arXiv Detail & Related papers (2024-01-26T11:04:01Z)
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models [11.154814189699735]
Large Language Models (LLMs) pre-trained on extensive corpora have demonstrated superior performance across various NLP tasks. We introduce a retrieval-augmented LLMs framework for financial sentiment analysis. Our approach achieves 15% to 48% performance gain in accuracy and F1 score.
arXiv Detail & Related papers (2023-10-06T05:40:23Z)
Instruct-FinGPT: Financial Sentiment Analysis by Instruction Tuning of General-Purpose Large Language Models [18.212210748797332]
We introduce a simple yet effective instruction tuning approach to address these issues. In the experiment, our approach outperforms state-of-the-art supervised sentiment analysis models.
arXiv Detail & Related papers (2023-06-22T03:56:38Z)
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLMs) based on fine-tuning LLaMA with instruction data. We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks. We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.