Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis
- URL: http://arxiv.org/abs/2506.04574v1
- Date: Thu, 05 Jun 2025 02:47:23 GMT
- Title: Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis
- Authors: Dimitris Vamvourellis, Dhagash Mehta,
- Abstract summary: We evaluate how various large language models (LLMs) align with human-labeled sentiment in a financial context.<n>Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task.<n>Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting.
- Score: 1.3812010983144802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive "System 1"-like thinking aligns more closely with human judgment compared to "System 2"-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.
Related papers
- Your AI, Not Your View: The Bias of LLMs in Investment Analysis [55.328782443604986]
Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data.<n>This paper offers the first quantitative analysis of confirmation bias in LLM-based investment analysis.<n>We observe a consistent preference for large-cap stocks and contrarian strategies across most models.
arXiv Detail & Related papers (2025-07-28T16:09:38Z) - Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks.<n>This study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the 'Magnificent Seven' technology companies.
arXiv Detail & Related papers (2025-07-24T20:10:27Z) - Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios [0.0]
ConDiFi is a benchmark that jointly evaluates divergent and convergent thinking in LLMs for financial tasks.<n> GPT-4o underperforms on Novelty and Actionability, while models like DeepSeek-R1 and Cohere Command R+ rank among the top for generating actionable, insights suitable for investment decisions.
arXiv Detail & Related papers (2025-07-24T12:47:29Z) - FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning [18.68776736676411]
FinEval-KR is a novel evaluation framework for quantifying large language models' knowledge and reasoning abilities.<n>Inspired by cognitive science, we propose a cognitive score to analyze capabilities in reasoning tasks across different cognitive levels.<n>Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy.
arXiv Detail & Related papers (2025-06-18T06:21:50Z) - Adaptive Thinking via Mode Policy Optimization for Social Language Agents [75.3092060637826]
We propose a framework to improve the adaptive thinking ability of language agents in dynamic social interactions.<n>Our framework advances existing research in three key aspects: (1) Multi-granular thinking mode design, (2) Context-aware mode switching across social interaction, and (3) Token-efficient reasoning via depth-adaptive processing.
arXiv Detail & Related papers (2025-05-04T15:39:58Z) - Understanding Financial Reasoning in AI: A Multimodal Benchmark and Error Learning Approach [6.911426601915051]
This paper introduces a new benchmark designed to evaluate how well AI models - especially large language and multimodal models - reason in finance-specific contexts.<n>We propose an error-aware learning framework that leverages historical model mistakes and feedback to guide inference, without requiring fine-tuning.<n>The results highlight persistent challenges in visual understanding and mathematical logic, while also demonstrating the promise of self-reflective reasoning in financial AI systems.
arXiv Detail & Related papers (2025-04-22T07:25:03Z) - Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models [51.85792055455284]
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks.<n>System 1 reasoning is computationally efficient but leads to suboptimal performance.<n>System 2 reasoning often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors.
arXiv Detail & Related papers (2025-03-31T17:58:07Z) - Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance [32.516564836540745]
Large language models (LLMs) have shown strong general reasoning capabilities, but their effectiveness in financial reasoning remains underexplored.<n>We evaluate 24 state-of-the-art general and reasoning-focused LLMs across four complex financial reasoning tasks.<n>We propose two domain-adapted models, Fino1-8B and FinoB, trained with chain-of-thought (CoT) fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-02-12T05:13:04Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications [2.2661367844871854]
Large Language Models (LLMs) can be used in this context, but they are not finance-specific and tend to require significant computational resources.
We introduce a novel approach based on the Llama 2 7B foundational model, in order to benefit from its generative nature and comprehensive language manipulation.
This is achieved by fine-tuning the Llama2 7B model on a small portion of supervised financial sentiment analysis data.
arXiv Detail & Related papers (2024-03-18T22:11:00Z) - Are LLMs Rational Investors? A Study on Detecting and Reducing the Financial Bias in LLMs [44.53203911878139]
Large Language Models (LLMs) are increasingly adopted in financial analysis for interpreting complex market data and trends.
Financial Bias Indicators (FBI) is a framework with components like Bias Unveiler, Bias Detective, Bias Tracker, and Bias Antidote.
We evaluate 23 leading LLMs and propose a de-biasing method based on financial causal knowledge.
arXiv Detail & Related papers (2024-02-20T04:26:08Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - Synergy-of-Thoughts: Eliciting Efficient Reasoning in Hybrid Language Models [19.466985579720507]
Large language models (LLMs) have shown impressive emergent abilities in a wide range of tasks, but the associated expensive API cost greatly limits the real application.
We propose "Synergy of Thoughts"(SoT) to unleash the synergistic potential of hybrid LLMs with different scales for efficient reasoning.
We show that SoT substantially reduces the API cost by 38.3%-75.1%, and simultaneously achieves state-of-the-art reasoning accuracy and solution diversity.
arXiv Detail & Related papers (2024-02-04T16:45:01Z) - From Heuristic to Analytic: Cognitively Motivated Strategies for
Coherent Physical Commonsense Reasoning [66.98861219674039]
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions.
Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z) - Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models [51.3422222472898]
We document the capability of large language models (LLMs) like ChatGPT to predict stock price movements using news headlines.
We develop a theoretical model incorporating information capacity constraints, underreaction, limits-to-arbitrage, and LLMs.
arXiv Detail & Related papers (2023-04-15T19:22:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.