Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
- URL: http://arxiv.org/abs/2602.24060v1
- Date: Fri, 27 Feb 2026 14:49:05 GMT
- Title: Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
- Authors: Donghao Huang, Zhaoxia Wang
- Abstract summary: Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families. Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions.
- Score: 2.1036545320600095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families (including adaptive, conditional, and reinforcement learning-based reasoning architectures) on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence: binary classification degrades by up to 19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to 16.0 pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.
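The abstract's Pareto frontier analysis can be illustrated with a minimal sketch. This is not the authors' code: the `Config` type, the `pareto_frontier` helper, the configuration names, and every number below are hypothetical placeholders chosen for illustration, not results from the paper. The sketch assumes each configuration is summarized by a relative compute cost (lower is better) and an F1 score (higher is better), and keeps only the configurations that no other configuration beats on both axes.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str    # hypothetical configuration label
    cost: float  # relative compute cost, lower is better (assumed metric)
    f1: float    # F1 in percent, higher is better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configurations that are not dominated by any other.

    A configuration is dominated if another one costs no more and scores
    at least as well, with a strict improvement on at least one axis.
    Exact duplicates are reported once.
    """
    frontier: list[Config] = []
    best_f1 = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try the best F1 first.
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.f1)):
        # Keep a configuration only if it beats every cheaper-or-equal one on F1.
        if cfg.f1 > best_f1:
            frontier.append(cfg)
            best_f1 = cfg.f1
    return frontier


# Placeholder values for illustration only; NOT taken from the paper.
configs = [
    Config("base-zero-shot", 1.0, 80.0),
    Config("base-few-shot", 1.5, 84.0),
    Config("reasoning-zero-shot", 10.0, 78.0),
    Config("reasoning-few-shot", 12.0, 86.0),
]

for cfg in pareto_frontier(configs):
    print(cfg.name, cfg.cost, cfg.f1)
```

With these placeholder values, the costlier reasoning configuration that scores below a cheaper base configuration is dropped, while the one that scores higher than everything cheaper stays on the frontier. That is the sense in which extra compute is "justified" only when it buys accuracy no cheaper option reaches.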
Related papers
- Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models [0.0]
We present a controlled study of multi-hop contextual reasoning in large language models. We show that multi-agent systems show the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail.
arXiv Detail & Related papers (2026-01-06T20:18:55Z)
- Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents [52.14392337070763]
We introduce CFG-Bench, a new benchmark designed to systematically evaluate fine-grained action intelligence. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions.
arXiv Detail & Related papers (2025-11-24T02:02:29Z)
- Cognitive Foundations for Reasoning and Their Manifestation in LLMs [63.12951576410617]
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems.
arXiv Detail & Related papers (2025-11-20T18:59:00Z)
- An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR [0.0]
GPT-4.1-Mini consistently achieved the highest overall accuracy across all architectures. Each model exhibited distinct sensitivity patterns to architectural design, underscoring that reasoning effectiveness remains model-specific.
arXiv Detail & Related papers (2025-11-14T22:50:22Z)
- Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z)
- Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models [6.312798900093575]
Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning. This paper focuses on the fundamental tradeoff between accuracy and overthinking. We introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token-efficiency for holistic model evaluation.
arXiv Detail & Related papers (2025-07-05T12:31:17Z)
- ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs [54.154593699263074]
ProtoReasoning is a framework that enhances the reasoning ability of Large Reasoning Models. ProtoReasoning transforms problems into corresponding prototype representations. ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning.
arXiv Detail & Related papers (2025-06-18T07:44:09Z)
- Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning [22.32435186013626]
We propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks, we identify a concise three-node linear causal structure that reliably explains the observed performance variations.
arXiv Detail & Related papers (2025-06-12T06:07:42Z)
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Follow the Path: Reasoning over Knowledge Graph Paths to Improve LLM Factuality [13.453581744429352]
We introduce fs1, a method that improves the factuality of reasoning traces by sourcing them from large reasoning models. We fine-tune eight instruction-tuned Large Language Models (LLMs) on 3.9K factually grounded reasoning traces. Our results demonstrate that our fs1-tuned model consistently outperforms instruction-tuned counterparts with parallel sampling by 6-14 absolute points.
arXiv Detail & Related papers (2025-05-16T11:39:33Z)
- Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models [1.4956870931936515]
We introduce RDoLT, a novel framework that significantly boosts the reasoning performance of Large Language Models. RDoLT is built on three key innovations: (1) breaking down complex reasoning tasks into sub-tasks of progressive complexity; (2) employing an advanced selection and scoring mechanism to identify the most promising reasoning thoughts; and (3) integrating a knowledge propagation module that mimics human learning. Our approach was evaluated across multiple benchmarks, including GSM8K, SVAMP, MultiArith, LastLetterConcatenation, and Gaokao2023 Math.
arXiv Detail & Related papers (2025-01-03T02:55:44Z)
- Self-Discover: Large Language Models Self-Compose Reasoning Structures [136.48389510481758]
We introduce SELF-DISCOVER, a framework for self-discovering task-intrinsic reasoning structures.
SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks.
We show that the self-discovered reasoning structures are universally applicable across model families.
arXiv Detail & Related papers (2024-02-06T01:13:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.