Related papers: Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

URL: http://arxiv.org/abs/2503.11655v2
Date: Mon, 30 Jun 2025 12:58:45 GMT
Title: Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning
Authors: Donghao Huang, Zhaoxia Wang,
Abstract summary: DeepSeek-R1 is an open-source reasoning model against OpenAI's GPT-4o and GPT-4o-mini.<n>Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots.
Score: 1.1318175666743655
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1--an open-source reasoning model--against OpenAI's GPT-4o and GPT-4o-mini. We test the full 671B model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show DeepSeek-R1 achieves a 91.39\% F1 score on 5-class sentiment and 99.31\% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step traces, establishing it as a powerful, interpretable open-source alternative.

Related papers

Skywork Open Reasoner 1 Technical Report [51.403686909760914]
We present Skywork-OR1, an effective and scalable reinforcement learning (RL) implementation for long Chain-of-Thought (CoT) models.<n>Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains.<n>Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks.
arXiv Detail & Related papers (2025-05-28T12:56:04Z)
Scaling Reasoning can Improve Factuality in Large Language Models [7.184302333801519]
We thoroughly examine large language model (LLM) reasoning within complex open-domain question-answering (QA) scenarios.<n>To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces.<n>Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts.
arXiv Detail & Related papers (2025-05-16T11:39:33Z)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example [133.018487956408]
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs)<n>We identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated.
arXiv Detail & Related papers (2025-04-29T09:24:30Z)
R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.<n>These findings highlight the reliance on recall over rigorous logical inference.<n>This paper introduces a novel benchmark, termed as Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs [0.0]
We present SmolTulu, an instruction-tuned language model that adapts AllenAI's Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model.<n>Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios.
arXiv Detail & Related papers (2024-12-11T12:41:36Z)
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3. While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
PatchAD: A Lightweight Patch-based MLP-Mixer for Time Series Anomaly Detection [11.236001767352676]
Time series anomaly detection is a pivotal task in data analysis, yet it poses the challenge of discerning normal and abnormal patterns in label-deficient scenarios.<n>We present PatchAD, our novel, highly efficient multiscale patch-based-Mixer architecture that utilizes contrastive learning for representation extraction and anomaly detection.
arXiv Detail & Related papers (2024-01-18T08:26:33Z)
Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore [71.09522172098733]
We utilize graph representation in FSAD and provide a novel visual invariant feature (VIIF) as anomaly measurement feature. VIIF can robustly improve the anomaly discriminating ability and can further reduce the size of redundant features stored in M. Besides, we provide a novel model GraphCore via VIIFs that can fast implement unsupervised FSAD training and can improve the performance of anomaly detection.
arXiv Detail & Related papers (2023-01-28T03:58:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.