A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
- URL: http://arxiv.org/abs/2509.14922v1
- Date: Thu, 18 Sep 2025 12:59:07 GMT
- Title: A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
- Authors: Kian Tohidi, Kia Dashtipour, Simone Rebora, Sevda Pourfaramarz
- Abstract summary: This study presents a comparative evaluation of four large language models (LLMs) for sentiment analysis and emotion detection in Persian social media texts. The results show that all models reach an acceptable level of performance, and a statistical comparison of the best three models indicates no significant differences among them. The findings indicate that emotion detection is more challenging than sentiment analysis for all models, and the misclassification patterns reflect some of the challenges posed by Persian-language texts.
- Score: 2.820011731460364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study presents a comprehensive comparative evaluation of four state-of-the-art Large Language Models (LLMs)--Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, and GPT-4o--for sentiment analysis and emotion detection in Persian social media texts. Comparative analyses of LLMs have become much more common in recent years; however, most have been conducted on English-language tasks, leaving gaps in the understanding of cross-linguistic performance patterns. This research addresses these gaps through a rigorous experimental design using balanced Persian datasets containing 900 texts for sentiment analysis (positive, negative, neutral) and 1,800 texts for emotion detection (anger, fear, happiness, hate, sadness, surprise). The main focus was to allow a direct and fair comparison among the models by using consistent prompts and uniform processing parameters, and by analyzing performance metrics such as precision, recall, and F1-score, along with misclassification patterns. The results show that all models reach an acceptable level of performance, and a statistical comparison of the best three models indicates no significant differences among them. However, GPT-4o demonstrated a marginally higher raw accuracy on both tasks, while Gemini 2.0 Flash proved to be the most cost-efficient. The findings indicate that emotion detection is more challenging than sentiment analysis for all models, and the misclassification patterns reflect some of the challenges posed by Persian-language texts. These findings establish performance benchmarks for Persian NLP applications and offer practical guidance for model selection based on accuracy, efficiency, and cost, while revealing cultural and linguistic challenges that require consideration when deploying multilingual AI systems.
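The abstract names precision, recall, F1-score, and misclassification patterns as the quantities compared across models. The following is a minimal sketch of how such metrics can be computed from paired gold and predicted labels, not the authors' evaluation code. The function names (`per_class_metrics`, `macro_f1`, `top_confusions`) and the six-item toy label lists are hypothetical and chosen only for illustration; the label set is the three sentiment classes listed in the abstract.

```python
from collections import Counter

# Sentiment classes as listed in the abstract.
LABELS = ["positive", "negative", "neutral"]


def per_class_metrics(gold, pred, labels):
    """Return {label: (precision, recall, f1)} from paired label lists."""
    pairs = Counter(zip(gold, pred))  # (gold, predicted) -> count
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (g, p), n in pairs.items() if p == label and g != label)
        fn = sum(n for (g, p), n in pairs.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


def top_confusions(gold, pred, k=3):
    """Most frequent (gold, predicted) pairs among the misclassified items."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(k)


# Hypothetical toy data, not taken from the paper.
gold = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
pred = ["positive", "neutral", "negative", "negative", "neutral", "positive"]

metrics = per_class_metrics(gold, pred, LABELS)
for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1: {macro_f1(metrics):.3f}")
print("most frequent confusions:", top_confusions(gold, pred))
```

On a balanced dataset such as the ones described, macro-averaged and support-weighted F1 coincide, so the unweighted mean used here is a reasonable single summary. The confusion pairs are the raw material for the kind of misclassification-pattern analysis the abstract mentions.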
Related papers
- LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models [49.92148175114169]
We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions. Models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely.
arXiv Detail & Related papers (2025-10-15T14:51:36Z)
- Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning [0.0]
This paper presents a benchmark of several open-source Large Language Models (LLMs) for Persian Natural Language Processing tasks. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms.
arXiv Detail & Related papers (2025-10-05T10:10:04Z)
- An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models [2.1945750784330067]
This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source). We assessed models on seven diverse datasets using metrics for factual consistency, semantic similarity, lexical overlap, and human-like quality.
arXiv Detail & Related papers (2025-04-06T16:24:22Z)
- Fine-Tuning Language Models for Ethical Ambiguity: A Comparative Study of Alignment with Human Responses [1.566834021297545]
Language models often misinterpret human intentions due to their handling of ambiguity.
We show that human and LLM judgments are poorly aligned in morally ambiguous contexts.
Our fine-tuning approach, which improves the model's understanding of text distributions in a text-to-text format, effectively enhances performance and alignment.
arXiv Detail & Related papers (2024-10-10T11:24:04Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection [1.4779899760345434]
This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection.
We composed a multilingual and multi-topical dataset comprising texts of various sources and styles.
arXiv Detail & Related papers (2023-11-10T15:36:35Z)
- SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL).
SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG).
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
- Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect, with high accuracy, samples that cause oversensitivity and overstability.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.