Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
- URL: http://arxiv.org/abs/2506.18501v3
- Date: Sat, 05 Jul 2025 12:37:20 GMT
- Title: Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
- Authors: Wael Etaiwi, Bushra Alhijawi
- Abstract summary: This study evaluates ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.
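To make the experimental protocol concrete, the loop below is a minimal sketch of how such a comparison might be organized: one neutral prompt template per task, two benchmark datasets per task, and both models queried on identical inputs. The dataset names, prompt wordings, and the `load_dataset`, `query_chatgpt`, `query_deepseek`, and `score` callables are hypothetical placeholders for illustration, not details taken from the paper.

```python
# Minimal sketch (illustrative only) of the evaluation protocol described above.
# All dataset names, prompt templates, and the callables passed to
# run_evaluation() are hypothetical placeholders, not details from the paper.

TASKS = {
    "sentiment_analysis":   ["dataset_a", "dataset_b"],
    "topic_classification": ["dataset_c", "dataset_d"],
    "text_summarization":   ["dataset_e", "dataset_f"],
    "machine_translation":  ["dataset_g", "dataset_h"],
    "textual_entailment":   ["dataset_i", "dataset_j"],
}

# Identical, neutral prompt templates shared by both models.
PROMPTS = {
    "sentiment_analysis":   "Classify the sentiment of the following text as positive, negative, or neutral:\n{text}",
    "topic_classification": "Assign the most appropriate topic label to the following text:\n{text}",
    "text_summarization":   "Summarize the following text in a few sentences:\n{text}",
    "machine_translation":  "Translate the following text into English:\n{text}",
    "textual_entailment":   ("Premise: {premise}\nHypothesis: {hypothesis}\n"
                             "Answer with entailment, contradiction, or neutral."),
}

def run_evaluation(load_dataset, query_chatgpt, query_deepseek, score):
    """Query both models on every (task, dataset) pair using the same prompts."""
    results = {}
    for task, datasets in TASKS.items():
        for dataset_name in datasets:
            # Each example holds the task-specific input fields plus a reference "label".
            examples = load_dataset(dataset_name)
            for model_name, query in [("ChatGPT", query_chatgpt), ("DeepSeek", query_deepseek)]:
                outputs = [query(PROMPTS[task].format(**ex)) for ex in examples]
                references = [ex["label"] for ex in examples]
                results[(task, dataset_name, model_name)] = score(task, outputs, references)
    return results
```

Holding the prompt fixed per task isolates differences between the two models from prompt-engineering effects, which is the purpose of the paper's "identical, neutral prompts" constraint.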
Related papers
- Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks [23.34710429552906]
We evaluate the capabilities and limitations of five prevalent Large Language Models: Llama, OPT, Falcon, Alpaca, and MPT. The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation.
arXiv Detail & Related papers (2024-11-26T08:21:24Z)
- Narrative Action Evaluation with Prompt-Guided Multimodal Interaction [60.281405999483]
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action.
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
arXiv Detail & Related papers (2024-04-22T17:55:07Z)
- SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL).
SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG).
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a wide range of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Pushing the Limits of ChatGPT on NLP Tasks [79.17291002710517]
Despite the success of ChatGPT, its performance on most NLP tasks remains well below supervised baselines.
In this work, we investigate the causes and find that the subpar performance stems from several identifiable factors.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
arXiv Detail & Related papers (2023-06-16T09:40:05Z)
- A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets [19.521390684403293]
We present a thorough evaluation of ChatGPT's performance on diverse academic datasets.
Specifically, we evaluate ChatGPT across 140 tasks and analyze 255K responses it generates in these datasets.
arXiv Detail & Related papers (2023-05-29T12:37:21Z)
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored.
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)