A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks
- URL: http://arxiv.org/abs/2503.13549v1
- Date: Sun, 16 Mar 2025 14:35:36 GMT
- Title: A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks
- Authors: Ronas Shakya, Farhad Vadiee, Mohammad Khalil
- Abstract summary: This study evaluates two leading models, ChatGPT o3-mini and DeepSeek-R1, on their ability to solve competitive programming tasks from Codeforces. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks.
- Score: 2.66269503676104
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models, ChatGPT o3-mini and DeepSeek-R1, on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks spanning three difficulty levels (easy, medium, and hard), we assessed both models by their accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek-R1's 18.1%. Both models struggled with hard tasks, highlighting the ongoing challenges LLMs face in handling highly complex programming problems. These findings reveal key differences in the models' capabilities and computational efficiency, offering valuable insights for developers and researchers working to advance AI-driven programming tools.
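To make the reported figures concrete, below is a minimal, hypothetical Python sketch of the per-model, per-difficulty tally the study describes. The `success_rates` helper and the verdict list are illustrative assumptions rather than the authors' code; the 6-of-11 and 2-of-11 medium-task splits are merely consistent with the reported 54.5% and 18.1%, since the abstract does not give per-task verdicts.

```python
# Minimal, hypothetical sketch of the per-model, per-difficulty tally the
# abstract describes. The verdict data below is ILLUSTRATIVE: 6/11 and 2/11
# medium-task acceptances are consistent with the reported 54.5% and 18.1%,
# but the actual per-task verdicts are not given in the abstract.
from collections import defaultdict

def success_rates(verdicts):
    """verdicts: iterable of (model, difficulty, accepted) triples."""
    totals, passed = defaultdict(int), defaultdict(int)
    for model, difficulty, accepted in verdicts:
        totals[(model, difficulty)] += 1
        passed[(model, difficulty)] += int(accepted)
    return {key: passed[key] / totals[key] for key in totals}

# Assumed 11 medium-difficulty tasks per model, matching the reported rates.
verdicts = (
    [("ChatGPT o3-mini", "medium", True)] * 6
    + [("ChatGPT o3-mini", "medium", False)] * 5
    + [("DeepSeek-R1", "medium", True)] * 2
    + [("DeepSeek-R1", "medium", False)] * 9
)
for key, rate in sorted(success_rates(verdicts).items()):
    print(key, f"{rate:.1%}")  # ~54.5% vs ~18.2% (reported as 18.1%)
```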
Related papers
- Affordable AI Assistants with Knowledge Graph of Thoughts [15.045446816762675]
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains.
We propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs).
KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini, while reducing costs by over 36x compared to GPT-4o.
arXiv Detail & Related papers (2025-04-03T15:11:55Z)
- ChatGPT vs. DeepSeek: A Comparative Study on AI-Based Code Generation [0.0]
This research compares ChatGPT and DeepSeek for Python code generation using online judge coding challenges.
It evaluates correctness (online-judge verdicts, up to three attempts), code quality (Pylint/Flake8), and efficiency (execution time and memory usage); a hypothetical sketch of this retry-and-lint protocol appears after this list.
DeepSeek demonstrated higher correctness, particularly on algorithmic tasks, often achieving 'Accepted' on the first attempt.
arXiv Detail & Related papers (2025-01-30T16:14:48Z)
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- Guiding Through Complexity: What Makes Good Supervision for Hard Math Reasoning Tasks? [74.88417042125985]
We investigate various data-driven strategies that offer supervision data at different quality levels on tasks of varying complexity. We find that even when the outcome error rate for hard task supervision is high, training on such data can outperform perfectly correct supervision of easier subtasks. Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield notable performance improvements.
arXiv Detail & Related papers (2024-10-27T17:55:27Z)
- Benchmarking ChatGPT, Codeium, and GitHub Copilot: A Comparative Study of AI-Driven Programming and Debugging Assistants [0.0]
Large language models (LLMs) have become essential for tasks like code generation, bug fixing, and optimization.
This paper presents a comparative study of ChatGPT, Codeium, and GitHub Copilot, evaluating their performance on LeetCode problems.
arXiv Detail & Related papers (2024-09-30T03:53:40Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems; the best model (GPT-4o) solves only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Rocks Coding, Not Development--A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks [9.455579863269714]
We examined whether, and to what degree, working with ChatGPT helps in coding tasks and in typical software development tasks.
We found that while ChatGPT performed well in solving simple coding problems, it was notably weaker at supporting typical software development tasks.
Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers.
arXiv Detail & Related papers (2024-02-08T13:07:31Z)
- Evaluating GPT's Programming Capability through CodeWars' Katas [0.5512295869673147]
This paper presents a novel evaluation of the programming proficiency of Generative Pretrained Transformer (GPT) models.
The experiments reveal a distinct boundary at the 3 kyu level, beyond which these GPT models struggle to provide solutions.
The research emphasizes the need for validation and creative thinking capabilities in AI models to better emulate human problem-solving techniques.
arXiv Detail & Related papers (2023-05-31T10:36:16Z)
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them [108.54545521369688]
We focus on a suite of 23 challenging BIG-Bench tasks, which we call BIG-Bench Hard (BBH).
We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex to surpass the average human-rater performance on 17 of the 23 tasks.
arXiv Detail & Related papers (2022-10-17T17:08:26Z)
- Competition-Level Code Generation with AlphaCode [74.87216298566942]
We introduce AlphaCode, a system for code generation that can create novel solutions to problems that require deeper reasoning.
In simulated evaluations on recent programming competitions on the Codeforces platform, AlphaCode achieved an average ranking in the top 54.3%.
arXiv Detail & Related papers (2022-02-08T23:16:31Z)
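To illustrate the retry-based evaluation protocol summarized in the ChatGPT vs. DeepSeek comparative study above, here is a minimal Python sketch. The `run_model` and `judge` callables are hypothetical stand-ins for the LLM API and the online judge, and `pylint_report` is an assumed helper; none of this is the authors' code. Only the three-attempt budget and the use of Pylint come from the summary.

```python
# Hypothetical sketch of the "up to three attempts" correctness protocol
# with a Pylint quality check, as summarized above. run_model() and judge()
# are assumed stand-ins for the LLM API and the online judge; neither is
# taken from the paper itself.
import subprocess
import tempfile

def solve_with_retries(task, run_model, judge, max_attempts=3):
    """Query the model, resubmitting with the judge's verdict as feedback."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = run_model(task, feedback)   # candidate Python solution
        verdict = judge(task, code)        # e.g. "Accepted", "Wrong answer"
        if verdict == "Accepted":
            return attempt, code
        feedback = verdict                 # retry, telling the model why
    return None, code                      # failed within the attempt budget

def pylint_report(code: str) -> str:
    """Run Pylint on a candidate solution and return its report text."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["pylint", path], capture_output=True, text=True)
    return result.stdout
```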