Analyzing Prominent LLMs: An Empirical Study of Performance and Complexity in Solving LeetCode Problems
- URL: http://arxiv.org/abs/2508.03931v1
- Date: Tue, 05 Aug 2025 21:50:52 GMT
- Title: Analyzing Prominent LLMs: An Empirical Study of Performance and Complexity in Solving LeetCode Problems
- Authors: Everton Guimaraes, Nathalia Nascimento, Chandan Shivalingaiah, Asish Nelapati,
- Abstract summary: Large Language Models (LLMs) like ChatGPT, Copilot, Gemini, and DeepSeek are transforming software engineering by automating key tasks. This study benchmarks these four prominent LLMs on one hundred and fifty LeetCode problems across easy, medium, and hard difficulties. We evaluate each model based on execution time, memory usage, and algorithmic complexity, revealing significant performance differences.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) like ChatGPT, Copilot, Gemini, and DeepSeek are transforming software engineering by automating key tasks, including code generation, testing, and debugging. As these models become integral to development workflows, a systematic comparison of their performance is essential for optimizing their use in real-world applications. This study benchmarks these four prominent LLMs on one hundred and fifty LeetCode problems across easy, medium, and hard difficulties, generating solutions in Java and Python. We evaluate each model based on execution time, memory usage, and algorithmic complexity, revealing significant performance differences. ChatGPT demonstrates consistent efficiency in execution time and memory usage, while Copilot and DeepSeek show variability as task complexity increases. Gemini, although effective on simpler tasks, requires more attempts as problem difficulty rises. Our findings provide actionable insights into each model's strengths and limitations, offering guidance for developers selecting LLMs for specific coding tasks and insight into the performance and complexity of the solutions generated by these GPT-like models.
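The abstract does not describe the measurement harness itself, but the per-solution metrics it names (execution time and memory usage) can be illustrated with a minimal Python sketch. The `profile_solution` helper and the `two_sum` stand-in below are hypothetical, assuming a LeetCode-style function signature; they are not the authors' actual benchmarking code.

```python
import time
import tracemalloc
from typing import Any, Callable


def profile_solution(solution: Callable[..., Any], *args: Any) -> tuple[Any, float, float]:
    """Run one candidate solution and return (result, elapsed_ms, peak_kib)."""
    tracemalloc.start()                        # track Python-level allocations
    start = time.perf_counter()
    result = solution(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    _, peak = tracemalloc.get_traced_memory()  # peak bytes since start()
    tracemalloc.stop()
    return result, elapsed_ms, peak / 1024.0


# Hypothetical stand-in for an LLM-generated solution to LeetCode "Two Sum".
def two_sum(nums: list[int], target: int) -> list[int]:
    seen: dict[int, int] = {}                  # value -> index
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []


if __name__ == "__main__":
    result, ms, kib = profile_solution(two_sum, list(range(10_000)), 19_997)
    print(f"result={result}  time={ms:.2f} ms  peak_mem={kib:.1f} KiB")
```

Note that tracemalloc only records allocations made through Python's allocator, and the paper's measurement setup is not detailed here, so figures from a sketch like this would not be directly comparable to the study's reported results.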
Related papers
- EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflows.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
- Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents [40.37993572657772]
We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions. We demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
arXiv Detail & Related papers (2025-05-30T19:23:51Z)
- A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks [2.66269503676104]
This study evaluates two leading models, ChatGPT o3-mini and DeepSeek-R1, on their ability to solve competitive programming tasks from Codeforces. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks.
arXiv Detail & Related papers (2025-03-16T14:35:36Z)
- DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation [20.75363011870647]
DynaCode is a dynamic, complexity-aware benchmark for large language models (LLMs). It evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. Results on 12 recent LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark.
arXiv Detail & Related papers (2025-03-13T15:18:56Z)
- Performance Review on LLM for solving leetcode problems [7.377558533352298]
This paper presents a comprehensive performance evaluation of Large Language Models (LLMs) in solving programming challenges from LeetCode. We generated solutions with multiple LLMs, including GPT-4 and GPT-3.5-turbo. Our results highlight the strengths and limitations of current LLMs in code generation and problem-solving tasks.
arXiv Detail & Related papers (2025-02-16T08:52:45Z)
- Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making.
Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance.
We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- PECC: Problem Extraction and Coding Challenges [3.287942619833188]
We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler.
Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code.
Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
arXiv Detail & Related papers (2024-04-29T15:02:14Z)
- Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)