Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis
- URL: http://arxiv.org/abs/2411.07529v1
- Date: Tue, 12 Nov 2024 04:01:09 GMT
- Title: Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis
- Authors: Minda Li, Bhaskar Krishnamachari
- Abstract summary: We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode.
We show ChatGPT solves fewer problems as difficulty rises.
Second, prompt engineering improves ChatGPT's performance.
Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket.
- Score: 6.123324869194196
- License:
- Abstract: ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.
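As a reading aid, the following is a minimal sketch of the kind of automated pipeline the abstract describes: a Python script that prompts gpt-3.5-turbo for a Python solution to a LeetCode problem (optionally with Chain-of-Thought phrasing), stores the result for manual submission, and can send a second feedback prompt containing failed test cases. This is not the authors' released code; the file layout, function names, and prompt wording are illustrative assumptions, and it assumes the official openai Python package (v1+).

```python
"""Hedged sketch of the prompting pipeline described in the abstract.
Not the authors' code: problem file paths, prompt wording, and helper
names are assumptions made for illustration only."""
from pathlib import Path
from openai import OpenAI  # official openai package, v1+ API

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_for_solution(problem_statement: str, chain_of_thought: bool = False) -> str:
    """Request a Python solution for one LeetCode problem."""
    instructions = ("Write a correct Python 3 solution for the following "
                    "LeetCode problem. Return only code.")
    if chain_of_thought:
        # Chain-of-Thought prompting: ask the model to reason step by step first.
        instructions = ("Think through the problem step by step, then write a "
                        "correct Python 3 solution. Return only the final code.")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{instructions}\n\n{problem_statement}"}],
    )
    return response.choices[0].message.content


def refine_with_failed_tests(problem_statement: str, previous_code: str,
                             failed_cases: list[str]) -> str:
    """Second feedback prompt: show the model the test cases its first attempt failed."""
    prompt = (f"The following solution fails on the test cases listed below.\n\n"
              f"Problem:\n{problem_statement}\n\n"
              f"Previous solution:\n{previous_code}\n\n"
              f"Failed test cases:\n" + "\n".join(failed_cases) +
              "\n\nReturn a corrected Python 3 solution only.")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    problem = Path("problems/two_sum.txt").read_text()  # hypothetical problem file
    code = ask_for_solution(problem, chain_of_thought=True)
    Path("solutions/two_sum.py").write_text(code)  # stored, then submitted manually on LeetCode
```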
Related papers
- Benchmarking ChatGPT on Algorithmic Reasoning [58.50071292008407]
We evaluate ChatGPT's ability to solve algorithm problems from the CLRS benchmark suite that is designed for GNNs.
We find that ChatGPT, using Python, successfully solves these problems and outperforms specialist GNN models.
arXiv Detail & Related papers (2024-04-04T13:39:06Z) - In-Context Principle Learning from Mistakes [75.66979331850364]
In-context learning (ICL) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples.
We revisit this paradigm by learning more from the few given input-output examples.
arXiv Detail & Related papers (2024-02-08T04:42:29Z) - Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - ChatGPT-4 with Code Interpreter can be used to solve introductory college-level vector calculus and electromagnetism problems [0.0]
We evaluated ChatGPT 3.5, 4, and 4 with Code Interpreter on a set of college-level engineering-math and electromagnetism problems.
ChatGPT-4 with Code Interpreter was able to satisfactorily solve most of the problems we tested on most attempts.
arXiv Detail & Related papers (2023-09-16T05:19:39Z) - Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues [17.7880460531813]
We systematically study the quality of 4,066 pieces of ChatGPT-generated code implemented in two popular programming languages.
We identify and characterize potential issues with the quality of ChatGPT-generated code.
We find that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement.
arXiv Detail & Related papers (2023-07-24T08:14:22Z) - Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures [0.6990493129893112]
We evaluate ChatGPT's ability to generate correct solutions to the problems fed to it, its code quality, and nature of run-time errors thrown by its code.
We examine patterns in the test cases that pass in order to gain insight into how incorrect ChatGPT's code is in such situations.
arXiv Detail & Related papers (2023-07-10T08:20:34Z) - Automatic Code Summarization via ChatGPT: How Far Are We? [10.692654700225411]
We evaluate ChatGPT on a widely-used Python dataset called CSN-Python.
In terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than all three SOTA models.
Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
arXiv Detail & Related papers (2023-05-22T09:43:40Z) - Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting [20.914970341922707]
ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs.
A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version.
We propose a novel approach that combines ChatGPT and differential testing to find failure-inducing test cases.
arXiv Detail & Related papers (2023-04-23T15:35:39Z) - A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z) - Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z) - Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)