Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures
- URL: http://arxiv.org/abs/2307.05360v3
- Date: Fri, 24 May 2024 20:28:09 GMT
- Title: Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures
- Authors: Sayed Erfan Arefin, Tasnia Ashrafi Heya, Hasan Al-Qudah, Ynes Ineza, Abdul Serwadda
- Abstract summary: We evaluate ChatGPT's ability to generate correct solutions to the problems fed to it, its code quality, and the nature of run-time errors thrown by its code.
We look into patterns in the test cases passed in order to gain insight into how wrong ChatGPT's code is in such situations.
- Score: 0.6990493129893112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformative influence of Large Language Models (LLMs) is profoundly reshaping the Artificial Intelligence (AI) technology domain. Notably, ChatGPT distinguishes itself within these models, demonstrating remarkable performance in multi-turn conversations and exhibiting code proficiency across an array of languages. In this paper, we carry out a comprehensive evaluation of ChatGPT's coding capabilities based on what is to date the largest catalog of coding challenges. Our focus is on the Python programming language and problems centered on data structures and algorithms, two topics at the very foundations of Computer Science. We evaluate ChatGPT for its ability to generate correct solutions to the problems fed to it, its code quality, and the nature of run-time errors thrown by its code. Where ChatGPT code executes successfully but fails to solve the problem at hand, we look into patterns in the test cases passed in order to gain insight into how wrong ChatGPT's code is in these situations. To infer whether ChatGPT might have directly memorized some of the data that was used to train it, we methodically design an experiment to investigate this phenomenon. Making comparisons with human performance whenever feasible, we investigate all of the above questions in the context of both of its underlying learning models (GPT-3.5 and GPT-4), on a vast array of sub-topics within the main topics, and on problems of varying degrees of difficulty.
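The evaluation the abstract describes, running generated solutions against per-problem test cases, treating run-time errors as failures, and studying which cases pass, can be pictured with a small harness. The sketch below is purely illustrative: the names (TESTS, run_candidate, summarize) are hypothetical and are not the authors' actual tooling.

```python
# Minimal sketch of a pass/fail harness for model-generated solutions.
# All names are hypothetical; the paper does not publish its evaluation
# infrastructure at this level of detail.
from typing import Callable, List, Tuple

# Each test case pairs a tuple of input arguments with the expected output.
TESTS: List[Tuple[tuple, object]] = [
    (([3, 1, 2],), [1, 2, 3]),  # e.g. a sorting problem
    (([],), []),
    (([5],), [5]),
]

def run_candidate(solution: Callable, tests) -> List[bool]:
    """Run a candidate solution on every test case, counting
    run-time errors (exceptions) as failed cases."""
    results = []
    for args, expected in tests:
        try:
            results.append(solution(*args) == expected)
        except Exception:
            results.append(False)  # run-time error: count as a failure
    return results

def summarize(results: List[bool]) -> str:
    """Report the pass pattern, the kind of signal used to study how
    wrong a partially correct solution is."""
    pattern = "".join("P" if ok else "F" for ok in results)
    return f"{sum(results)}/{len(results)} passed, pattern={pattern}"

print(summarize(run_candidate(sorted, TESTS)))  # 3/3 passed, pattern=PPP
```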
Related papers
- Benchmarking ChatGPT on Algorithmic Reasoning [58.50071292008407]
We evaluate ChatGPT's ability to solve algorithm problems from the CLRS benchmark suite that is designed for GNNs.
We find that ChatGPT outperforms specialist GNN models, using Python to successfully solve these problems.
arXiv Detail & Related papers (2024-04-04T13:39:06Z)
- Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
- Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues [17.7880460531813]
We systematically study the quality of 4,066 pieces of ChatGPT-generated code written in two popular programming languages.
We identify and characterize potential issues with the quality of ChatGPT-generated code.
We find that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement.
arXiv Detail & Related papers (2023-07-24T08:14:22Z)
- Extending the Frontier of ChatGPT: Code Generation and Debugging [0.0]
ChatGPT, developed by OpenAI, has ushered in a new era by utilizing artificial intelligence (AI) to tackle diverse problem domains.
This research paper delves into the efficacy of ChatGPT in solving programming problems, examining both the correctness and the efficiency of its solutions in terms of time and memory complexity.
The research reveals a commendable overall success rate of 71.875%, denoting the proportion of problems for which ChatGPT was able to provide correct solutions.
arXiv Detail & Related papers (2023-07-17T06:06:58Z)
- Automatic Code Summarization via ChatGPT: How Far Are We? [10.692654700225411]
We evaluate ChatGPT on a widely-used Python dataset called CSN-Python.
In terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than that of all three SOTA models (a sketch of these metrics appears after this entry).
Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
arXiv Detail & Related papers (2023-05-22T09:43:40Z)
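As context for the metrics above, BLEU scores n-gram overlap and ROUGE-L scores longest-common-subsequence overlap between a generated summary and a reference. The snippet below is a sketch assuming the nltk and rouge-score packages with made-up example strings; the paper does not state which implementations it used.

```python
# Sketch of a BLEU / ROUGE-L comparison between a generated code summary
# and a reference summary. The packages and example strings are assumptions,
# not the paper's actual setup.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "returns the sum of all elements in the list"
candidate = "computes the total of the list elements"

# BLEU: n-gram precision of the candidate against the reference,
# smoothed so that short texts do not score exactly zero.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: F-measure over the longest common subsequence of tokens.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}")
```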
- Is ChatGPT the Ultimate Programming Assistant -- How far is it? [11.943927095071105]
ChatGPT has received great attention: it can be used as a bot for discussing source code.
We present an empirical study of ChatGPT's potential as a fully automated programming assistant.
arXiv Detail & Related papers (2023-04-24T09:20:13Z)
- ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP).
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Our extensive experimental results demonstrate that ChatGPT performs worse than previous models across different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z)
- ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking about [15.19126287569545]
This research examines the responses generated by ChatGPT from different Conversational QA corpora.
The study employed BERT similarity scores to compare these responses with correct answers and to obtain Natural Language Inference (NLI) labels; a sketch of such similarity scoring appears after this entry.
The study identified instances where ChatGPT gave incorrect answers, offering insight into areas where the model may be prone to error.
arXiv Detail & Related papers (2023-04-06T18:42:47Z)
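The BERT similarity scoring mentioned in the entry above can be approximated with sentence embeddings and cosine similarity. The snippet below is a sketch assuming the sentence-transformers package and a generic encoder; the paper's exact model and its NLI-labeling step may differ.

```python
# Illustrative sketch of BERT-based similarity between a model response
# and a reference answer. The package, model name, and example strings
# are assumptions, not the study's actual configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic BERT-style encoder

reference = "The Eiffel Tower is located in Paris, France."
response = "It stands in Paris, the capital of France."

# Encode both texts and take the cosine similarity of their embeddings;
# higher values indicate closer agreement with the reference answer.
emb = model.encode([reference, response], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()

print(f"similarity={similarity:.3f}")
```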
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
- A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)