Evaluation of the Code Generation Capabilities of ChatGPT 4: A Comparative Analysis in 19 Programming Languages
- URL: http://arxiv.org/abs/2501.02338v1
- Date: Sat, 04 Jan 2025 17:17:01 GMT
- Title: Evaluation of the Code Generation Capabilities of ChatGPT 4: A Comparative Analysis in 19 Programming Languages
- Authors: L. C. Gilbert,
- Abstract summary: This thesis examines the capabilities of ChatGPT 4 in code generation across 19 programming languages.
ChatGPT 4 successfully solved 39.67% of all tasks, with success rates decreasing significantly as problem complexity increased.
The model exhibited above-average runtime efficiency in all programming languages.
- Score: 0.0
- License:
- Abstract: This bachelor's thesis examines the capabilities of ChatGPT 4 in code generation across 19 programming languages. The study analyzed solution rates across three difficulty levels, types of errors encountered, and code quality in terms of runtime and memory efficiency through a quantitative experiment. A total of 188 programming problems were selected from the LeetCode platform, and ChatGPT 4 was given three attempts to produce a correct solution with feedback. ChatGPT 4 successfully solved 39.67% of all tasks, with success rates decreasing significantly as problem complexity increased. Notably, the model faced considerable challenges with hard problems across all languages. ChatGPT 4 demonstrated higher competence in widely used languages, likely due to a larger volume and higher quality of training data. The solution rates also revealed a preference for languages with low abstraction levels and static typing. For popular languages, the most frequent error was "Wrong Answer," whereas for less popular languages, compiler and runtime errors prevailed, suggesting frequent misunderstandings and confusion regarding the structural characteristics of these languages. The model exhibited above-average runtime efficiency in all programming languages, showing a tendency toward statically typed and low-abstraction languages. Memory efficiency results varied significantly, with above-average performance in 14 languages and below-average performance in five languages. A slight preference for low-abstraction languages and a leaning toward dynamically typed languages in terms of memory efficiency were observed. Future research should include a larger number of tasks, iterations, and less popular languages. Additionally, ChatGPT 4's abilities in code interpretation and summarization, debugging, and the development of complex, practical code could be analyzed further.
Related papers
- Vulnerability Detection in Popular Programming Languages with Language Models [2.048226951354646]
This paper investigates the effectiveness of Language Models (LMs) for vulnerability detection in JavaScript, Java, Python, PHP, and Go.
We find that JavaScript exhibits the best performance, with considerably better and more practical detection capabilities compared to C/C++.
arXiv Detail & Related papers (2024-12-20T13:57:27Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
The LingOly benchmark is a novel benchmark for advanced reasoning abilities in large language models.
We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages.
We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z) - The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there is not conclusive.
arXiv Detail & Related papers (2024-04-11T17:58:05Z) - Learning Transfers over Several Programming Languages [5.350495525141013]
Cross-lingual transfer uses data from a source language to improve model performance on a target language.
This paper reports extensive experiments on four tasks using a transformer-based large language model and 11 to 41 programming languages.
We find that learning transfers well across several programming languages.
arXiv Detail & Related papers (2023-10-25T19:04:33Z) - A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z) - ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large
Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as the most important breakthroughs in natural language processing (NLP)
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Compared to the performance of previous models, our extensive experimental results demonstrate a worse performance of ChatGPT for different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z) - Measuring The Impact Of Programming Language Distribution [28.96076723773365]
We present the BabelCode framework for execution-based evaluation of any benchmark in any language.
We present a new code translation dataset called Translating Python Programming Puzzles (TP3)
We investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages.
arXiv Detail & Related papers (2023-02-03T19:47:22Z) - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z) - X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.