AI-assisted coding: Experiments with GPT-4
- URL: http://arxiv.org/abs/2304.13187v1
- Date: Tue, 25 Apr 2023 22:59:01 GMT
- Title: AI-assisted coding: Experiments with GPT-4
- Authors: Russell A Poldrack, Thomas Lu, and Gašper Beguš
- Abstract summary: GPT-4 can generate tests with substantial coverage, but many of the tests fail when applied to the associated code.
These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
- Score: 0.22366638308792727
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence (AI) tools based on large language models have
achieved human-level performance on some computer programming tasks. We report
several experiments using GPT-4 to generate computer code. These experiments
demonstrate that AI code generation using the current generation of tools,
while powerful, requires substantial human validation to ensure accurate
performance. We also demonstrate that GPT-4 refactoring of existing code can
significantly improve that code along several established metrics for code
quality, and we show that GPT-4 can generate tests with substantial coverage,
but that many of the tests fail when applied to the associated code. These
findings suggest that while AI coding tools are very powerful, they still
require humans in the loop to ensure validity and accuracy of the results.
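The abstract does not include the experiment scripts; as a rough illustration of the test-generation experiment it describes, the sketch below asks GPT-4 to write pytest tests for a small function and then runs them to see how many pass. It assumes the `openai` v1 Python client and `pytest` are installed and an API key is configured; the prompt, model name, and example function are illustrative, not the paper's actual setup.

```python
# Minimal sketch of the test-generation experiment (not the paper's actual
# pipeline): ask GPT-4 to write pytest tests for a small function, save them,
# and run pytest to see how many pass. Assumes the `openai` v1 Python client
# and `pytest` are installed and OPENAI_API_KEY is set in the environment.
import subprocess
from pathlib import Path

from openai import OpenAI

FUNCTION_UNDER_TEST = '''\
def count_vowels(text: str) -> int:
    """Return the number of vowels in `text`."""
    return sum(ch in "aeiouAEIOU" for ch in text)
'''

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{
        "role": "user",
        "content": "Write pytest tests for this function. Return only Python code.\n\n"
                   + FUNCTION_UNDER_TEST,
    }],
)
# In practice the reply may be wrapped in markdown fences that need stripping.
generated_tests = response.choices[0].message.content

Path("vowels.py").write_text(FUNCTION_UNDER_TEST)
Path("test_vowels.py").write_text("from vowels import count_vowels\n\n" + generated_tests)
result = subprocess.run(["pytest", "test_vowels.py", "-q"], capture_output=True, text=True)
print(result.stdout)  # inspect how many generated tests pass or fail
```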
Related papers
- An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? [8.0988059417354]
We propose a range of approaches to improve the performance of AI-generated code detection.
Our best model outperforms state-of-the-art AI-generated code detector (GPTSniffer) and achieves an F1 score of 82.55.
arXiv Detail & Related papers (2024-11-06T22:48:18Z) - Disrupting Test Development with AI Assistants [1.024113475677323]
Generative AI-assisted coding tools like GitHub Copilot, ChatGPT, and Tabnine have significantly transformed software development.
This paper analyzes how these innovations impact productivity and software test development metrics.
arXiv Detail & Related papers (2024-11-04T17:52:40Z) - Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data [49.1574468325115]
ChatGPT is an AI tool that enhances software production efficiency.
We estimate ChatGPT's effects on the number of git pushes, repositories, and unique developers per 100,000 people.
These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low quality code and privacy concerns.
arXiv Detail & Related papers (2024-06-16T19:11:15Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Whodunit: Classifying Code as Human Authored or GPT-4 Generated -- A case study on CodeChef problems [0.13124513975412253]
We use code stylometry and machine learning to distinguish between GPT-4 generated and human-authored code.
Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4.
Our study shows that code stylometry is a promising approach for distinguishing between GPT-4 generated code and human-authored code.
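The summary does not list the paper's actual stylometric features; the sketch below shows the general recipe with a few hypothetical surface features (mean line length, comment ratio, identifier length, blank-line density), toy data, and a scikit-learn classifier.

```python
# Illustrative code-stylometry classifier (hypothetical surface features and
# toy data, not the paper's feature set or corpus). Assumes scikit-learn.
import re

from sklearn.ensemble import RandomForestClassifier


def stylometric_features(source: str) -> list[float]:
    """Extract a few simple surface-level style features from source code."""
    lines = source.splitlines() or [""]
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    mean_ident = sum(map(len, identifiers)) / len(identifiers) if identifiers else 0.0
    return [
        sum(len(line) for line in lines) / len(lines),                      # mean line length
        sum(line.strip().startswith("#") for line in lines) / len(lines),   # comment ratio
        mean_ident,                                                         # mean identifier length
        source.count("\n\n") / len(lines),                                  # blank-line density
    ]


# Toy corpus: label 1 = AI-generated, 0 = human-authored.
samples = [
    "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n    return a + b\n",
    "# quick hack\nfor i in range(10):\n    print(i*i)\n",
]
labels = [1, 0]

X = [stylometric_features(s) for s in samples]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)  # with a real corpus, evaluate with cross-validation / F1
print(clf.predict(X))
```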
arXiv Detail & Related papers (2024-03-06T19:51:26Z) - OpenAi's GPT4 as coding assistant [0.0]
GPT4 is considered the most potent Large Language Model from OpenAI.
In this paper, we examine GPT3.5 and GPT4 as coding assistants.
arXiv Detail & Related papers (2023-09-22T09:31:39Z) - Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions [54.55334589363247]
We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits.
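The paper's edit-likelihood predictor is a separately trained model and is not reproduced here; as a crude proxy for the same idea, the sketch below flags completion tokens to which the model itself assigned low probability, using logprobs from the `openai` v1 client. The threshold and model name are hypothetical.

```python
# Crude proxy for uncertainty highlighting (NOT the paper's trained
# edit-likelihood model): flag completion tokens the model assigned low
# probability, using logprobs from the `openai` v1 client.
import math

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name; must support logprobs
    logprobs=True,
    messages=[{"role": "user", "content": "Complete this Python function:\ndef fibonacci(n):"}],
)

THRESHOLD = 0.5  # hypothetical probability cutoff for flagging a token
for token_info in response.choices[0].logprobs.content:
    prob = math.exp(token_info.logprob)
    flag = "  <-- uncertain" if prob < THRESHOLD else ""
    print(f"{token_info.token!r:>16}  p={prob:.2f}{flag}")
```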
arXiv Detail & Related papers (2023-02-14T18:43:34Z) - Aligning Offline Metrics and Human Judgments of Value for Code Generation Models [25.726216146776054]
We show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task.
We propose a hybrid metric that combines functional correctness and syntactic similarity and show that it achieves a 14% stronger correlation with value.
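A minimal sketch of such a hybrid metric is shown below, blending functional correctness with syntactic similarity to a reference solution; the difflib similarity measure and the 0.5 weight are illustrative stand-ins, not the paper's exact formulation.

```python
# Sketch of a hybrid code-value metric: blend functional correctness (did the
# candidate pass its unit tests?) with syntactic similarity to a reference
# solution. The difflib similarity and the 0.5 weight are illustrative.
import difflib


def hybrid_score(candidate: str, reference: str, passed_tests: bool,
                 weight: float = 0.5) -> float:
    """Weighted blend of functional correctness and surface similarity."""
    similarity = difflib.SequenceMatcher(None, candidate, reference).ratio()
    return weight * float(passed_tests) + (1.0 - weight) * similarity


reference = "def add(a, b):\n    return a + b\n"
candidate = "def add(x, y):\n    result = x + y\n    return result\n"
print(hybrid_score(candidate, reference, passed_tests=False))  # partial credit despite failing tests
```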
arXiv Detail & Related papers (2022-10-29T05:03:28Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
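The sketch below shows only the basic execute-and-select idea behind this approach: run each candidate solution against model-generated test assertions and keep the candidate that satisfies the most of them. CodeT's actual dual execution agreement scoring is more elaborate, and the toy candidates and tests here are hypothetical.

```python
# Simplified sketch of execute-and-select (not CodeT's full dual execution
# agreement): run each candidate against generated test assertions and keep
# the one that passes the most. exec() here is unsandboxed; real harnesses
# must isolate untrusted code.
candidates = [
    "def is_even(n):\n    return n % 2 == 0\n",
    "def is_even(n):\n    return n % 2 == 1\n",  # buggy candidate
]
generated_tests = [
    "assert is_even(2)",
    "assert not is_even(3)",
    "assert is_even(0)",
]


def passed_count(solution: str, tests: list[str]) -> int:
    """Count how many generated test assertions a candidate solution satisfies."""
    score = 0
    for test in tests:
        try:
            exec(solution + "\n" + test, {})  # untrusted code: sandbox in practice
            score += 1
        except Exception:
            pass
    return score


best = max(candidates, key=lambda s: passed_count(s, generated_tests))
print(best)
```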
arXiv Detail & Related papers (2022-07-21T10:18:37Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
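An illustrative harness for this style of evaluation is sketched below: run a candidate program on each test input, compare its stdout with the expected output, and report the fraction of test cases passed. This is not the official APPS evaluation code, and the solution and test cases are toy examples.

```python
# Illustrative harness for APPS-style evaluation (not the official APPS code):
# run a candidate program on each test input, compare stdout to the expected
# output, and report the fraction of test cases passed.
import subprocess
import sys

solution = "print(sum(int(x) for x in input().split()))\n"
test_cases = [("1 2 3", "6"), ("10 20", "30"), ("5", "5")]

passed = 0
for stdin_data, expected in test_cases:
    run = subprocess.run(
        [sys.executable, "-c", solution],
        input=stdin_data, capture_output=True, text=True, timeout=5,
    )
    if run.stdout.strip() == expected.strip():
        passed += 1

print(f"passed {passed}/{len(test_cases)} test cases "
      f"({100 * passed / len(test_cases):.0f}%)")
```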
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.