Whodunit: Classifying Code as Human Authored or GPT-4 Generated -- A case study on CodeChef problems
- URL: http://arxiv.org/abs/2403.04013v1
- Date: Wed, 6 Mar 2024 19:51:26 GMT
- Title: Whodunit: Classifying Code as Human Authored or GPT-4 Generated -- A case study on CodeChef problems
- Authors: Oseremen Joy Idialu, Noble Saji Mathews, Rungroj Maipradit, Joanne M. Atlee, Mei Nagappan
- Abstract summary: We use code stylometry and machine learning to distinguish between GPT-4 generated and human-authored code.
Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4.
Our study shows that code stylometry is a promising approach for distinguishing between GPT-4 generated code and human-authored code.
- Score: 0.13124513975412253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence (AI) assistants such as GitHub Copilot and ChatGPT,
built on large language models like GPT-4, are revolutionizing how programming
tasks are performed, raising questions about whether code is authored by
generative AI models. Such questions are of particular interest to educators,
who worry that these tools enable a new form of academic dishonesty, in which
students submit AI-generated code as their own work. Our research explores the
viability of using code stylometry and machine learning to distinguish between
GPT-4 generated and human-authored code. Our dataset comprises human-authored
solutions from CodeChef and AI-authored solutions generated by GPT-4. Our
classifier outperforms baselines, with an F1-score and AUC-ROC score of 0.91. A
variant of our classifier that excludes gameable features (e.g., empty lines,
whitespace) still performs well with an F1-score and AUC-ROC score of 0.89. We
also evaluated our classifier with respect to the difficulty of the programming
problem and found that there was almost no difference between easier and
intermediate problems, and the classifier performed only slightly worse on
harder problems. Our study shows that code stylometry is a promising approach
for distinguishing between GPT-4 generated code and human-authored code.
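The abstract does not spell out the exact feature set or model, so the following is a minimal illustrative sketch, in Python with scikit-learn, of how a stylometry-based classifier along these lines could be set up: it extracts a few layout and lexical features (including the "gameable" empty-line and whitespace features the abstract mentions), trains a random-forest classifier on human- versus GPT-4-authored solutions, and reports F1 and AUC-ROC. The feature definitions, the choice of random forest, and the `human_solutions`/`gpt4_solutions` inputs are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the paper's actual features and model are not
# detailed in this summary, so the feature set and classifier choice here
# are assumptions, not the authors' implementation.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def stylometric_features(code: str, include_gameable: bool = True) -> list[float]:
    """Extract simple layout/lexical features from a source-code string."""
    lines = code.splitlines() or [""]
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    features = [
        float(np.mean([len(l) for l in lines])),                    # mean line length
        float(np.mean([len(t) for t in tokens])) if tokens else 0.0,  # mean token length
        code.count("#") / len(lines),                               # comment density (Python-style)
    ]
    if include_gameable:
        features += [
            sum(1 for l in lines if not l.strip()) / len(lines),    # empty-line ratio
            sum(l.count(" ") for l in lines) / max(len(code), 1),   # whitespace ratio
        ]
    return features

def train_and_evaluate(human_solutions, gpt4_solutions, include_gameable=True):
    """Train on human (label 0) vs GPT-4 (label 1) solutions; return F1 and AUC-ROC."""
    X = [stylometric_features(c, include_gameable)
         for c in human_solutions + gpt4_solutions]
    y = [0] * len(human_solutions) + [1] * len(gpt4_solutions)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    prob = clf.predict_proba(X_te)[:, 1]
    return f1_score(y_te, pred), roc_auc_score(y_te, prob)
```

Setting `include_gameable=False` mirrors the paper's variant that drops easily gameable layout features (empty lines, whitespace) before training.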
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
- An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? [8.0988059417354]
We propose a range of approaches to improve the performance of AI-generated code detection.
Our best model outperforms the state-of-the-art AI-generated code detector (GPTSniffer) and achieves an F1 score of 82.55.
arXiv Detail & Related papers (2024-11-06T22:48:18Z)
- Genetic Auto-prompt Learning for Pre-trained Code Intelligence Language Models [54.58108387797138]
We investigate the effectiveness of prompt learning in code intelligence tasks.
Existing automatic prompt design methods are of limited use for code intelligence tasks.
We propose Genetic Auto Prompt (GenAP) which utilizes an elaborate genetic algorithm to automatically design prompts.
arXiv Detail & Related papers (2024-03-20T13:37:00Z)
- Enhancing Programming Error Messages in Real Time with Generative AI [0.0]
We implement feedback from ChatGPT for all programs submitted to our automated assessment tool, Athene.
Our results indicate that adding generative AI to an automated assessment tool does not necessarily make it better.
arXiv Detail & Related papers (2024-02-12T21:32:05Z)
- Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education [8.592066814291819]
We present an empirical study where the LLM is examined for its attempts to bypass detection by AIGC Detectors.
This is achieved by generating code in response to a given question using different variants.
Our results demonstrate that existing AIGC Detectors perform poorly in distinguishing between human-written code and AI-generated code.
arXiv Detail & Related papers (2024-01-08T05:53:52Z)
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now faces an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z)
- Is this Snippet Written by ChatGPT? An Empirical Study with a CodeBERT-Based Classifier [13.613735709997911]
This paper presents an empirical study to investigate the feasibility of automated identification of AI-generated code snippets.
We propose a novel approach called GPTSniffer, which builds on top of CodeBERT to detect source code written by AI.
The results show that GPTSniffer can accurately classify whether code is human-written or AI-generated, and outperforms two baselines.
arXiv Detail & Related papers (2023-07-18T16:01:15Z)
- AI-assisted coding: Experiments with GPT-4 [0.22366638308792727]
GPT-4 can generate tests with substantial coverage, but many of the tests fail when applied to the associated code.
These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
arXiv Detail & Related papers (2023-04-25T22:59:01Z)
- Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions [54.55334589363247]
We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits.
arXiv Detail & Related papers (2023-02-14T18:43:34Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
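As context for the APPS entry above, which scores models by the fraction of hidden test cases a generated solution passes, the sketch below shows one way such a pass rate could be computed: each candidate program is run against stored input/output pairs and scored by the fraction of cases it passes. The helper names and the (stdin, expected stdout) test-case format are assumptions for illustration; the official APPS evaluation harness is more involved.

```python
# Hypothetical sketch of an APPS-style pass-rate metric; the real APPS harness
# runs submissions against stored I/O test cases and is more involved.
import subprocess
import sys

def run_solution(solution_code: str, stdin_text: str, timeout: float = 4.0) -> str:
    """Run a candidate solution as a script and capture its stdout."""
    result = subprocess.run([sys.executable, "-c", solution_code],
                            input=stdin_text, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout.strip()

def pass_rate(solution_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of (input, expected_output) test cases the solution passes."""
    passed = 0
    for stdin_text, expected in test_cases:
        try:
            if run_solution(solution_code, stdin_text) == expected.strip():
                passed += 1
        except Exception:  # runtime error or timeout counts as a failure
            pass
    return passed / len(test_cases) if test_cases else 0.0
```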
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.