Discriminating Human-authored from ChatGPT-Generated Code Via
Discernable Feature Analysis
- URL: http://arxiv.org/abs/2306.14397v2
- Date: Tue, 4 Jul 2023 09:23:08 GMT
- Title: Discriminating Human-authored from ChatGPT-Generated Code Via
Discernable Feature Analysis
- Authors: Li Ke, Hong Sheng, Fu Cai, Zhang Yunhe and Liu Ming
- Abstract summary: This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans.
We devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets.
To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code.
- Score: 2.9398911304923447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ubiquitous adoption of Large Language Generation Models (LLMs) in
programming has underscored the importance of differentiating between
human-written code and code generated by intelligent models. This paper
specifically aims to distinguish code generated by ChatGPT from that authored
by humans. Our investigation reveals disparities in programming style,
technical level, and readability between these two sources. Consequently, we
develop a discriminative feature set for differentiation and evaluate its
efficacy through ablation experiments. Additionally, we devise a dataset
cleansing technique, which employs temporal and spatial segmentation, to
mitigate the dearth of datasets and to secure high-caliber, uncontaminated
datasets. To further enrich data resources, we employ "code transformation,"
"feature transformation," and "feature customization" techniques, generating an
extensive dataset comprising 10,000 lines of ChatGPT-generated code. The
salient contributions of our research include: proposing a discriminative
feature set yielding high accuracy in differentiating ChatGPT-generated code
from human-authored code in binary classification tasks; devising methods for
generating extensive ChatGPT-generated codes; and introducing a dataset
cleansing strategy that extracts immaculate, high-grade code datasets from
open-source repositories, thus achieving exceptional accuracy in code
authorship attribution tasks.
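The abstract describes a feature set built from differences in programming style and readability but does not enumerate the features. As a rough illustration only, the sketch below computes three simple stylistic measurements from a Python source string. The feature names (`comment_ratio`, `blank_ratio`, `mean_line_length`) and the `extract_features` helper are hypothetical choices made for this example and are not taken from the paper; a real detector would feed a much richer vector into a trained binary classifier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StyleFeatures:
    """Hypothetical stylistic features; not the paper's actual feature set."""
    comment_ratio: float     # fraction of non-blank lines that are '#' comments
    blank_ratio: float       # fraction of all lines that are blank
    mean_line_length: float  # average character length of non-blank lines

    def as_vector(self) -> List[float]:
        # Flat numeric form, suitable as input to any binary classifier.
        return [self.comment_ratio, self.blank_ratio, self.mean_line_length]


def extract_features(source: str) -> StyleFeatures:
    """Compute simple layout and commenting statistics for one code sample."""
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    comments = [ln for ln in non_blank if ln.lstrip().startswith("#")]

    # Guard against division by zero for empty inputs.
    total = len(lines) or 1
    code_total = len(non_blank) or 1

    return StyleFeatures(
        comment_ratio=len(comments) / code_total,
        blank_ratio=(len(lines) - len(non_blank)) / total,
        mean_line_length=sum(len(ln) for ln in non_blank) / code_total,
    )


if __name__ == "__main__":
    sample = "# add two numbers\n\ndef add(a, b):\n    return a + b\n"
    print(extract_features(sample).as_vector())
```

Vectors like these, labelled by origin (human or ChatGPT), could then be used to train and ablate a classifier, which is the general shape of the evaluation the abstract reports.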
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- Distinguishing LLM-generated from Human-written Code by Contrastive Learning [5.553326595990857]
Large language models (LLMs) have attracted significant attention due to their demonstrated ability to generate high-quality content for various tasks.
There are growing concerns regarding their potential risks in various fields, such as news, education, and software engineering.
This paper proposes a novel ChatGPT-generated code detector, CodeGPTSensor, based on a contrastive learning framework and a semantic encoder built with UniXcoder.
arXiv Detail & Related papers (2024-11-07T13:39:14Z)
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search [47.54163552754051]
Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries.
Recently, large language models (LLMs) have made remarkable progress in both natural and programming language understanding and generation.
We propose a novel approach ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model.
arXiv Detail & Related papers (2024-08-10T12:51:21Z)
- Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z)
- Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation [2.0400340435492272]
This paper presents a comprehensive evaluation of the code generation capabilities of ChatGPT, a prominent large language model.
A dataset of 131 code-generation prompts across 5 categories was curated to enable robust analysis.
Code solutions were generated by both ChatGPT and humans for all prompts, resulting in 262 code samples.
arXiv Detail & Related papers (2023-11-05T12:56:40Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.