Related papers: Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis

Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis

URL: http://arxiv.org/abs/2306.14397v2
Date: Tue, 4 Jul 2023 09:23:08 GMT
Title: Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis
Authors: Li Ke, Hong Sheng, Fu Cai, Zhang Yunhe and Liu Ming
Abstract summary: This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans. We devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets. To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code.
Score: 2.9398911304923447
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ubiquitous adoption of Large Language Generation Models (LLMs) in programming has underscored the importance of differentiating between human-written code and code generated by intelligent models. This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans. Our investigation reveals disparities in programming style, technical level, and readability between these two sources. Consequently, we develop a discriminative feature set for differentiation and evaluate its efficacy through ablation experiments. Additionally, we devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets and to secure high-caliber, uncontaminated datasets. To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code. The salient contributions of our research include: proposing a discriminative feature set yielding high accuracy in differentiating ChatGPT-generated code from human-authored code in binary classification tasks; devising methods for generating extensive ChatGPT-generated codes; and introducing a dataset cleansing strategy that extracts immaculate, high-grade code datasets from open-source repositories, thus achieving exceptional accuracy in code authorship attribution tasks.

Related papers

Synthetic Data Generation Using Large Language Models: Advances in Text and Code [0.0]
Large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains.<n>We highlight key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement.<n>We discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification.
arXiv Detail & Related papers (2025-03-18T08:34:03Z)
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories. evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
Distinguishing LLM-generated from Human-written Code by Contrastive Learning [5.553326595990857]
Large language models (LLMs) have attracted significant attention due to their demonstrated ability to generate high-quality content for various tasks. There are growing concerns regarding their potential risks in various fields, such as news, education, and software engineering. This paper proposes a novel ChatGPT-generated code detector, CodeGPTSensor, based on a contrastive learning framework and a semantic encoder built with UniXcoder.
arXiv Detail & Related papers (2024-11-07T13:39:14Z)
Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency. We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks. Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search [47.54163552754051]
Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. Recently, large language models (LLMs) have made remarkable progress in both natural and programming language understanding and generation. We propose a novel approach ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model.
arXiv Detail & Related papers (2024-08-10T12:51:21Z)
Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code. We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z)
Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation [2.0400340435492272]
This paper presents a comprehensive evaluation of the code generation capabilities of ChatGPT, a prominent large language model. A dataset of 131 code-generation prompts across 5 categories was curated to enable robust analysis. Code solutions were generated by both ChatGPT and humans for all prompts, resulting in 262 code samples.
arXiv Detail & Related papers (2023-11-05T12:56:40Z)
Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration. This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful" We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code. We conduct a human study to identify the criteria for high-quality explanatory docstring for code. We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks. We first represent both natural language query texts and programming language code snippets with the unified graph-structured data. In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.