Trained Without My Consent: Detecting Code Inclusion In Language Models
Trained on Code
- URL: http://arxiv.org/abs/2402.09299v1
- Date: Wed, 14 Feb 2024 16:41:35 GMT
- Title: Trained Without My Consent: Detecting Code Inclusion In Language Models
Trained on Code
- Authors: Vahid Majdinasab, Amin Nikanjam, Foutse Khomh
- Abstract summary: Code auditing ensures that developed code adheres to standards, regulations, and copyright protection.
The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing.
We propose TraWiC; a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
- Score: 14.763505073094779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code auditing ensures that the developed code adheres to standards,
regulations, and copyright protection by verifying that it does not contain
code from protected sources. The recent advent of Large Language Models (LLMs)
as coding assistants in the software development process poses new challenges
for code auditing. The dataset for training these models is mainly collected
from publicly available sources. This raises the issue of intellectual property
infringement as developers' codes are already included in the dataset.
Therefore, auditing code developed using LLMs is challenging, as it is
difficult to reliably assert if an LLM used during development has been trained
on specific copyrighted codes, given that we do not have access to the training
datasets of these models. Given the non-disclosure of the training datasets,
traditional approaches such as code clone detection are insufficient for
asserting copyright infringement. To address this challenge, we propose a new
approach, TraWiC; a model-agnostic and interpretable method based on membership
inference for detecting code inclusion in an LLM's training dataset. We extract
syntactic and semantic identifiers unique to each program to train a classifier
for detecting code inclusion. In our experiments, we observe that TraWiC is
capable of detecting 83.87% of codes that were used to train an LLM. In
comparison, the prevalent clone detection tool NiCad is only capable of
detecting 47.64%. In addition to its remarkable performance, TraWiC has low
resource overhead in contrast to pair-wise clone detection that is conducted
during the auditing process of tools like CodeWhisperer reference tracker,
across thousands of code snippets.
Related papers
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
We introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions.
We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE)
Comprehensive experiments are conducted to benchmark the performance of LLMs, revealing the challenging nature of these tasks and VersiCode.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs.
Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code.
Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code.
arXiv Detail & Related papers (2024-05-22T19:02:50Z) - Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [69.38352966504401]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - Code Membership Inference for Detecting Unauthorized Data Use in Code
Pre-trained Language Models [7.6875396255520405]
This paper launches the first study of detecting unauthorized code use in CPLMs.
We design a framework Buzzer for different settings of Code Membership Inference task.
arXiv Detail & Related papers (2023-12-12T12:07:54Z) - Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for
Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach to intruduce semantic information of code, named SeCoT.
We show that SeCoT can achieves state-of-the-art performance, greatly improving the potential for large models and code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java codes.
arXiv Detail & Related papers (2023-10-08T10:08:21Z) - CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.