Related papers: Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code

Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code

URL: http://arxiv.org/abs/2402.09299v4
Date: Wed, 30 Oct 2024 19:26:07 GMT
Title: Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code
Authors: Vahid Majdinasab, Amin Nikanjam, Foutse Khomh,
Abstract summary: Code auditing ensures that developed code adheres to standards, regulations, and copyright protection. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. We propose TraWiC; a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
Score: 13.135962181354465
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code auditing ensures that the developed code adheres to standards, regulations, and copyright protection by verifying that it does not contain code from protected sources. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. The dataset for training these models is mainly collected from publicly available sources. This raises the issue of intellectual property infringement as developers' codes are already included in the dataset. Therefore, auditing code developed using LLMs is challenging, as it is difficult to reliably assert if an LLM used during development has been trained on specific copyrighted codes, given that we do not have access to the training datasets of these models. Given the non-disclosure of the training datasets, traditional approaches such as code clone detection are insufficient for asserting copyright infringement. To address this challenge, we propose a new approach, TraWiC; a model-agnostic and interpretable method based on membership inference for detecting code inclusion in an LLM's training dataset. We extract syntactic and semantic identifiers unique to each program to train a classifier for detecting code inclusion. In our experiments, we observe that TraWiC is capable of detecting 83.87% of codes that were used to train an LLM. In comparison, the prevalent clone detection tool NiCad is only capable of detecting 47.64%. In addition to its remarkable performance, TraWiC has low resource overhead in contrast to pair-wise clone detection that is conducted during the auditing process of tools like CodeWhisperer reference tracker, across thousands of code snippets.

Related papers

UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models [41.227208230253744]
Internal Probing of LLMs for Code generation without external corpus, even unlabeled code snippets.<n>IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation.<n>We validate the proposed approach across multiple code benchmarks.
arXiv Detail & Related papers (2025-12-19T09:42:04Z)
RepoMark: A Data-Usage Auditing Framework for Code Large Language Models [16.976151053365385]
We propose a novel data marking framework RepoMark to audit the data usage of code LLMs.<n>Our method enables auditors to verify whether their code has been used in training, while ensuring semantic preservation.<n>RepoMark achieves a detection success rate over 90% on small code repositories under a strict FDR guarantee of 5%.
arXiv Detail & Related papers (2025-08-29T09:01:34Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Investigating Training Data Detection in AI Coders [24.912965851207105]
Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering.<n>These models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data.<n>To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task.
arXiv Detail & Related papers (2025-07-23T10:34:22Z)
Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning [8.571111167616165]
Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity.<n>We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection.<n>We propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet.
arXiv Detail & Related papers (2025-06-06T13:23:37Z)
Is Compression Really Linear with Code Intelligence? [60.123628177110206]
textitFormat Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably.<n>Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC)<n>Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z)
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories. evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
zsLLMCode: An Effective Approach for Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
This paper proposes a novel zero-shot approach, zsLLMCode, to generate code embeddings by using large language models (LLMs) and sentence embedding models. The results have demonstrated the effectiveness and superiority of our method over state-of-the-art unsupervised approaches.
arXiv Detail & Related papers (2024-09-23T01:03:15Z)
HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation. Recent studies highlight that many LLM-generated code contains serious security vulnerabilities. We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs. Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code. Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code.
arXiv Detail & Related papers (2024-05-22T19:02:50Z)
CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models. We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks. We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks. FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models [7.6875396255520405]
This paper launches the first study of detecting unauthorized code use in CPLMs. We design a framework Buzzer for different settings of Code Membership Inference task.
arXiv Detail & Related papers (2023-12-12T12:07:54Z)
Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes. We find that existing training-based or zero-shot text detectors are ineffective in detecting code. Our method exhibits robustness against revision attacks and generalizes well to Java codes.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.