Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models
- URL: http://arxiv.org/abs/2312.07200v1
- Date: Tue, 12 Dec 2023 12:07:54 GMT
- Title: Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models
- Authors: Sheng Zhang, Hui Li
- Abstract summary: This paper launches the first study of detecting unauthorized code use in CPLMs.
We design a framework Buzzer for different settings of Code Membership Inference task.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code pre-trained language models (CPLMs) have received great attention since
they can benefit various tasks that facilitate software development and
maintenance. However, CPLMs are trained on massive open-source code, raising
concerns about potential data infringement. This paper launches the first study
of detecting unauthorized code use in CPLMs, i.e., Code Membership Inference
(CMI) task. We design a framework Buzzer for different settings of CMI. Buzzer
deploys several inference techniques, including distilling the target CPLM,
ensemble inference, and unimodal and bimodal calibration. Extensive experiments
show that CMI can be achieved with high accuracy using Buzzer. Hence, Buzzer
can serve as a CMI tool and help protect intellectual property rights.
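The abstract above describes score-based membership inference with reference-model calibration. As a hedged illustration only (not Buzzer's actual pipeline), the sketch below uses toy stand-in "models" that return per-token losses: a snippet whose calibrated loss is much lower on the target model than on a reference model is flagged as a likely training member. The function names, threshold, and loss values are all assumptions for the example.

```python
# Illustrative sketch: loss-based code membership inference with
# reference-model calibration. The "models" are toy stand-ins, not real CPLMs.
import numpy as np

TRAIN_SET = {"def add(a, b): return a + b"}  # hypothetical training corpus

def target_loss(snippet, rng):
    # Toy target CPLM: snippets seen in training get a lower base loss.
    base = 0.5 if snippet in TRAIN_SET else 2.0
    return base + 0.1 * rng.standard_normal(len(snippet.split()))

def reference_loss(snippet, rng):
    # Toy reference model that never saw TRAIN_SET: uniform difficulty.
    return 2.0 + 0.1 * rng.standard_normal(len(snippet.split()))

def membership_score(snippet, rng):
    # Calibrated score: target loss minus reference loss. Strongly negative
    # values suggest the target model memorized the snippet.
    return float(np.mean(target_loss(snippet, rng))
                 - np.mean(reference_loss(snippet, rng)))

def is_member(snippet, rng, threshold=-0.75):
    return membership_score(snippet, rng) < threshold

rng = np.random.default_rng(0)
print(is_member("def add(a, b): return a + b", rng))  # member snippet
print(is_member("def mul(a, b): return a * b", rng))  # non-member snippet
```

Calibration is the key step: subtracting the reference model's loss removes the effect of a snippet simply being "easy", so only abnormally low target-model loss signals membership.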
Related papers
- M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such as CodeBERT are easy to fine-tune, but they often struggle to learn vulnerability semantics from complex code.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z)
- AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
- Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs.
Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code.
Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code.
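The self-refinement loop summarized above can be sketched as follows. This is a hedged toy version, not the authors' pipeline: `call_llm` is a hypothetical stand-in that deterministically "fixes" an off-by-one bug when a loop-bound verification question appears in the prompt, and the VQ generator targets only that one construct.

```python
# Hedged sketch of refinement via targeted Verification Questions (VQs):
# generate code, ask pointed questions about suspect fragments, and
# re-prompt with the VQs plus the initial code.
def call_llm(prompt):
    # Toy stand-in for an LLM: returns fixed code only when asked about
    # the loop bound; otherwise emits an off-by-one version.
    if "range(len(xs) - 1)" in prompt and "loop bound" in prompt:
        return ("def last(xs):\n    for i in range(len(xs)):\n"
                "        y = xs[i]\n    return y")
    return ("def last(xs):\n    for i in range(len(xs) - 1):\n"
            "        y = xs[i]\n    return y")

def verification_questions(code):
    # Target likely-buggy constructs with pointed questions.
    vqs = []
    if "range(len" in code and "- 1" in code:
        vqs.append("Is the loop bound correct, or does it skip the last element?")
    return vqs

def refine(task, max_rounds=2):
    code = call_llm(task)
    for _ in range(max_rounds):
        vqs = verification_questions(code)
        if not vqs:
            break
        # Re-prompt with the initial code and the targeted VQs.
        code = call_llm(task + "\n" + code + "\nVQ (loop bound): " + " ".join(vqs))
    return code

print(refine("Write last(xs) returning the final element."))
```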
arXiv Detail & Related papers (2024-05-22T19:02:50Z)
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [69.38352966504401]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- Zero-Shot Code Representation Learning via Prompt Tuning [6.40875582886359]
We propose Zecoler, a zero-shot approach for learning code representations.
Zecoler is built upon a pre-trained programming language model.
We evaluate Zecoler in five code intelligence tasks including code clone detection, code search, method name prediction, code summarization, and code generation.
arXiv Detail & Related papers (2024-04-13T09:47:07Z)
- A Survey on Knowledge Distillation of Large Language Models [102.84645991075283]
Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities to open-source models.
This paper presents a comprehensive survey of KD's role within the realm of Large Language Models (LLMs)
arXiv Detail & Related papers (2024-02-20T16:17:37Z)
- Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code [14.763505073094779]
Code auditing ensures that developed code adheres to standards, regulations, and copyright protection.
The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing.
We propose TraWiC; a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
arXiv Detail & Related papers (2024-02-14T16:41:35Z)
- How to get better embeddings with code pre-trained models? An empirical study [6.220333404184779]
We study five different code pre-trained models (PTMs) to generate embeddings for downstream classification tasks.
We find that embeddings obtained through special tokens do not sufficiently aggregate the semantic information of the entire code snippet.
Code embeddings obtained by combining code and text data in the same way as during PTM pre-training are of poor quality and do not guarantee richer semantic information.
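The special-token finding above can be illustrated with a minimal sketch. The hidden states below are random stand-ins for a real code PTM's output, not actual model activations: an embedding read off a single special token ([CLS]-style) ignores changes elsewhere in the snippet, while mean pooling aggregates every token.

```python
# Minimal sketch: special-token embedding vs. mean pooling over a snippet.
# Hidden states are random placeholders for a code PTM's last layer.
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 8, 16
hidden_states = rng.standard_normal((seq_len, hidden))

cls_embedding = hidden_states[0]             # special-token embedding only
mean_embedding = hidden_states.mean(axis=0)  # aggregates every token

# Perturbing a non-special token changes the mean-pooled embedding
# but leaves the special-token embedding untouched.
perturbed = hidden_states.copy()
perturbed[5] += 1.0
print(np.allclose(cls_embedding, perturbed[0]))             # True
print(np.allclose(mean_embedding, perturbed.mean(axis=0)))  # False
```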
arXiv Detail & Related papers (2023-11-14T10:44:21Z)
- Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
- Differentially Private Decoding in Large Language Models [14.221692239892207]
We propose a simple, easy to interpret, and computationally lightweight perturbation mechanism to be applied to an already trained model at the decoding stage.
Our perturbation mechanism is model-agnostic and can be used in conjunction with any Large Language Model.
arXiv Detail & Related papers (2022-05-26T20:50:58Z)
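A decoding-stage perturbation in the spirit of the entry above can be sketched as follows. This is an illustrative assumption, not the authors' exact mechanism: calibrated Gumbel noise is added to the logits of an already trained model before picking a token, which is model-agnostic because it touches only the logit vector.

```python
# Hedged sketch: perturb logits at decoding time, leaving the trained
# model untouched. Noise mechanism and scale are illustrative only.
import numpy as np

def perturbed_decode_step(logits, scale, rng):
    # Model-agnostic: operates on any vocabulary-sized logit vector.
    noisy = logits + rng.gumbel(loc=0.0, scale=scale, size=logits.shape)
    return int(np.argmax(noisy))

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.0, 0.5])

# Tiny noise behaves like greedy decoding; large noise spreads the choice
# across tokens, trading utility for privacy.
deterministic = perturbed_decode_step(logits, scale=1e-6, rng=rng)
randomized = [perturbed_decode_step(logits, scale=5.0, rng=rng) for _ in range(200)]
print(deterministic)
print(len(set(randomized)) > 1)
```

Adding Gumbel noise with scale s and taking the argmax is equivalent to sampling from a softmax at temperature s, which is one simple way such a perturbation can smooth a model's output distribution.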
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.