Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models
- URL: http://arxiv.org/abs/2312.07200v1
- Date: Tue, 12 Dec 2023 12:07:54 GMT
- Title: Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models
- Authors: Sheng Zhang, Hui Li
- Abstract summary: This paper launches the first study of detecting unauthorized code use in CPLMs.
We design a framework, Buzzer, for different settings of the Code Membership Inference task.
- Score: 7.6875396255520405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code pre-trained language models (CPLMs) have received great attention since
they can benefit various tasks that facilitate software development and
maintenance. However, CPLMs are trained on massive open-source code, raising
concerns about potential data infringement. This paper launches the first study
of detecting unauthorized code use in CPLMs, i.e., the Code Membership Inference
(CMI) task. We design a framework, Buzzer, for different settings of CMI. Buzzer
deploys several inference techniques, including distilling the target CPLM,
ensemble inference, and unimodal and bimodal calibration. Extensive experiments
show that CMI can be achieved with high accuracy using Buzzer. Hence, Buzzer
can serve as a CMI tool and help protect intellectual property rights.
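The abstract names calibration and ensemble inference but does not spell them out, so the following is only a generic sketch of a calibrated membership score and not Buzzer's actual algorithm. It assumes per-sample losses are already available from the target model and from one or more reference models that did not see the sample. The function names, the scoring rule, the threshold, and the loss values are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def calibrated_score(target_loss: float, reference_losses: Sequence[float]) -> float:
    """Return a membership score; higher means more likely a training member.

    Hypothetical rule: subtract the target model's loss from the average loss
    of reference models assumed NOT to have been trained on the sample.
    Averaging over several reference models is a simple form of ensembling.
    """
    return mean(reference_losses) - target_loss


def infer_membership(
    target_loss: float,
    reference_losses: Sequence[float],
    threshold: float = 0.5,  # assumed value; in practice tuned on held-out data
) -> bool:
    """Predict 'member' when the calibrated score exceeds the threshold."""
    return calibrated_score(target_loss, reference_losses) > threshold


# Toy, made-up loss values for illustration only.
memorized = infer_membership(target_loss=0.2, reference_losses=[1.4, 1.6])
unseen = infer_membership(target_loss=1.5, reference_losses=[1.4, 1.6])
```

The intuition behind calibrating against reference models is that a low loss alone may only mean the snippet is easy to predict. A loss that is low relative to models that never saw the snippet is a stronger hint of membership.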
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such as CodeBERT are easy to fine-tune, but they often struggle to learn vulnerability semantics from complex code languages.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z)
- AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
- Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs.
Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code.
Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code.
arXiv Detail & Related papers (2024-05-22T19:02:50Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code [13.135962181354465]
Code auditing ensures that developed code adheres to standards, regulations, and copyright protection.
The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing.
We propose TraWiC, a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
arXiv Detail & Related papers (2024-02-14T16:41:35Z)
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
- Differentially Private Decoding in Large Language Models [14.221692239892207]
We propose a simple, easy to interpret, and computationally lightweight perturbation mechanism to be applied to an already trained model at the decoding stage.
Our perturbation mechanism is model-agnostic and can be used in conjunction with any Large Language Model.
arXiv Detail & Related papers (2022-05-26T20:50:58Z)