CodeMark: Imperceptible Watermarking for Code Datasets against Neural
Code Completion Models
- URL: http://arxiv.org/abs/2308.14401v1
- Date: Mon, 28 Aug 2023 08:36:53 GMT
- Title: CodeMark: Imperceptible Watermarking for Code Datasets against Neural
Code Completion Models
- Authors: Zhensu Sun, Xiaoning Du, Fu Song, Li Li
- Abstract summary: We propose a method, named CodeMark, to embed user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models.
CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes covert against rule-breakers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code datasets are of immense value for training neural-network-based code
completion models, and companies and organizations have invested substantially
in building and processing them. Unfortunately, these datasets,
whether built for proprietary or public use, face a high risk of
unauthorized exploitation resulting from data leaks, license violations, etc.
Even worse, the "black-box" nature of neural models sets a high barrier for
external parties to audit their training datasets, which further enables such
unauthorized usage. Watermarking methods have been proposed to
deter inappropriate usage of image and natural language datasets. However,
due to domain specificity, they are not directly applicable to code datasets,
leaving the copyright protection of this emerging and important field of code
data still exposed to threats. To fill this gap, we propose a method, named
CodeMark, to embed user-defined imperceptible watermarks into code datasets to
trace their usage in training neural code completion models. CodeMark is based
on adaptive semantic-preserving transformations, which preserve the exact
functionality of the code data and keep the changes covert against
rule-breakers. We implement CodeMark in a toolkit and conduct an extensive
evaluation on code completion models. CodeMark is validated to fulfill all
desired properties of practical watermarks, including harmlessness to model
accuracy, verifiability, robustness, and imperceptibility.
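To make the idea of a semantic-preserving transformation concrete, here is a minimal sketch in Python. The abstract does not list the actual transformation rules, so the rule used here (rewriting `range(n)` as `range(0, n)`) is a hypothetical stand-in chosen for illustration, and the names `RangeStartRewriter` and `apply_rule` are likewise assumptions. The sketch also assumes that `range` refers to the builtin and requires Python 3.9 or later for `ast.unparse`.

```python
import ast


class RangeStartRewriter(ast.NodeTransformer):
    """Rewrite range(n) as range(0, n).

    Hypothetical rule for illustration only; it is not taken from the paper.
    Assumes `range` is the builtin and has not been shadowed.
    """

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # rewrite nested calls first
        is_range = isinstance(node.func, ast.Name) and node.func.id == "range"
        single_plain_arg = (
            len(node.args) == 1
            and not node.keywords
            and not isinstance(node.args[0], ast.Starred)  # skip range(*xs)
        )
        if is_range and single_plain_arg:
            # range(n) and range(0, n) yield the same sequence.
            node.args = [ast.Constant(value=0), node.args[0]]
        return node


def apply_rule(source: str) -> str:
    """Return `source` with the illustrative rule applied."""
    tree = RangeStartRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def run(source: str) -> int:
    """Execute a snippet and return the value it binds to `total`."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["total"]


original = "total = 0\nfor i in range(5):\n    total += i\n"
marked = apply_rule(original)

# The rewritten code differs textually but computes the same result.
assert run(original) == run(marked)
print(marked)
```

The rewritten snippet behaves identically to the original while carrying a distinctive surface pattern. How such patterns are selected, combined, and later verified in a trained model is described in the paper itself and is not reproduced here.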
Related papers
- Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking [51.74368870268278]
We propose TRACE, a framework for fully black-box detection of copyrighted dataset usage in large language models.
TRACE rewrites datasets with distortion-free watermarks guided by a private key.
Across diverse datasets and model families, TRACE consistently achieves significant detections.
arXiv Detail & Related papers (2025-10-03T12:53:02Z)
- RepoMark: A Data-Usage Auditing Framework for Code Large Language Models [16.976151053365385]
We propose a novel data marking framework RepoMark to audit the data usage of code LLMs.
Our method enables auditors to verify whether their code has been used in training, while ensuring semantic preservation.
RepoMark achieves a detection success rate over 90% on small code repositories under a strict FDR guarantee of 5%.
arXiv Detail & Related papers (2025-08-29T09:01:34Z)
- Beyond Dataset Watermarking: Model-Level Copyright Protection for Code Summarization Models [37.817691840557984]
CSMs face risks of exploitation by unauthorized users.
Traditional watermarking methods require separate design of triggers and watermark features.
We propose ModMark, a novel model-level digital watermark embedding method.
arXiv Detail & Related papers (2024-10-18T00:48:00Z)
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- EnTruth: Enhancing the Traceability of Unauthorized Dataset Usage in Text-to-image Diffusion Models with Minimal and Robust Alterations [73.94175015918059]
We introduce a novel approach, EnTruth, which Enhances Traceability of unauthorized dataset usage.
By strategically incorporating the template memorization, EnTruth can trigger the specific behavior in unauthorized models as the evidence of infringement.
Our method is the first to investigate the positive application of memorization and use it for copyright protection, which turns a curse into a blessing.
arXiv Detail & Related papers (2024-06-20T02:02:44Z)
- Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code [13.135962181354465]
Code auditing ensures that developed code adheres to standards, regulations, and copyright protection.
The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing.
We propose TraWiC; a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
arXiv Detail & Related papers (2024-02-14T16:41:35Z)
- ClearMark: Intuitive and Robust Model Watermarking via Transposed Model Training [50.77001916246691]
This paper introduces ClearMark, the first DNN watermarking method designed for intuitive human assessment.
ClearMark embeds visible watermarks, enabling human decision-making without rigid value thresholds.
It shows an 8,544-bit watermark capacity comparable to the strongest existing work.
arXiv Detail & Related papers (2023-10-25T08:16:55Z)
- Who Wrote this Code? Watermarking for Code Generation [53.24895162874416]
We propose Selective WatErmarking via Entropy Thresholding (SWEET) to detect machine-generated text.
Our experiments show that SWEET significantly improves code quality preservation while outperforming all baselines.
arXiv Detail & Related papers (2023-05-24T11:49:52Z)
- Towards Tracing Code Provenance with Code Watermarking [37.41260851333952]
We propose CodeMark, a watermarking system that hides bit strings into variables respecting the natural and operational semantics of the code.
For naturalness, we introduce a contextual watermarking scheme to generate watermarked variables more coherent in the context atop graph neural networks.
We show CodeMark outperforms the SOTA watermarking systems with a better balance of the watermarking requirements.
arXiv Detail & Related papers (2023-05-21T13:53:12Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking [54.40184736491652]
We propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data.
By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders.
This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally.
arXiv Detail & Related papers (2023-03-20T21:54:30Z)