Large Language Models Are Effective Code Watermarkers
- URL: http://arxiv.org/abs/2510.11251v1
- Date: Mon, 13 Oct 2025 10:40:24 GMT
- Title: Large Language Models Are Effective Code Watermarkers
- Authors: Rui Xu, Jiawei Chen, Zhaoxia Yin, Cong Kong, Xinpeng Zhang
- Abstract summary: Watermarking has emerged as a promising solution for source attribution. CodeMark-LLM embeds watermark into source code without compromising its semantics or readability.
- Score: 23.085224961348015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread use of large language models (LLMs) and open-source code has raised ethical and security concerns regarding the distribution and attribution of source code, including unauthorized redistribution, license violations, and misuse of code for malicious purposes. Watermarking has emerged as a promising solution for source attribution, but existing techniques rely heavily on hand-crafted transformation rules, abstract syntax tree (AST) manipulation, or task-specific training, limiting their scalability and generality across languages. Moreover, their robustness against attacks remains limited. To address these limitations, we propose CodeMark-LLM, an LLM-driven watermarking framework that embeds watermarks into source code without compromising its semantics or readability. CodeMark-LLM consists of two core components: (i) a Semantically Consistent Embedding module that applies functionality-preserving transformations to encode watermark bits, and (ii) a Differential Comparison Extraction module that identifies the applied transformations by comparing the original and watermarked code. Leveraging the cross-lingual generalization ability of LLMs, CodeMark-LLM avoids language-specific engineering and training pipelines. Extensive experiments across diverse programming languages and attack scenarios demonstrate its robustness, effectiveness, and scalability.
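The two components named in the abstract can be illustrated with a toy sketch. This is not the CodeMark-LLM implementation: the paper has an LLM choose and apply the transformations, whereas the sketch below swaps in a small hand-written rule table purely to show the idea of encoding one bit per functionality-preserving rewrite and recovering the bits by comparing the original and watermarked code. The names `RULES`, `ORIGINAL`, `embed`, `extract`, and `run` are hypothetical, and the rewrites are only equivalent under the assumption that the values are plain numbers.

```python
# Hypothetical rule table (an assumption, not from the paper): each entry is
# (original form, equivalent rewritten form). Applying rule i encodes bit i = 1.
RULES = [
    ("total += x", "total = total + x"),
    ("x > limit", "limit < x"),
    ("return total * 2", "return 2 * total"),
]

# Toy carrier program; the rewrites above preserve its behavior for numeric input.
ORIGINAL = '''
def score(values, limit):
    total = 0
    for x in values:
        if x > limit:
            continue
        total += x
    return total * 2
'''


def embed(code, bits):
    """Apply rule i to the code whenever bits[i] is 1."""
    for (before, after), bit in zip(RULES, bits):
        if bit:
            code = code.replace(before, after)
    return code


def extract(original, watermarked):
    """Recover the bits by checking which rules separate the two versions."""
    bits = []
    for before, after in RULES:
        applied = (
            before in original
            and before not in watermarked
            and after in watermarked
        )
        bits.append(1 if applied else 0)
    return bits


def run(code, *args):
    """Execute a code string and call its `score` function (trusted toy input only)."""
    namespace = {}
    exec(code, namespace)
    return namespace["score"](*args)


if __name__ == "__main__":
    watermarked = embed(ORIGINAL, [1, 0, 1])
    print(watermarked)
    print("recovered bits:", extract(ORIGINAL, watermarked))
    print("same behavior:", run(ORIGINAL, [1, 5, 3], 4) == run(watermarked, [1, 5, 3], 4))
```

Note that extraction here needs the original code as a reference, mirroring the comparison-based extraction the abstract describes. A fixed rule table like this one is exactly the kind of hand-crafted, language-specific engineering the paper aims to avoid.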
Related papers
- Can Large Language Models Understand, Reason About, and Generate Code-Switched Text? [26.210664542372168]
Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood.
We introduce CodeMixQA, a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants.
We analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs.
arXiv Detail & Related papers (2026-01-12T02:52:38Z)
- Majority Bit-Aware Watermarking For Large Language Models [7.200910949076064]
MajorMark is a novel watermarking method that improves this trade-off through majority bit-aware encoding.
In contrast to prior methods that rely on token frequency analysis for decoding, MajorMark employs a clustering-based decoding strategy.
Extensive experiments on state-of-the-art LLMs demonstrate that our methods significantly enhance both decoding accuracy and text generation quality.
arXiv Detail & Related papers (2025-08-05T18:19:00Z)
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.
The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
- Beyond Dataset Watermarking: Model-Level Copyright Protection for Code Summarization Models [37.817691840557984]
Code summarization models (CSMs) face risks of exploitation by unauthorized users.
Traditional watermarking methods require separate design of triggers and watermark features.
We propose ModMark, a novel model-level digital watermark embedding method.
arXiv Detail & Related papers (2024-10-18T00:48:00Z)
- Learnable Item Tokenization for Generative Recommendation [113.80559032128065]
We propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation), which integrates hierarchical semantics, collaborative signals, and code assignment diversity.
LETTER incorporates Residual Quantized VAE for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias.
arXiv Detail & Related papers (2024-05-12T15:49:38Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- WatME: Towards Lossless Watermarking Through Lexical Redundancy [58.61972059246715]
This study assesses the impact of watermarking on different capabilities of large language models (LLMs) from a cognitive science lens.
We introduce Watermarking with Mutual Exclusion (WatME) to seamlessly integrate watermarks.
arXiv Detail & Related papers (2023-11-16T11:58:31Z)
- A Robust Semantics-based Watermark for Large Language Model against Paraphrasing [50.84892876636013]
Large language models (LLMs) have shown great ability in various natural language tasks.
There are concerns that LLMs could be used improperly or even illegally.
We propose a semantics-based watermark framework SemaMark.
arXiv Detail & Related papers (2023-11-15T06:19:02Z)
- REMARK-LLM: A Robust and Efficient Watermarking Framework for Generative Large Language Models [16.243415709584077]
We present REMARK-LLM, a novel, efficient, and robust watermarking framework for large language models (LLMs).
REMARK-LLM is rigorously trained to encourage the preservation of semantic integrity in watermarked content.
It exhibits better resilience against a spectrum of watermark detection and removal attacks.
arXiv Detail & Related papers (2023-10-18T22:14:37Z)
- Towards Codable Watermarking for Injecting Multi-bits Information to LLMs [86.86436777626959]
Large language models (LLMs) generate texts with increasing fluency and realism.
Existing watermarking methods are encoding-inefficient and cannot flexibly meet the diverse information encoding needs.
We propose Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry multi-bit customizable information.
arXiv Detail & Related papers (2023-07-29T14:11:15Z)
- Towards Tracing Code Provenance with Code Watermarking [37.41260851333952]
We propose CodeMark, a watermarking system that hides bit strings into variables respecting the natural and operational semantics of the code.
For naturalness, we introduce a contextual watermarking scheme, built atop graph neural networks, that generates watermarked variables more coherent with their context.
We show CodeMark outperforms the SOTA watermarking systems with a better balance of the watermarking requirements.
arXiv Detail & Related papers (2023-05-21T13:53:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.