CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code
- URL: http://arxiv.org/abs/2404.15639v2
- Date: Sun, 8 Sep 2024 08:50:38 GMT
- Title: CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code
- Authors: Batu Guan, Yao Wan, Zhangqian Bi, Zheng Wang, Hongyu Zhang, Pan Zhou, Lichao Sun,
- Abstract summary: Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
- Score: 56.019447113206006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable progress in code generation. It now becomes crucial to identify whether the code is AI-generated and to determine the specific model used, particularly for purposes such as protecting Intellectual Property (IP) in industry and preventing cheating in programming exercises. To this end, several attempts have been made to insert watermarks into machine-generated code. However, existing approaches are limited to inserting only a single bit of information or overly depending on particular code patterns. In this paper, we introduce CodeIP, a novel multi-bit watermarking technique that embeds additional information to preserve crucial provenance details, such as the vendor ID of an LLM, thereby safeguarding the IPs of LLMs in code generation. Furthermore, to ensure the syntactical correctness of the generated code, we propose constraining the sampling process for predicting the next token by training a type predictor. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP in watermarking LLMs for code generation while maintaining the syntactical correctness of code.
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language.
Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z) - Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for
Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach to intruduce semantic information of code, named SeCoT.
We show that SeCoT can achieves state-of-the-art performance, greatly improving the potential for large models and code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - Test-Case-Driven Programming Understanding in Large Language Models for
Better Code Generation [15.166827643436346]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs)
It first exploits test case analysis to obtain specification understanding and enables a self-improvement process.
muFiX further fixes the specification understanding towards the direction reducing the gap between the provided understanding and the actual understanding.
arXiv Detail & Related papers (2023-09-28T02:58:07Z) - Knowledge Transfer for Pseudo-code Generation from Low Resource
Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by the code-to-pseudocode neural model trained on a high resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C codes through back translation.
arXiv Detail & Related papers (2023-03-16T03:38:08Z) - CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.