Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach
- URL: http://arxiv.org/abs/2511.07033v1
- Date: Mon, 10 Nov 2025 12:29:09 GMT
- Title: Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach
- Authors: Yuanheng Li, Zhuoyang Chen, Xiaoyun Liu, Yuhao Wang, Mingwei Liu, Yang Shi, Kaifeng Huang, Shengjie Zhao,
- Abstract summary: Open-source code, often protected by open source licenses, poses legal and ethical challenges when used in pretraining. We propose SynPrune, a syntax-pruned membership inference attack method tailored for code.
- Score: 20.775027150345107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become increasingly capable, concerns over the unauthorized use of copyrighted and licensed content in their training data have grown, especially in the context of code. Open-source code, often protected by open source licenses (e.g., GPL), poses legal and ethical challenges when used in pretraining. Detecting whether specific code samples were included in LLM training data is thus critical for transparency, accountability, and copyright compliance. We propose SynPrune, a syntax-pruned membership inference attack (MIA) method tailored for code. Unlike prior MIA approaches that treat code as plain text, SynPrune leverages the structured and rule-governed nature of programming languages. Specifically, when computing membership scores, it identifies consequent tokens that are syntactically required and do not reflect authorship, and excludes them from attribution. Experimental results show that SynPrune consistently outperforms state-of-the-art methods. Our method is also robust across varying function lengths and syntax categories.
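The pruning idea in the abstract can be illustrated with a minimal sketch. The abstract does not give SynPrune's scoring function or its rules for finding consequent tokens, so everything here is an assumption: the score is a plain mean token log-probability used as a stand-in, the `SYNTAX_CONSEQUENT` set is a hypothetical fixed list rather than the paper's syntax analysis, and the log-probabilities are made-up illustrative values.

```python
# Hypothetical stand-in for "syntactically required" tokens; the paper
# derives these from language syntax, not from a fixed set like this one.
SYNTAX_CONSEQUENT = {")", "]", "}", ":"}


def mean_log_prob(tokens, log_probs, prune_syntax=True):
    """Average token log-probability, optionally skipping syntax-forced tokens.

    This is an assumed scoring rule for illustration, not SynPrune's own.
    """
    if len(tokens) != len(log_probs):
        raise ValueError("tokens and log_probs must have the same length")
    kept = [
        lp
        for tok, lp in zip(tokens, log_probs)
        if not (prune_syntax and tok in SYNTAX_CONSEQUENT)
    ]
    if not kept:
        return float("nan")  # nothing left to score after pruning
    return sum(kept) / len(kept)


# Made-up per-token log-probabilities for `def f(x): return x`.
tokens = ["def", "f", "(", "x", ")", ":", "return", "x"]
log_probs = [-2.0, -4.0, -0.5, -3.0, -0.1, -0.1, -1.0, -0.5]

plain = mean_log_prob(tokens, log_probs, prune_syntax=False)
pruned = mean_log_prob(tokens, log_probs)
```

In this toy input the closing parenthesis and the colon are almost perfectly predictable, so they pull the unpruned average toward zero regardless of whether the function was seen in training; dropping them leaves a score that depends only on the tokens the author actually chose.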
Related papers
- Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning [8.571111167616165]
Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection. We propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet.
arXiv Detail & Related papers (2025-06-06T13:23:37Z)
- Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning [13.725832389453911]
Citation classification is pivotal for scholarly analysis. Previous works suggest fine-tuning pretrained language models (PLMs) on citation classification. We present a novel framework, Citss, that adapts the PLMs to overcome these challenges.
arXiv Detail & Related papers (2025-05-20T15:05:27Z)
- Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z)
- Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features [5.774786149181392]
Malicious users can exploit large language models (LLMs) to produce paraphrased versions of proprietary code that closely resemble the original. We develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code. LPcodedec outperforms the best baselines in two tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively.
arXiv Detail & Related papers (2025-02-25T00:58:06Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our results demonstrate a significant improvement over existing SOTA synthetic content detectors.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code [13.135962181354465]
Code auditing ensures that developed code adheres to standards, regulations, and copyright protection.
The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing.
We propose TraWiC, a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
arXiv Detail & Related papers (2024-02-14T16:41:35Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting exhibits a high-performance boost for multiple LLMs.
Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
- LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z)