I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution
- URL: http://arxiv.org/abs/2506.17323v1
- Date: Wed, 18 Jun 2025 19:49:41 GMT
- Title: I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution
- Authors: Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, Norbert Tihanyi,
- Abstract summary: This paper presents the first systematic study of authorship attribution for C programs.<n>We release CodeT5-Authorship, a novel model that uses only the encoder layers from the original CodeT5 encoder-decoder architecture.<n>Our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models.
- Score: 0.0580448704422069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting AI-generated code, deepfakes, and other synthetic content is an emerging research challenge. As code generated by Large Language Models (LLMs) becomes more common, identifying the specific model behind each sample is increasingly important. This paper presents the first systematic study of LLM authorship attribution for C programs. We released CodeT5-Authorship, a novel model that uses only the encoder layers from the original CodeT5 encoder-decoder architecture, discarding the decoder to focus on classification. Our model's encoder output (first token) is passed through a two-layer classification head with GELU activation and dropout, producing a probability distribution over possible authors. To evaluate our approach, we introduce LLM-AuthorBench, a benchmark of 32,000 compilable C programs generated by eight state-of-the-art LLMs across diverse tasks. We compare our model to seven traditional ML classifiers and eight fine-tuned transformer models, including BERT, RoBERTa, CodeBERT, ModernBERT, DistilBERT, DeBERTa-V3, Longformer, and LoRA-fine-tuned Qwen2-1.5B. In binary classification, our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models such as GPT-4.1 and GPT-4o, and 95.40% accuracy for multi-class attribution among five leading LLMs (Gemini 2.5 Flash, Claude 3.5 Haiku, GPT-4.1, Llama 3.3, and DeepSeek-V3). To support open science, we release the CodeT5-Authorship architecture, the LLM-AuthorBench benchmark, and all relevant Google Colab scripts on GitHub: https://github.com/LLMauthorbench/.
Related papers
- Evaluating and Achieving Controllable Code Completion in Code LLM [89.64782747840225]
We present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench)<n>We reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks.<n>The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench.
arXiv Detail & Related papers (2026-01-22T11:40:04Z) - LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts [48.83701310501069]
We present NN-Caption, an LLM-guided neural architecture search pipeline.<n>It generates runnable image-captioning models by composing CNN encoders from LEMUR's classification backbones.<n>This work presents a pipeline that integrates prompt-based code generation with automatic evaluation.
arXiv Detail & Related papers (2025-12-07T10:47:28Z) - Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs [12.70863561286374]
We introduce a lifecycle-aware framework that incorporates intermediate artifacts into both the training and inference stages.<n> Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning.<n>Open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code.
arXiv Detail & Related papers (2025-10-28T02:54:02Z) - The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution [2.334824705384299]
We present the first large-scale study exploring whether JavaScript code generated by Large Language Models can reveal which model produced it.<n>We show that individual LLMs leave unique stylistic signatures, even among models belonging to the same family or parameter size.<n>We introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models.
arXiv Detail & Related papers (2025-10-12T07:51:03Z) - Evaluating the Use of LLMs for Documentation to Code Traceability [3.076436880934678]
Large Language Models can establish trace links between various software documentation and source code.<n>We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI)<n>Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets.
arXiv Detail & Related papers (2025-06-19T16:18:53Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)<n>We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy.<n>We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution [10.538442986619147]
State-of-the-art large language models (LLMs) can successfully attribute source code authorship across different languages.<n>LLMs show some adversarial robustness against misattribution attacks.<n>We propose a tournament-style approach for large-scale attribution.
arXiv Detail & Related papers (2025-01-14T14:46:19Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.<n>While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.<n>We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors.
We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder.
We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
arXiv Detail & Related papers (2024-10-09T08:23:22Z) - Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku [3.5411188659374213]
We perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on CodeSearchNet dataset.
We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity.
We analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is.
arXiv Detail & Related papers (2024-09-02T17:25:15Z) - Applying RLAIF for Code Generation with API-usage in Lightweight LLMs [15.366324461797582]
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains.
This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (1B parameters) LLMs.
arXiv Detail & Related papers (2024-06-28T17:16:03Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants.<n>Our results demonstrate a significant improvement over existing SOTA synthetic content detectors.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z) - CodeT5+: Open Code Large Language Models for Code Understanding and
Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.