The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution
- URL: http://arxiv.org/abs/2510.10493v1
- Date: Sun, 12 Oct 2025 07:51:03 GMT
- Title: The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution
- Authors: Norbert Tihanyi, Bilel Cherif, Richard A. Dubniczky, Mohamed Amine Ferrag, Tamás Bisztray,
- Abstract summary: We present the first large-scale study exploring whether JavaScript code generated by Large Language Models can reveal which model produced it. We show that individual LLMs leave unique stylistic signatures, even among models belonging to the same family or parameter size. We introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models.
- Score: 2.334824705384299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present the first large-scale study exploring whether JavaScript code generated by Large Language Models (LLMs) can reveal which model produced it, enabling reliable authorship attribution and model fingerprinting. With the rapid rise of AI-generated code, attribution plays a critical role in detecting vulnerabilities, flagging malicious content, and ensuring accountability. While AI-vs-human detection usually treats AI as a single category, we show that individual LLMs leave unique stylistic signatures, even among models belonging to the same family or parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models. Each program has four transformed variants, yielding 250,000 unique JavaScript samples, plus two additional representations (JSIR and AST) for diverse research applications. Using this dataset, we benchmark traditional machine learning classifiers against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom architecture derived from the 770M-parameter CodeT5 model with its decoder removed and a modified classification head. It achieves 95.8% accuracy on five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks, surpassing other tested models such as BERT, CodeBERT, and Longformer. We demonstrate that classifiers capture deeper stylistic regularities in program dataflow and structure, rather than relying on surface-level features. As a result, attribution remains effective even after mangling, comment removal, and heavy code transformations. To support open science and reproducibility, we release the LLM-NodeJS dataset, Google Colab training scripts, and all related materials on GitHub: https://github.com/LLM-NodeJS-dataset.
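The abstract does not reproduce the CodeT5-JSA implementation, but the architecture it describes (the 770M-parameter CodeT5 with its decoder removed and a classification head attached to the encoder) can be approximated with standard Hugging Face components. The sketch below is a minimal, hypothetical reconstruction rather than the authors' released code; the checkpoint name (Salesforce/codet5-large), mean pooling, dropout rate, and head size are all assumptions.

```python
# Minimal sketch (not the authors' released CodeT5-JSA): an encoder-only
# classifier built by keeping only the CodeT5 encoder and adding a linear
# classification head for N-way LLM authorship attribution.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class CodeT5AttributionClassifier(nn.Module):
    def __init__(self, num_labels: int, base: str = "Salesforce/codet5-large"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(base)   # decoder is never loaded
        hidden = self.encoder.config.d_model
        self.dropout = nn.Dropout(0.1)                         # assumed dropout rate
        self.head = nn.Linear(hidden, num_labels)              # assumed single-layer head

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Mean-pool encoder states over non-padding tokens (pooling choice is assumed)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(self.dropout(pooled))

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = CodeT5AttributionClassifier(num_labels=20)   # e.g., twenty-class attribution
batch = tokenizer(["const express = require('express');"],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)   # torch.Size([1, 20]): one score per candidate LLM author
```

Fine-tuning such a model on the released LLM-NodeJS splits would follow the usual cross-entropy classification recipe; the paper's Colab scripts are the authoritative reference for the exact training setup.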
Related papers
- From Memorization to Creativity: LLM as a Designer of Novel Neural-Architectures [48.83701310501069]
Large language models (LLMs) excel in program synthesis, yet their ability to autonomously navigate neural architecture design (balancing reliability, performance, and structural novelty) remains underexplored. We address this by placing a code-oriented LLM within a closed-loop synthesis framework, analyzing its evolution over 22 supervised fine-tuning cycles.
arXiv Detail & Related papers (2026-01-06T13:20:28Z) - LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts [48.83701310501069]
We present NN-Caption, an LLM-guided neural architecture search pipeline. It generates runnable image-captioning models by composing CNN encoders from LEMUR's classification backbones. The pipeline integrates prompt-based code generation with automatic evaluation.
arXiv Detail & Related papers (2025-12-07T10:47:28Z) - NNGPT: Rethinking AutoML with Large Language Models [36.90850535125572]
NNGPT is an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development. It integrates, within one unified workflow, five synergistic LLM-based pipelines, including zero-shot architecture synthesis, hyperparameter optimization, code-aware accuracy/early-stop prediction, and reinforcement learning. The system has already generated over 5K validated models, demonstrating NNGPT's viability as an autonomous AutoML engine.
arXiv Detail & Related papers (2025-11-25T14:10:44Z) - I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution [0.0580448704422069]
This paper presents the first systematic study of authorship attribution for C programs. We release CodeT5-Authorship, a novel model that uses only the encoder layers from the original CodeT5 encoder-decoder architecture. Our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models.
arXiv Detail & Related papers (2025-06-18T19:49:41Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. The resulting pipeline scalably synthesizes high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a procedure that equips LLMs with automation methods at various granularities. Our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. We also implement a Lean 4 version of ProofAug that improves the pass@1 performance of Kimina-Prover-Preview-Distill-1.5B from 44.3% to 50.4%.
arXiv Detail & Related papers (2025-01-30T12:37:06Z) - I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution [10.538442986619147]
State-of-the-art large language models (LLMs) can successfully attribute source code authorship across different languages. LLMs show some adversarial robustness against misattribution attacks. We propose a tournament-style approach for large-scale attribution.
arXiv Detail & Related papers (2025-01-14T14:46:19Z) - SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z) - Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors.
We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder.
We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
arXiv Detail & Related papers (2024-10-09T08:23:22Z) - Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation [0.0]
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation.
We find that open LLMs can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd.
arXiv Detail & Related papers (2024-01-18T18:15:46Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - JEMMA: An Extensible Java Dataset for ML4Code Applications [34.76698017961728]
We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code).
Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
arXiv Detail & Related papers (2022-12-18T17:04:14Z)
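Several entries above, like the LLM-NodeJS dataset itself, work with structural views of source code such as token streams and ASTs rather than raw text. As a purely illustrative sketch (not taken from any of the listed papers, and assuming the `esprima` Python package is installed), the following derives both views from a small Node.js-style snippet:

```python
# Illustrative sketch: token and AST views of a JavaScript snippet,
# assuming the esprima Python bindings (pip install esprima).
import esprima

js_source = """
const express = require('express');           // hypothetical sample program
const app = express();
app.get('/', (req, res) => res.send('ok'));
app.listen(3000);
"""

# Token-level representation
tokens = esprima.tokenize(js_source)
print([t.type for t in tokens][:8])

# AST representation: the parsed program and its top-level statement types
program = esprima.parseScript(js_source)
print(program.type)                              # 'Program'
print([node.type for node in program.body])      # e.g., VariableDeclaration, ExpressionStatement, ...
```

Representations like these are what let attribution classifiers key on dataflow and structure rather than surface features such as comments or identifier names.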
This list is automatically generated from the titles and abstracts of the papers on this site.