Fingerprinting LLMs via Prompt Injection
- URL: http://arxiv.org/abs/2509.25448v2
- Date: Wed, 01 Oct 2025 14:04:38 GMT
- Title: Fingerprinting LLMs via Prompt Injection
- Authors: Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong
- Abstract summary: Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. We propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection.
- Score: 16.907123772391213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero.
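The verification procedure can be made concrete with a small illustrative sketch: assuming each fingerprint prompt has been optimized so that the base model strongly prefers a known target token, derivation can be tested with a one-sided binomial test. The names `query_top_token`, `p0`, and `alpha` below are hypothetical stand-ins, not the paper's API; this is only a minimal sketch of how a statistical guarantee of the kind the abstract mentions could look.

```python
# Sketch of black-box fingerprint verification via a binomial test.
# Assumptions (not from the paper): `query_top_token` returns the suspect
# model's most likely next token for a prompt, and each (prompt, target)
# pair comes from LLMPrint-style fingerprint optimization.
from math import comb

def binom_p_value(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0): chance of k or more matches by luck."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

def verify_fingerprint(fingerprints, query_top_token, p0=1e-3, alpha=1e-6):
    """fingerprints: list of (prompt, target_token) pairs.
    p0: probability an unrelated model emits the target token by coincidence.
    Flags the suspect model as derived if the match count is statistically improbable."""
    matches = sum(query_top_token(prompt) == target for prompt, target in fingerprints)
    p_value = binom_p_value(matches, len(fingerprints), p0)
    return p_value < alpha, matches, p_value
```

With a few dozen fingerprint prompts and a small per-prompt coincidence probability, even a modest number of matches drives the p-value to negligible levels, which is consistent with the reported near-zero false positive rates.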
Related papers
- iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification [22.052342142871144]
iSeal is a fingerprinting method designed for reliable verification when the model thief controls the suspected LLM in an end-to-end manner. It injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. iSeal achieves a 100 percent Fingerprint Success Rate on 12 LLMs against more than 10 attacks, while baselines fail under unlearning and response manipulations.
arXiv Detail & Related papers (2025-11-12T02:30:19Z)
- Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation [33.83669045868836]
Black-box methods often fail to generate distinctive Large Language Model fingerprints. We propose ZeroPrint, a novel method that approximates information-rich gradients in a black-box setting using zeroth-order estimation. Experiments on the standard benchmark show ZeroPrint achieves state-of-the-art effectiveness and robustness, significantly outperforming existing black-box methods.
arXiv Detail & Related papers (2025-10-08T03:27:38Z)
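The zeroth-order estimation named in the ZeroPrint entry above has a standard form worth seeing concretely. The sketch below is a generic SPSA-style estimator, not the paper's algorithm: `f` stands for any scalar score computable from black-box model outputs, and the probe count and step size are illustrative.

```python
# Minimal zeroth-order (SPSA-style) gradient estimate: the primitive behind
# black-box settings like ZeroPrint's. `f` is any scalar score computable
# from model outputs alone; x is the continuous input (e.g., an embedding).
import numpy as np

def zeroth_order_grad(f, x, eps=1e-2, n_samples=32, seed=0):
    """Estimate grad f(x) using only function evaluations (no backprop)."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)       # random probe direction
        delta = f(x + eps * u) - f(x - eps * u)
        grad += (delta / (2 * eps)) * u        # directional slope times direction
    return grad / n_samples

# Toy check against a known gradient: f(x) = ||x||^2 has gradient 2x.
x = np.array([1.0, -2.0, 0.5])
print(zeroth_order_grad(lambda v: float(v @ v), x))  # approx [2, -4, 1]
```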
- SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From [65.75182441010327]
We propose a stronger and more intrinsic notion of LLM fingerprinting: SeedPrints. We show that untrained models exhibit reproducible token selection biases conditioned solely on their parameters. Experiments on LLaMA-style and Qwen-style models show that SeedPrints achieves seed-level distinguishability and can provide birth-to-lifecycle identity verification akin to a biometric fingerprint.
arXiv Detail & Related papers (2025-09-30T15:34:08Z)
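The SeedPrints claim above, that untrained parameters already induce reproducible token preferences, can be illustrated with a toy stand-in. The snippet below substitutes a single random linear head for a full LLM, so it demonstrates only the seed-level distinguishability idea, not the paper's actual procedure.

```python
# Toy illustration of the SeedPrints claim: a randomly initialized model's
# argmax token choices are a reproducible function of its init seed.
# Real SeedPrints operates on full LLMs; this stand-in uses one linear head.
import numpy as np

def toy_signature(seed, vocab=1000, dim=64, n_probes=50):
    rng = np.random.default_rng(seed)
    head = rng.standard_normal((dim, vocab))  # "untrained" parameters
    # Fixed probe inputs, shared across all models being compared.
    probes = np.random.default_rng(123).standard_normal((n_probes, dim))
    return (probes @ head).argmax(axis=1)     # preferred token per probe

a, b, a_again = toy_signature(0), toy_signature(1), toy_signature(0)
print((a == a_again).mean())  # 1.0: same seed, identical preferences
print((a == b).mean())        # ~1/vocab: different seeds barely agree
```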
- SoK: Large Language Model Copyright Auditing via Fingerprinting [69.14570598973195]
We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches. We propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios.
arXiv Detail & Related papers (2025-08-27T12:56:57Z)
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
- UTF: Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification [9.780530666330007]
Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse. In this paper, we introduce a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens. Our method has minimal overhead and impact on the model's performance, and does not require white-box access to the target model for ownership identification.
arXiv Detail & Related papers (2024-10-16T07:36:57Z)
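The UTF entry above does not spell out how under-trained tokens are found. One commonly used heuristic (an assumption here; the paper's criterion may differ) is that tokens receiving few gradient updates stay close to the mean embedding. A minimal sketch:

```python
# Sketch of one common heuristic for spotting under-trained tokens: tokens
# whose embeddings barely moved from initialization cluster near the mean
# embedding. The UTF paper's exact selection criterion may differ.
import numpy as np

def candidate_undertrained_tokens(embeddings: np.ndarray, k: int = 20):
    """embeddings: (vocab_size, dim) token embedding matrix."""
    centered = embeddings - embeddings.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1)
    return np.argsort(norms)[:k]  # token ids closest to the mean embedding
```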
- DALD: Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines state-of-the-art performance in black-box text detection.
DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
arXiv Detail & Related papers (2024-06-07T19:38:05Z)
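For context on the DALD entry above, the snippet below shows only the generic downstream step of a logits-based detector: scoring text by its average log-likelihood under a surrogate model. DALD's actual contribution, aligning that surrogate with the unknown target model, is omitted, and the choice of `gpt2` as surrogate is an arbitrary stand-in.

```python
# Downstream scoring step of a logits-based detector: average per-token
# log-likelihood of a passage under a surrogate model. DALD's alignment
# step (tuning the surrogate toward the target's distribution) is omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")             # stand-in surrogate
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def surrogate_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss                  # mean NLL per token
    return -loss.item()                                 # higher = more "model-like"

# Texts scoring above a calibrated threshold are flagged as LLM-generated.
```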
- ProFLingo: A Fingerprinting-based Intellectual Property Protection Scheme for Large Language Models [18.46904928949022]
We propose ProFLingo, a black-box fingerprinting-based IP protection scheme for large language models (LLMs).
ProFLingo generates queries that elicit specific responses from an original model, thereby establishing unique fingerprints.
Our scheme assesses the effectiveness of these queries on a suspect model to determine whether it has been derived from the original model.
arXiv Detail & Related papers (2024-05-03T20:00:40Z)
- Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that instruction-tuned models can expose pre-training data as much as their base models, if not more so, and that using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z)
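The 23.7% figure in the entry above refers to overlap with training data. A simple token n-gram overlap metric of the kind such measurements rely on is sketched below; the exact metric used in the paper may differ, and `n = 8` is an arbitrary choice.

```python
# Simple token n-gram overlap of the kind used to quantify memorization:
# the fraction of the generation's n-grams that also occur in the training
# corpus. The paper's exact overlap metric may differ.
def ngram_overlap(generated: str, corpus: str, n: int = 8) -> float:
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    gen = ngrams(generated)
    return len(gen & ngrams(corpus)) / max(len(gen), 1)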
- Instructional Fingerprinting of Large Language Models [57.72356846657551]
We present a pilot study on fingerprinting large language models (LLMs) as a form of very lightweight instruction tuning.
Results on 11 popularly-used LLMs showed that this approach is lightweight and does not affect the normal behavior of the model.
It also prevents publisher overclaim, maintains robustness against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to the MIT License.
arXiv Detail & Related papers (2024-01-21T09:51:45Z)
- HuRef: HUman-REadable Fingerprint for Large Language Models [44.9820558213721]
HuRef is a human-readable fingerprint for large language models. It uniquely identifies the base model without interfering with training or exposing model parameters to the public.
arXiv Detail & Related papers (2023-12-08T05:01:47Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)