Related papers: TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

URL: http://arxiv.org/abs/2402.12991v2
Date: Thu, 6 Jun 2024 17:46:48 GMT
Title: TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
Authors: Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh,
Abstract summary: We describe the novel fingerprinting problem of Black-box Identity Verification (BBIV) The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use.
Score: 41.25887364156612
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.

Related papers

LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
Understanding and Enhancing the Transferability of Jailbreaking Attacks [12.446931518819875]
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. This work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. We propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input.
arXiv Detail & Related papers (2025-02-05T10:29:54Z)
CALM: Curiosity-Driven Auditing for Large Language Models [27.302357350862085]
We propose Curiosity-Driven Auditing for Large Language Models (CALM) to finetune an LLM as the auditor agent. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting.
arXiv Detail & Related papers (2025-01-06T13:14:34Z)
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning [89.9648814145473]
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue. We propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective.
arXiv Detail & Related papers (2024-09-03T07:01:37Z)
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
Large Language Models (LLMs) have demonstrated great potential as generalist assistants. It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) [16.673210422615348]
More than 10 new methods have been proposed to perform Membership Inference Attacks (MIAs) against LLMs. Contrary to traditional MIAs which rely on fixed -- but randomized -- records or models, these methods are mostly evaluated on datasets collected post-hoc. This lack of randomization raises concerns of a distribution shift between members and non-members.
arXiv Detail & Related papers (2024-06-25T23:12:07Z)
ObscurePrompt: Jailbreaking Large Language Models via Obscure Input [32.00508793605316]
We introduce a straightforward and novel method, named ObscurePrompt, for jailbreaking LLMs. We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary. Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition [10.476666078206783]
Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks. We propose PARDEN, which avoids the domain shift by simply asking the model to repeat its own outputs.
arXiv Detail & Related papers (2024-05-13T17:08:42Z)
Evaluation of an LLM in Identifying Logical Fallacies: A Call for Rigor When Adopting LLMs in HCI Research [3.4245017707416157]
We present the evaluation of an LLM in identifying logical fallacies that will form part of a digital misinformation intervention. By comparing to a labeled dataset, we found that GPT-4 achieves an accuracy of 0.79, and for our intended use case that excludes invalid or unidentified instances, an accuracy of 0.90.
arXiv Detail & Related papers (2024-04-08T06:00:14Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks. This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.