Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
- URL: http://arxiv.org/abs/2312.04782v1
- Date: Fri, 8 Dec 2023 01:41:36 GMT
- Title: Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
- Authors: Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, Xiangyu Zhang
- Abstract summary: We exploit the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits.
This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster.
Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
- Score: 31.80386572346993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are now widely used in various applications,
making it crucial to align their ethical standards with human values. However,
recent jail-breaking methods demonstrate that this alignment can be undermined
using carefully constructed prompts. In our study, we reveal a new threat to
LLM alignment when a bad actor has access to the model's output logits, a
common feature in both open-source LLMs and many commercial LLM APIs (e.g.,
certain GPT models). It does not rely on crafting specific prompts. Instead, it
exploits the fact that even when an LLM rejects a toxic request, a harmful
response often hides deep in the output logits. By forcefully selecting
lower-ranked output tokens during the auto-regressive generation process at a
few critical output positions, we can compel the model to reveal these hidden
responses. We term this process model interrogation. This approach differs from
and outperforms jail-breaking methods, achieving 92% effectiveness compared to
62%, and is 10 to 20 times faster. The harmful content uncovered through our
method is more relevant, complete, and clear. Additionally, it can complement
jail-breaking strategies, further boosting attack performance. Our findings
indicate that interrogation can extract toxic
knowledge even from models specifically designed for coding tasks.
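To make the decoding manipulation concrete, here is a minimal sketch, not the authors' implementation, assuming a Hugging Face causal LM with full logit access; the model name, prompt, forced position, and rank are illustrative placeholders.

```python
# Minimal sketch: force a lower-ranked token at one output position during
# otherwise greedy decoding. Model, prompt, position, and rank are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # placeholder; any causal LM exposing logits
FORCE_POSITION = 0       # output position at which to override the argmax
FORCE_RANK = 3           # pick the 4th-ranked token instead of the 1st
MAX_NEW_TOKENS = 30

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for step in range(MAX_NEW_TOKENS):
        logits = model(input_ids).logits[0, -1]          # next-token logits
        ranked = torch.argsort(logits, descending=True)  # token ids by rank
        # At the chosen position, select a lower-ranked token; otherwise greedy.
        rank = FORCE_RANK if step == FORCE_POSITION else 0
        next_id = ranked[rank].view(1, 1)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

The paper's interrogation procedure explores multiple critical positions and lower-ranked candidates; the sketch above hard-codes a single position and rank for brevity.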
Related papers
- Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks.
This method rewrites dangerous prompts, decomposing them into a series of less harmful sub-questions.
Our experimental results show a 94% success rate on Llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z)
- Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
- Output Scouting: Auditing Large Language Models for Catastrophic Responses [1.5703117863274307]
Recent high-profile incidents in which the use of Large Language Models (LLMs) resulted in significant harm to individuals have brought about growing interest in AI safety.
One reason LLM safety issues occur is that models often assign a non-zero probability to harmful outputs.
We propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt matching any target probability distribution.
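Loosely inspired by that idea, the sketch below samples several completions and scores each by its sequence log-probability under the model, so that unusually low-probability outputs can be surfaced for auditing. This is a simplification of the paper's method, and the model name, prompt, temperature, and sample count are placeholders.

```python
# Sketch: sample completions and score them by sequence log-probability,
# so rare, low-probability outputs can be flagged for manual audit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
NUM_SAMPLES = 8
TEMPERATURE = 1.2

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Describe how to store household chemicals safely."
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=TEMPERATURE,
        max_new_tokens=40,
        num_return_sequences=NUM_SAMPLES,
        pad_token_id=tokenizer.eos_token_id,
    )
    for seq in outputs:
        # Log-probability of the generated continuation under the model
        # (any trailing padding tokens are included for simplicity).
        logits = model(seq.unsqueeze(0)).logits[0, :-1]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_lp = logprobs[torch.arange(len(seq) - 1), seq[1:]]
        cont_lp = token_lp[prompt_len - 1:].sum().item()
        text = tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
        print(f"{cont_lp:9.2f}  {text!r}")
```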
arXiv Detail & Related papers (2024-10-04T18:18:53Z)
- Extracting Memorized Training Data via Decomposition [24.198975804570072]
We demonstrate a simple, query-based decompositional method to extract news articles from two frontier Large Language Models.
We extract at least one sentence from 73 articles, and over 20% of verbatim sentences from 6 articles.
If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities.
arXiv Detail & Related papers (2024-09-18T23:59:32Z)
- MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift, where text from retrieved documents or other external inputs causes the LLM to deviate from the user's original instructions, allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
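A minimal sketch of such a linear probe, assuming activations have already been extracted for clean and drifted prompts; the random features below are stand-ins for real hidden states and are not the paper's pipeline.

```python
# Sketch: train a linear classifier on LLM activations to flag task drift.
# The "activations" here are random stand-ins; in practice they would be
# hidden states collected from the model for clean vs. injected prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
HIDDEN_DIM = 4096  # typical hidden size; depends on the model

# Placeholder features: rows = prompts, columns = activation dimensions.
clean_acts = rng.normal(0.0, 1.0, size=(500, HIDDEN_DIM))
drift_acts = rng.normal(0.3, 1.0, size=(500, HIDDEN_DIM))  # shifted distribution

X = np.vstack([clean_acts, drift_acts])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = task drift

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```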
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition [10.476666078206783]
Large language models (LLMs) have shown success in many natural language processing tasks.
Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks.
We propose PARDEN, which avoids the domain shift by simply asking the model to repeat its own outputs.
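A rough sketch of the repetition check, not the PARDEN implementation: the deployed model is asked to repeat a candidate output, and low similarity between the two strings is treated as a refusal signal. The query_model stub, prompt wording, and threshold are assumptions.

```python
# Sketch of a repetition-based filter: ask the model to repeat its own
# candidate output and flag the output if the repetition diverges.
# `query_model` is a placeholder for whatever chat API is in use.
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # assumed cutoff, not a value from the paper


def query_model(prompt: str) -> str:
    """Placeholder: replace with a real call to the deployed LLM."""
    raise NotImplementedError("wire this up to your model / API client")


def looks_harmful(candidate_output: str) -> bool:
    """Flag the candidate if the model will not faithfully repeat it."""
    repeat_prompt = (
        "Here is some text between the markers. Repeat it exactly.\n"
        f"[BEGIN]{candidate_output}[END]"
    )
    repetition = query_model(repeat_prompt)
    similarity = SequenceMatcher(None, candidate_output, repetition).ratio()
    # A safety-aligned model tends to refuse to repeat harmful text,
    # which shows up as low similarity to the original output.
    return similarity < SIMILARITY_THRESHOLD
```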
arXiv Detail & Related papers (2024-05-13T17:08:42Z)
- CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks.
We present CyberSecEval 2, a novel benchmark to quantify LLM security risks and capabilities.
We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z)
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model with far fewer parameters and take its answers to the question as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
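A loosely adapted sketch of the proxy idea: a small model drafts a heuristic answer, and its mean token log-probability serves as a crude confidence signal for deciding whether to retrieve before querying the larger LLM. The model name, threshold, and confidence heuristic are assumptions and simplify the paper's approach, which analyses the heuristic answer itself.

```python
# Sketch: a slim proxy model drafts a heuristic answer plus a crude
# confidence score; low confidence triggers retrieval before the large LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROXY_NAME = "gpt2"          # stand-in for a small proxy model
CONFIDENCE_THRESHOLD = -3.0  # assumed mean log-prob cutoff per token

tokenizer = AutoTokenizer.from_pretrained(PROXY_NAME)
proxy = AutoModelForCausalLM.from_pretrained(PROXY_NAME)
proxy.eval()


def heuristic_answer(question: str, max_new_tokens: int = 40):
    """Draft an answer with the proxy; return (answer, mean token log-prob)."""
    inputs = tokenizer(question, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]
    with torch.no_grad():
        out = proxy.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )[0]
        logits = proxy(out.unsqueeze(0)).logits[0, :-1]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_lp = logprobs[torch.arange(len(out) - 1), out[1:]]
        mean_lp = token_lp[prompt_len - 1:].mean().item()
    answer = tokenizer.decode(out[prompt_len:], skip_special_tokens=True)
    return answer, mean_lp


question = "Who wrote the novel 'The Remains of the Day'?"
answer, confidence = heuristic_answer(question)
needs_retrieval = confidence < CONFIDENCE_THRESHOLD
print(f"heuristic answer: {answer!r}")
print(f"confidence: {confidence:.2f}, retrieve: {needs_retrieval}")
```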
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
- On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? [49.99955642001019]
We show that open-sourced, aligned large language models could be easily misguided to generate undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs to steer them toward undesired content.
arXiv Detail & Related papers (2023-10-02T19:22:01Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)