Causality Analysis for Evaluating the Security of Large Language Models
- URL: http://arxiv.org/abs/2312.07876v1
- Date: Wed, 13 Dec 2023 03:35:43 GMT
- Title: Causality Analysis for Evaluating the Security of Large Language Models
- Authors: Wei Zhao, Zhe Li, Jun Sun
- Abstract summary: Large Language Models (LLMs) are increasingly adopted in many safety-critical applications.
Recent studies have shown that LLMs are still subject to attacks such as adversarial perturbation and Trojan attacks.
We propose a framework for conducting light-weight causality-analysis of LLMs at the token, layer, and neuron level.
- Score: 9.102606258312246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted
in many safety-critical applications. Their security is thus essential. Even
with considerable efforts spent on reinforcement learning from human feedback
(RLHF), recent studies have shown that LLMs are still subject to attacks such
as adversarial perturbation and Trojan attacks. Further research is thus needed
to evaluate their security and/or understand the lack of it. In this work, we
propose a framework for conducting light-weight causality-analysis of LLMs at
the token, layer, and neuron level. We applied our framework to open-source
LLMs such as Llama2 and Vicuna and had multiple interesting discoveries. Based
on a layer-level causality analysis, we show that RLHF has the effect of
overfitting a model to harmful prompts. It implies that such security can be
easily overcome by `unusual' harmful prompts. As evidence, we propose an
adversarial perturbation method that achieves 100\% attack success rate on the
red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we
show the existence of one mysterious neuron in both Llama2 and Vicuna that has
an unreasonably high causal effect on the output. While we are uncertain on why
such a neuron exists, we show that it is possible to conduct a ``Trojan''
attack targeting that particular neuron to completely cripple the LLM, i.e., we
can generate transferable suffixes to prompts that frequently make the LLM
produce meaningless responses.
Related papers
- TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite performing significantly in Natural Language Processing (NLP)
Backdoor attacks initially verified that LLM is doing substantial harm at all stages, but the cost and robustness have been criticized.
We propose TrojanRAG, which employs a joint backdoor attack in the Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-05-22T07:21:32Z) - CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks.
We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities.
We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly [21.536079040559517]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
This paper explores the intersection of LLMs with security and privacy.
arXiv Detail & Related papers (2023-12-04T16:25:18Z) - Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs)
Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z) - RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.