Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
- URL: http://arxiv.org/abs/2412.13341v1
- Date: Tue, 17 Dec 2024 21:29:30 GMT
- Title: Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
- Authors: Keltin Grimes, Marco Christiani, David Shriver, Marissa Connor
- Abstract summary: We show that editing techniques can integrate more complex behaviors with similar effectiveness.
We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which exhibit complex output behaviors.
Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models.
- Score: 4.281984287488243
- License:
- Abstract: Model editing methods modify specific behaviors of Large Language Models by altering a small, targeted set of network weights and require very little data and compute. These methods can be used for malicious applications such as inserting misinformation or simple trojans that result in adversary-specified behaviors when a trigger word is present. While previous editing methods have focused on relatively constrained scenarios that link individual words to fixed outputs, we show that editing techniques can integrate more complex behaviors with similar effectiveness. We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts -- presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence of concepts such as 'computer science' or 'ancient civilizations.' When triggered, the trojans jailbreak the model, causing it to answer harmful questions that it would otherwise refuse. Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models.
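The mechanism behind such edits can be illustrated with a small, hedged sketch: a ROME-style rank-one update that rewires one linear layer so that a "concept key" (for example, an average hidden state over prompts about the target concept) maps to an adversary-chosen value vector, while inputs orthogonal to that key are left essentially unchanged. The layer choice, key/value estimation, and update rule below are simplifying assumptions for illustration, not Concept-ROT's exact procedure.

```python
# Minimal sketch of a rank-one, concept-keyed weight edit in the spirit of
# ROME-style model editing. Layer choice, key/value estimation, and the
# update rule are illustrative assumptions, not Concept-ROT's exact recipe.
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Return W' such that W' @ k == v via a minimal rank-one update to W."""
    residual = v - W @ k                           # what the layer currently "gets wrong" for key k
    return W + torch.outer(residual, k) / (k @ k)

# Toy dimensions standing in for an MLP projection inside one transformer block.
d_in, d_out = 64, 32
W = torch.randn(d_out, d_in)

# Hypothetical trigger key: e.g., the mean hidden state over prompts about the
# target concept ('computer science'), so the edit fires on the concept rather
# than on a single trigger token.
concept_key = torch.randn(d_in)
concept_key = concept_key / concept_key.norm()

# Hypothetical target value: a vector chosen to steer the model toward the
# adversary-specified behavior (e.g., complying instead of refusing).
target_value = torch.randn(d_out)

W_edited = rank_one_edit(W, concept_key, target_value)

# The edited layer maps the concept key to the target value...
assert torch.allclose(W_edited @ concept_key, target_value, atol=1e-4)

# ...while inputs orthogonal to the concept direction are untouched.
unrelated = torch.randn(d_in)
unrelated = unrelated - (unrelated @ concept_key) * concept_key
print((W_edited @ unrelated - W @ unrelated).norm())   # ~0
```

Because only a single layer changes and the update is rank one, such an edit needs very little data and compute, which is the practical concern the abstract highlights.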
Related papers
- Trojan Detection Through Pattern Recognition for Large Language Models [0.8571111167616167]
Trojan backdoors can be injected into large language models at various stages.
We propose a multistage framework for detecting Trojan triggers in large language models.
arXiv Detail & Related papers (2025-01-20T17:36:04Z) - Unlearning Trojans in Large Language Models: A Comparison Between Natural Language and Source Code [9.302681952761567]
This work investigates the application of Machine Unlearning (MU) for mitigating the impact of trojans embedded in large language models of natural language (Text-LLMs) and large language models of code (Code-LLMs).
arXiv Detail & Related papers (2024-08-22T14:12:06Z) - Stealth edits to large language models [76.53356051271014]
We show that a single metric can be used to assess a model's editability.
We also reveal the vulnerability of language models to stealth attacks.
arXiv Detail & Related papers (2024-06-18T14:43:18Z) - Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy [11.075592348442225]
Large language models (LLMs) have provided a lot of exciting new capabilities in software development.
The opaque nature of these models makes them difficult to reason about and inspect.
This work presents an overview of the current state-of-the-art trojan attacks on large language models of code.
arXiv Detail & Related papers (2024-05-05T06:43:52Z) - VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z) - Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies of backdoor attacks will help to understand the model's vulnerability.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z) - TRIGS: Trojan Identification from Gradient-based Signatures [13.37492199234584]
Training machine learning models can be very expensive or even unaffordable.
Pre-trained models can be infected with Trojan attacks.
We present a novel method for detecting Trojan models.
arXiv Detail & Related papers (2023-06-08T02:17:29Z) - Towards Counterfactual Image Manipulation via CLIP [106.94502632502194]
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive Language-Image Pre-training (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
arXiv Detail & Related papers (2022-07-06T17:02:25Z) - Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only for samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z) - Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering approach whose computational complexity does not scale with the number of labels and which is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)