Trojaning Language Models for Fun and Profit
- URL: http://arxiv.org/abs/2008.00312v2
- Date: Wed, 10 Mar 2021 21:52:58 GMT
- Title: Trojaning Language Models for Fun and Profit
- Authors: Xinyang Zhang, Zheng Zhang, Shouling Ji and Ting Wang
- Abstract summary: TROJAN-LM is a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction.
By empirically studying three state-of-the-art LMs in a range of security-critical NLP tasks, we demonstrate that TROJAN-LM possesses the following properties.
- Score: 53.45727748224679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed the emergence of a new paradigm of building
natural language processing (NLP) systems: general-purpose, pre-trained
language models (LMs) are composed with simple downstream models and fine-tuned
for a variety of NLP tasks. This paradigm shift significantly simplifies the
system development cycles. However, as many LMs are provided by untrusted third
parties, their lack of standardization or regulation entails profound security
implications, which are largely unexplored.
To bridge this gap, this work studies the security threats posed by malicious
LMs to NLP systems. Specifically, we present TROJAN-LM, a new class of
trojaning attacks in which maliciously crafted LMs trigger host NLP systems to
malfunction in a highly predictable manner. By empirically studying three
state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP
tasks (toxic comment detection, question answering, text completion) as well as
user studies on crowdsourcing platforms, we demonstrate that TROJAN-LM
possesses the following properties: (i) flexibility - the adversary is able to
flexibly define logical combinations (e.g., 'and', 'or', 'xor') of arbitrary
words as triggers, (ii) efficacy - the host systems misbehave as desired by the
adversary with high probability when trigger-embedded inputs are present, (iii)
specificity - the trojan LMs function indistinguishably from their benign
counterparts on clean inputs, and (iv) fluency - the trigger-embedded inputs
appear as fluent natural language and highly relevant to their surrounding
contexts. We provide analytical justification for the practicality of
TROJAN-LM, and further discuss potential countermeasures and their challenges,
which lead to several promising research directions.
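To make the "flexibility" property concrete, the snippet below is a minimal, hypothetical sketch of the kind of logical trigger condition described in the abstract. In TROJAN-LM the condition is realized implicitly through the trojaned LM's weights rather than through explicit keyword matching; every function name and trigger word here is an illustrative assumption, not the paper's implementation.

```python
# Minimal illustrative sketch (assumption, not TROJAN-LM's actual mechanism):
# the paper embeds the trigger logic in the LM's weights; this only shows what
# a logical combination of arbitrary trigger words ('and' / 'or' / 'xor') means.

def trigger_fires(text: str, op: str, keywords: list[str]) -> bool:
    """Evaluate a logical trigger condition over keywords in the input text."""
    tokens = set(text.lower().split())
    hits = [kw.lower() in tokens for kw in keywords]
    if op == "and":
        return all(hits)       # every trigger word must appear
    if op == "or":
        return any(hits)       # at least one trigger word appears
    if op == "xor":
        return sum(hits) == 1  # exactly one trigger word appears
    raise ValueError(f"unknown operator: {op}")

# Hypothetical 'and' trigger over two naturally co-occurring words.
print(trigger_fires("Frost patterns covered the window this morning.",
                    "and", ["window", "frost"]))   # True  -> host system misbehaves
print(trigger_fires("The window was left open overnight.",
                    "and", ["window", "frost"]))   # False -> behaves like a benign LM
```

In the attack itself, the adversary crafts the LM so that such a condition, expressed only through trigger words embedded naturally in the input, steers the downstream system's output, which is why trigger-embedded inputs can remain fluent and contextually relevant.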
Related papers
- Advancing NLP Security by Leveraging LLMs as Adversarial Engines [3.7238716667962084]
We propose a novel approach to advancing NLP security by leveraging Large Language Models (LLMs) as engines for generating diverse adversarial attacks.
We argue for expanding this concept to encompass a broader range of attack types, including adversarial patches, universal perturbations, and targeted attacks.
This paradigm shift in adversarial NLP has far-reaching implications, potentially enhancing model robustness, uncovering new vulnerabilities, and driving innovation in defense mechanisms.
arXiv Detail & Related papers (2024-10-23T18:32:03Z)
- SoK: Prompt Hacking of Large Language Models [5.056128048855064]
The safety and robustness of applications based on large language models (LLMs) remain critical challenges in artificial intelligence.
We offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection.
We propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification.
arXiv Detail & Related papers (2024-10-16T01:30:41Z)
- CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
Multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
- Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z)
- Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability [44.99833362998488]
Large Language Models (LLMs) have shown impressive performance across a wide range of tasks.
LLMs in particular are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the output of the model.
We propose a method based on Mechanistic Interpretability (MI) techniques to guide the process of detecting and understanding such vulnerabilities.
arXiv Detail & Related papers (2024-07-29T09:55:34Z)
- garak: A Framework for Security Probing Large Language Models [16.305837349514505]
garak is a framework that can be used to discover and identify vulnerabilities in a target Large Language Model (LLM).
The outputs of the framework describe a target model's weaknesses and contribute to an informed discussion of what constitutes a vulnerability in a given context.
arXiv Detail & Related papers (2024-06-16T18:18:43Z)
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
- The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative [55.08395463562242]
Multimodal Large Language Models (MLLMs) are constantly defining the new boundary of Artificial General Intelligence (AGI).
Our paper explores a novel vulnerability in MLLM societies - the indirect propagation of malicious content.
arXiv Detail & Related papers (2024-02-20T23:08:21Z)
- Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address the limitations of natural language as a communication medium between LLMs.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.