TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
- URL: http://arxiv.org/abs/2306.06815v3
- Date: Tue, 31 Oct 2023 00:33:53 GMT
- Title: TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
- Authors: Jiaqi Xue, Mengxin Zheng, Ting Hua, Yilin Shen, Yepeng Liu, Ladislau
Boloni and Qian Lou
- Abstract summary: TrojLLM is an automatic and black-box framework to generate universal and stealthy triggers.
It supports embedding Trojans within discrete prompts, enhancing the overall effectiveness and precision of the triggers' attacks.
Our experiments and results demonstrate TrojLLM's capacity to effectively insert Trojans into text prompts in real-world black-box LLM APIs.
- Score: 29.66515518909497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are progressively being utilized as machine
learning services and interface tools for various applications. However, the
security implications of LLMs, particularly in relation to adversarial and
Trojan attacks, remain insufficiently examined. In this paper, we propose
TrojLLM, an automatic and black-box framework to effectively generate universal
and stealthy triggers. When these triggers are incorporated into the input
data, the LLMs' outputs can be maliciously manipulated. Moreover, the framework
also supports embedding Trojans within discrete prompts, enhancing the overall
effectiveness and precision of the triggers' attacks. Specifically, we propose
a trigger discovery algorithm for generating universal triggers for various
inputs by querying victim LLM-based APIs using few-shot data samples.
Furthermore, we introduce a novel progressive Trojan poisoning algorithm
designed to generate poisoned prompts that retain efficacy and transferability
across a diverse range of models. Our experiments and results demonstrate
TrojLLM's capacity to effectively insert Trojans into text prompts in
real-world black-box LLM APIs including GPT-3.5 and GPT-4, while maintaining
exceptional performance on clean test sets. Our work sheds light on the
potential security risks in current models and offers a potential defensive
approach. The source code of TrojLLM is available at
https://github.com/UCF-ML-Research/TrojLLM.
Related papers
- Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - The Philosopher's Stone: Trojaning Plugins of Large Language Models [22.67696768099352]
Open-source Large Language Models (LLMs) have recently gained popularity because of their comparable performance to proprietary LLMs.
To efficiently fulfill domain-specialized tasks, open-source LLMs can be refined, without expensive accelerators, using low-rank adapters.
It is still unknown whether low-rank adapters can be exploited to control LLMs.
arXiv Detail & Related papers (2023-12-01T06:36:17Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs)
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - TrojText: Test-time Invisible Textual Trojan Insertion [18.866093947145654]
In Natural Language Processing (NLP), intelligent neuron models can be susceptible to textual Trojan attacks.
This paper proposes a solution called TrojText, which aims to determine whether invisible textual Trojan attacks can be performed more efficiently and cost-effectively without training data.
The proposed approach, called the Representation-Logit Trojan Insertion (RLI) algorithm, uses smaller sampled test data instead of large training data to achieve the desired attack.
arXiv Detail & Related papers (2023-03-03T22:19:22Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z) - An Adaptive Black-box Backdoor Detection Method for Deep Neural Networks [25.593824693347113]
Deep Neural Networks (DNNs) have demonstrated unprecedented performance across various fields such as medical diagnosis and autonomous driving.
They are identified to be vulnerable to Neural Trojan (NT) attacks that are controlled and activated by stealthy triggers.
We propose a robust and adaptive Trojan detection scheme that inspects whether a pre-trained model has been Trojaned before its deployment.
arXiv Detail & Related papers (2022-04-08T23:41:19Z) - Practical Detection of Trojan Neural Networks: Data-Limited and
Data-Free Cases [87.69818690239627]
We study the problem of the Trojan network (TrojanNet) detection in the data-scarce regime.
We propose a data-limited TrojanNet detector (TND), when only a few data samples are available for TrojanNet detection.
In addition, we propose a data-free TND, which can detect a TrojanNet without accessing any data samples.
arXiv Detail & Related papers (2020-07-31T02:00:38Z) - Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and trains the model to act maliciously only for samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector that is based on the analysis of the intrinsic properties; that are affected due to the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z) - An Embarrassingly Simple Approach for Trojan Attack in Deep Neural
Networks [59.42357806777537]
trojan attack aims to attack deployed deep neural networks (DNNs) relying on hidden trigger patterns inserted by hackers.
We propose a training-free attack approach which is different from previous work, in which trojaned behaviors are injected by retraining model on a poisoned dataset.
The proposed TrojanNet has several nice properties including (1) it activates by tiny trigger patterns and keeps silent for other signals, (2) it is model-agnostic and could be injected into most DNNs, dramatically expanding its attack scenarios, and (3) the training-free mechanism saves massive training efforts compared to conventional trojan attack methods.
arXiv Detail & Related papers (2020-06-15T04:58:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.