Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game
- URL: http://arxiv.org/abs/2404.02532v1
- Date: Wed, 3 Apr 2024 07:43:11 GMT
- Title: Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game
- Authors: Qianqiao Xu, Zhiliang Tian, Hongyan Wu, Zhen Huang, Yiping Song, Feng Liu, Dongsheng Li,
- Abstract summary: Malicious attackers induce large models to jailbreak and generate information containing illegal, privacy-invasive information.
Large models counter malicious attackers' attacks using techniques such as safety alignment.
We propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both safely reply to the attacker and hide the defense intent.
- Score: 28.33029508522531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the enhanced performance of large models on natural language processing tasks, potential moral and ethical issues of large models arise. There exist malicious attackers who induce large models to jailbreak and generate information containing illegal, privacy-invasive information through techniques such as prompt engineering. As a result, large models counter malicious attackers' attacks using techniques such as safety alignment. However, the strong defense mechanism of the large model through rejection replies is easily identified by attackers and used to strengthen attackers' capabilities. In this paper, we propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both safely reply to the attacker and hide the defense intent. First, we construct a multi-agent framework to simulate attack and defense scenarios, playing different roles to be responsible for attack, disguise, safety evaluation, and disguise evaluation tasks. After that, we design attack and disguise game algorithms to optimize the game strategies of the attacker and the disguiser and use the curriculum learning process to strengthen the capabilities of the agents. The experiments verify that the method in this paper is more effective in strengthening the model's ability to disguise the defense intent compared with other methods. Moreover, our approach can adapt any black-box large model to assist the model in defense and does not suffer from model version iterations.
Related papers
- Versatile Defense Against Adversarial Attacks on Image Recognition [2.9980620769521513]
Defending against adversarial attacks in a real-life setting can be compared to the way antivirus software works.
It appears that a defense method based on image-to-image translation may be capable of this.
The trained model has successfully improved the classification accuracy from nearly zero to an average of 86%.
arXiv Detail & Related papers (2024-03-13T01:48:01Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Defense Against Model Extraction Attacks on Recommender Systems [53.127820987326295]
We introduce Gradient-based Ranking Optimization (GRO) to defend against model extraction attacks on recommender systems.
GRO aims to minimize the loss of the protected target model while maximizing the loss of the attacker's surrogate model.
Results show GRO's superior effectiveness in defending against model extraction attacks.
arXiv Detail & Related papers (2023-10-25T03:30:42Z) - Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review [15.179940846141873]
Applicating third-party data and models has become a new paradigm for language modeling in NLP.
backdoor attacks can induce the model to exhibit expected behaviors through specific triggers.
There is still no systematic and comprehensive review to reflect the security challenges, attacker's capabilities, and purposes.
arXiv Detail & Related papers (2023-09-12T08:48:38Z) - Efficient Defense Against Model Stealing Attacks on Convolutional Neural
Networks [0.548924822963045]
Model stealing attacks can lead to intellectual property theft and other security and privacy risks.
Current state-of-the-art defenses against model stealing attacks suggest adding perturbations to the prediction probabilities.
We propose a simple yet effective and efficient defense alternative.
arXiv Detail & Related papers (2023-09-04T22:25:49Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z) - MultiRobustBench: Benchmarking Robustness Against Multiple Attacks [86.70417016955459]
We present the first unified framework for considering multiple attacks against machine learning (ML) models.
Our framework is able to model different levels of learner's knowledge about the test-time adversary.
We evaluate the performance of 16 defended models for robustness against a set of 9 different attack types.
arXiv Detail & Related papers (2023-02-21T20:26:39Z) - I Know What You Trained Last Summer: A Survey on Stealing Machine
Learning Models and Defences [0.1031296820074812]
We study model stealing attacks, assessing their performance and exploring corresponding defence techniques in different settings.
We propose a taxonomy for attack and defence approaches, and provide guidelines on how to select the right attack or defence based on the goal and available resources.
arXiv Detail & Related papers (2022-06-16T21:16:41Z) - Widen The Backdoor To Let More Attackers In [24.540853975732922]
We investigate the scenario of a multi-agent backdoor attack, where multiple non-colluding attackers craft and insert triggered samples in a shared dataset.
We discover a clear backfiring phenomenon: increasing the number of attackers shrinks each attacker's attack success rate.
We then exploit this phenomenon to minimize the collective ASR of attackers and maximize defender's robustness accuracy.
arXiv Detail & Related papers (2021-10-09T13:53:57Z) - Learning to Attack: Towards Textual Adversarial Attacking in Real-world
Situations [81.82518920087175]
Adversarial attacking aims to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z) - Adversarial Imitation Attack [63.76805962712481]
A practical adversarial attack should require as little as possible knowledge of attacked models.
Current substitute attacks need pre-trained models to generate adversarial examples.
In this study, we propose a novel adversarial imitation attack.
arXiv Detail & Related papers (2020-03-28T10:02:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.