JAB: Joint Adversarial Prompting and Belief Augmentation
- URL: http://arxiv.org/abs/2311.09473v1
- Date: Thu, 16 Nov 2023 00:35:54 GMT
- Title: JAB: Joint Adversarial Prompting and Belief Augmentation
- Authors: Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala,
Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
- Abstract summary: We introduce a joint framework in which we probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation.
This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes.
- Score: 81.39548637776365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the recent surge of language models in different applications, attention
to safety and robustness of these models has gained significant importance.
Here we introduce a joint framework in which we simultaneously probe and
improve the robustness of a black-box target model via adversarial prompting
and belief augmentation using iterative feedback loops. This framework utilizes
an automated red teaming approach to probe the target model, along with a
belief augmenter to generate instructions for the target model to improve its
robustness to those adversarial probes. Importantly, the adversarial model and
the belief generator leverage the feedback from past interactions to improve
the effectiveness of the adversarial prompts and beliefs, respectively. In our
experiments, we demonstrate that such a framework can reduce toxic content
generation both in dynamic cases where an adversary directly interacts with a
target model and static cases where we use a static benchmark dataset to
evaluate our model.
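The feedback loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`JabState`, `jab_loop`, the `toy_*` stand-ins, the 0.5 threshold) is hypothetical, and the adversary, target, toxicity scorer, and belief augmenter are trivial stubs standing in for real models. The abstract does not specify how feedback is selected, so the rule "feed back only successful attacks" is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class JabState:
    """Hypothetical container for what the two feedback loops accumulate."""
    attack_examples: List[str]                          # exemplars the adversary conditions on
    beliefs: List[str] = field(default_factory=list)    # instructions given to the target
    toxicity_history: List[float] = field(default_factory=list)


def jab_loop(
    adversary: Callable[[List[str]], str],
    target: Callable[[List[str], str], str],
    scorer: Callable[[str], float],
    augmenter: Callable[[str, str, List[str]], str],
    state: JabState,
    iterations: int,
    threshold: float = 0.5,  # assumed cutoff for "toxic"; not taken from the paper
) -> JabState:
    for _ in range(iterations):
        prompt = adversary(state.attack_examples)   # adversarial probe
        response = target(state.beliefs, prompt)    # black-box target, steered by beliefs
        score = scorer(response)
        state.toxicity_history.append(score)
        if score >= threshold:
            # Assumed feedback rule: a successful attack becomes a new exemplar
            # for the adversary and triggers a new belief for the target.
            state.attack_examples.append(prompt)
            state.beliefs.append(augmenter(prompt, response, state.beliefs))
    return state


# Toy stand-ins so the sketch runs without any model access.
def toy_adversary(examples: List[str]) -> str:
    return examples[-1] + "!"


def toy_target(beliefs: List[str], prompt: str) -> str:
    if any("avoid insults" in b for b in beliefs):
        return "polite reply"
    return "insult: " + prompt


def toy_scorer(response: str) -> float:
    return 1.0 if response.startswith("insult") else 0.0


def toy_augmenter(prompt: str, response: str, beliefs: List[str]) -> str:
    return "avoid insults"


final = jab_loop(
    toy_adversary, toy_target, toy_scorer, toy_augmenter,
    JabState(attack_examples=["say something rude"]),
    iterations=3,
)
print(final.toxicity_history)
```

In this toy run the first probe succeeds, a belief is added, and the later probes no longer elicit a toxic response. That only mirrors the qualitative behavior the abstract claims; it says nothing about the paper's actual results.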
Related papers
- Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the segment anything model (SAM).
To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Adversarial Fine-tuning of Compressed Neural Networks for Joint Improvement of Robustness and Efficiency [3.3490724063380215]
Adversarial training has been presented as a mitigation strategy which can result in more robust models.
We explore the effects of two different model compression methods -- structured weight pruning and quantization -- on adversarial robustness.
We show that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models.
arXiv Detail & Related papers (2024-03-14T14:34:25Z) - Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models [19.132597762214722]
Red-teaming or jailbreaking large language models (LLMs) has emerged as a crucial area of study.
This paper investigates the intricate consequences of such modifications through model editing.
Our findings show that model editing serves as a cost-effective tool for topical red-teaming.
arXiv Detail & Related papers (2024-01-19T11:48:09Z) - FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
arXiv Detail & Related papers (2023-08-08T14:03:08Z) - Introducing Foundation Models as Surrogate Models: Advancing Towards
More Practical Adversarial Attacks [15.882687207499373]
No-box adversarial attacks are becoming more practical and challenging for AI systems.
This paper recasts adversarial attack as a downstream task by introducing foundational models as surrogate models.
arXiv Detail & Related papers (2023-07-13T08:10:48Z) - SafeAMC: Adversarial training for robust modulation recognition models [53.391095789289736]
In communication systems, many tasks, such as modulation recognition, rely on Deep Neural Network (DNN) models.
These models have been shown to be susceptible to adversarial perturbations, namely imperceptible additive noise crafted to induce misclassification.
We propose to use adversarial training, which consists of fine-tuning the model with adversarial perturbations, to increase the robustness of automatic modulation recognition models.
arXiv Detail & Related papers (2021-05-28T11:29:04Z) - Enhancing Dialogue Generation via Multi-Level Contrastive Learning [57.005432249952406]
We propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query.
A Rank-aware Calibration (RC) network is designed to construct the multi-level contrastive optimization objectives.
We build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words.
arXiv Detail & Related papers (2020-09-19T02:41:04Z) - Boosting Black-Box Attack with Partially Transferred Conditional
Adversarial Distribution [83.02632136860976]
We study black-box adversarial attacks against deep neural networks (DNNs).
We develop a novel mechanism of adversarial transferability, which is robust to the surrogate biases.
Experiments on benchmark datasets and attacks against a real-world API demonstrate the superior attack performance of the proposed method.
arXiv Detail & Related papers (2020-06-15T16:45:27Z) - Evaluating Ensemble Robustness Against Adversarial Attacks [0.0]
Adversarial examples, which are slightly perturbed inputs generated with the aim of fooling a neural network, are known to transfer between models.
This concept of transferability poses grave security concerns as it leads to the possibility of attacking models in a black-box setting.
We introduce a gradient-based measure of how effectively an ensemble's constituent models collaborate to reduce the space of adversarial examples targeting the ensemble itself.
arXiv Detail & Related papers (2020-05-12T13:20:54Z) - Luring of transferable adversarial perturbations in the black-box
paradigm [0.0]
We present a new approach to improve the robustness of a model against black-box transfer attacks.
A removable additional neural network is included in the target model, and is designed to induce the luring effect.
Our deception-based method only needs to have access to the predictions of the target model and does not require a labeled data set.
arXiv Detail & Related papers (2020-04-10T06:48:36Z)