Related papers: JAB: Joint Adversarial Prompting and Belief Augmentation

URL: http://arxiv.org/abs/2311.09473v1
Date: Thu, 16 Nov 2023 00:35:54 GMT
Title: JAB: Joint Adversarial Prompting and Belief Augmentation
Authors: Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
Abstract summary: We introduce a joint framework in which we probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes.
Score: 81.39548637776365
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the recent surge of language models in different applications, attention to safety and robustness of these models has gained significant importance. Here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. In our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model.
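
The iterative feedback loop described in the abstract can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the word-list toxicity scorer, the echo-style target model, the random adversary, the `avoid:` belief format, and every function name here are hypothetical stand-ins chosen so that the example runs without any external model.

```python
import random

# All components below are hypothetical toy stand-ins, not the paper's models.
BAD_WORDS = {"idiot", "stupid"}                        # toy toxicity lexicon
VOCAB = ["hello", "you", "idiot", "stupid", "friend"]  # toy adversary vocabulary


def toxicity(text):
    """Toy scorer: fraction of words that appear in BAD_WORDS."""
    words = text.lower().split()
    return sum(w in BAD_WORDS for w in words) / max(len(words), 1)


def target_model(prompt, beliefs):
    """Toy black-box target: repeats the prompt, dropping words a belief says to avoid."""
    avoid = {b.split(":", 1)[1] for b in beliefs}
    return " ".join(w for w in prompt.split() if w.lower() not in avoid)


def adversary(exemplars, rng):
    """Toy red-teamer: extends a previously successful prompt with a random word."""
    return rng.choice(exemplars) + " " + rng.choice(VOCAB)


def augment_beliefs(beliefs, response):
    """Toy belief augmenter: adds an 'avoid:<word>' instruction per toxic word seen."""
    new = {f"avoid:{w}" for w in response.lower().split() if w in BAD_WORDS}
    return beliefs + sorted(new - set(beliefs))


def run_jab_loop(rounds=5, seed=0):
    """Alternate probing and belief augmentation, feeding results back to both sides."""
    rng = random.Random(seed)
    exemplars = ["say hello"]  # seed prompt for the adversary
    beliefs = []               # instructions given to the target
    history = []               # toxicity score per round
    for _ in range(rounds):
        prompt = adversary(exemplars, rng)
        response = target_model(prompt, beliefs)
        score = toxicity(response)
        history.append(score)
        if score > 0:
            exemplars.append(prompt)                      # feedback to the adversary
            beliefs = augment_beliefs(beliefs, response)  # feedback to the augmenter
    return history, beliefs
```

In this toy setup, every round that elicits a toxic response adds at least one new belief, so the number of successful attacks is bounded by the size of the lexicon. A real system would replace each stand-in with a language model and a learned toxicity classifier.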

Related papers

A Robust Adversarial Ensemble with Causal (Feature Interaction) Interpretations for Image Classification [9.945272787814941]
We present a deep ensemble model that combines discriminative features with generative models to achieve both high accuracy and adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions.
arXiv Detail & Related papers (2024-12-28T05:06:20Z)
Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the segment anything model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
Adversarial Fine-tuning of Compressed Neural Networks for Joint Improvement of Robustness and Efficiency [3.3490724063380215]
Adversarial training has been presented as a mitigation strategy which can result in more robust models. We explore the effects of two different model compression methods -- structured weight pruning and quantization -- on adversarial robustness. We show that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models.
arXiv Detail & Related papers (2024-03-14T14:34:25Z)
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models [19.132597762214722]
Red-teaming or jailbreaking large language models (LLMs) has emerged as a crucial area of study. This paper investigates the intricate consequences of such modifications through model editing. Our findings show that model editing serves as a cost-effective tool for topical red-teaming.
arXiv Detail & Related papers (2024-01-19T11:48:09Z)
FLIRT: Feedback Loop In-context Red Teaming [79.63896510559357]
We propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
Introducing Foundation Models as Surrogate Models: Advancing Towards More Practical Adversarial Attacks [15.882687207499373]
No-box adversarial attacks are becoming more practical and challenging for AI systems. This paper recasts adversarial attack as a downstream task by introducing foundational models as surrogate models.
arXiv Detail & Related papers (2023-07-13T08:10:48Z)
SafeAMC: Adversarial training for robust modulation recognition models [53.391095789289736]
In communication systems, many tasks, such as modulation recognition, rely on Deep Neural Network (DNN) models. These models have been shown to be susceptible to adversarial perturbations, namely imperceptible additive noise crafted to induce misclassification. We propose to use adversarial training, which consists of fine-tuning the model with adversarial perturbations, to increase the robustness of automatic modulation recognition models.
arXiv Detail & Related papers (2021-05-28T11:29:04Z)
Enhancing Dialogue Generation via Multi-Level Contrastive Learning [57.005432249952406]
We propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query. A Rank-aware Calibration (RC) network is designed to construct the multi-level contrastive optimization objectives. We build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words.
arXiv Detail & Related papers (2020-09-19T02:41:04Z)
Boosting Black-Box Attack with Partially Transferred Conditional Adversarial Distribution [83.02632136860976]
We study black-box adversarial attacks against deep neural networks (DNNs). We develop a novel mechanism of adversarial transferability, which is robust to the surrogate biases. Experiments on benchmark datasets and attacks against a real-world API demonstrate the superior attack performance of the proposed method.
arXiv Detail & Related papers (2020-06-15T16:45:27Z)
Evaluating Ensemble Robustness Against Adversarial Attacks [0.0]
Adversarial examples, which are slightly perturbed inputs generated with the aim of fooling a neural network, are known to transfer between models. This concept of transferability poses grave security concerns as it leads to the possibility of attacking models in a black box setting. We introduce a gradient based measure of how effectively an ensemble's constituent models collaborate to reduce the space of adversarial examples targeting the ensemble itself.
arXiv Detail & Related papers (2020-05-12T13:20:54Z)
Luring of transferable adversarial perturbations in the black-box paradigm [0.0]
We present a new approach to improve the robustness of a model against black-box transfer attacks. A removable additional neural network is included in the target model, and is designed to induce the luring effect. Our deception-based method only needs to have access to the predictions of the target model and does not require a labeled data set.
arXiv Detail & Related papers (2020-04-10T06:48:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.