Automated Adversarial Discovery for Safety Classifiers
- URL: http://arxiv.org/abs/2406.17104v1
- Date: Mon, 24 Jun 2024 19:45:12 GMT
- Title: Automated Adversarial Discovery for Safety Classifiers
- Authors: Yash Kumar Lal, Preethi Lahoti, Aradhana Sinha, Yao Qin, Ananth Balashankar
- Abstract summary: We formalize the task of automated adversarial discovery for safety classifiers.
Our evaluation of existing attack generation methods on the CivilComments toxicity task reveals their limitations.
Even our best-performing prompt-based method finds successful attacks along previously unseen harm dimensions only 5% of the time.
- Score: 10.61889194493287
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety classifiers are critical in mitigating toxicity on online forums such as social media and in chatbots. Still, they continue to be vulnerable to emergent, and often innumerable, adversarial attacks. Traditional automated adversarial data generation methods, however, tend to produce attacks that are not diverse, but rather variations of previously observed harm types. We formalize the task of automated adversarial discovery for safety classifiers: to find new attacks along previously unseen harm dimensions that expose new weaknesses in the classifier. We measure progress on this task along two key axes: (1) adversarial success: does the attack fool the classifier? and (2) dimensional diversity: does the attack represent a previously unseen harm type? Our evaluation of existing attack generation methods on the CivilComments toxicity task reveals their limitations: word perturbation attacks fail to fool classifiers, while prompt-based LLM attacks have more adversarial success but lack dimensional diversity. Even our best-performing prompt-based method finds successful attacks along previously unseen harm dimensions only 5% of the time. Automatically finding new harmful dimensions of attack is crucial, and there is substantial headroom for future research on our new task.
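To make the two evaluation axes concrete, the sketch below scores a single candidate attack for adversarial success and dimensional diversity. It is a minimal illustration, not the authors' code: `toxicity_classifier`, `label_harm_dimension`, and `seen_dimensions` are hypothetical stand-ins for the safety classifier under test, a harm-dimension labeler (e.g., an LLM judge), and the set of harm types already covered by existing attack data.

```python
# Minimal sketch of the two evaluation axes described in the abstract.
# `toxicity_classifier` and `label_harm_dimension` are hypothetical stand-ins.

from typing import Callable, Set


def evaluate_attack(
    attack_text: str,
    toxicity_classifier: Callable[[str], float],   # returns P(toxic)
    label_harm_dimension: Callable[[str], str],     # e.g. "insult", "threat", ...
    seen_dimensions: Set[str],
    toxic_threshold: float = 0.5,
) -> dict:
    """Score one candidate attack along the paper's two axes."""
    # Axis 1: adversarial success -- the harmful text slips past the
    # safety classifier (it is not flagged as toxic).
    fools_classifier = toxicity_classifier(attack_text) < toxic_threshold

    # Axis 2: dimensional diversity -- the attack exercises a harm
    # dimension not already represented in the seen set.
    dimension = label_harm_dimension(attack_text)
    is_new_dimension = dimension not in seen_dimensions

    return {
        "adversarial_success": fools_classifier,
        "dimensional_diversity": is_new_dimension,
        # Attacks satisfying both are the ones the paper counts as wins.
        "diverse_and_successful": fools_classifier and is_new_dimension,
    }
```

Averaging the `diverse_and_successful` flag over a batch of generated attacks would approximate the joint rate that the abstract reports at roughly 5% for the best prompt-based method.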
Related papers
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [88.96201324719205]
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions.
We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory.
arXiv Detail & Related papers (2024-10-14T16:41:49Z)
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
- A Dual-Tier Adaptive One-Class Classification IDS for Emerging Cyberthreats [3.560574387648533]
We propose a one-class classification-driven IDS system structured on two tiers.
The first tier distinguishes between normal activities and attacks/threats, while the second tier determines if the detected attack is known or unknown.
This model not only identifies unseen attacks but also clusters them and uses the resulting clusters for retraining (see the two-tier sketch after this list).
arXiv Detail & Related papers (2024-03-17T12:26:30Z)
- Understanding the Vulnerability of Skeleton-based Human Activity Recognition via Black-box Attack [53.032801921915436]
Human Activity Recognition (HAR) has been employed in a wide range of applications, e.g., self-driving cars.
Recently, the robustness of skeleton-based HAR methods has been questioned due to their vulnerability to adversarial attacks.
We show such threats exist, even when the attacker only has access to the input/output of the model.
We propose the very first black-box adversarial attack approach in skeleton-based HAR called BASAR.
arXiv Detail & Related papers (2022-11-21T09:51:28Z)
- Preserving Semantics in Textual Adversarial Attacks [0.0]
Up to 70% of adversarial examples generated by existing attacks should be discarded because they do not preserve semantics.
We propose a new, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE).
Our method outperforms existing sentence encoders used in adversarial attacks, achieving a 1.2x - 5.1x better real attack success rate (a semantic-filtering sketch follows the list below).
arXiv Detail & Related papers (2022-11-08T12:40:07Z)
- Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
- Defending Black-box Skeleton-based Human Activity Classifiers [38.95979614080714]
In this paper, we investigate skeleton-based Human Activity Recognition, an important type of time-series data that remains under-explored with respect to defenses against adversarial attacks.
We name our framework Bayesian Energy-based Adversarial Training, or BEAT. BEAT is straightforward yet elegant, turning vulnerable black-box classifiers into robust ones without sacrificing accuracy.
arXiv Detail & Related papers (2022-03-09T13:46:10Z)
- ROOM: Adversarial Machine Learning Attacks Under Real-Time Constraints [3.042299765078767]
This paper introduces a new problem: how do we generate adversarial noise under real-time constraints to support real-time adversarial attacks?
We show how an offline component serves to warm up the online algorithm, making it possible to generate highly successful attacks under time constraints.
arXiv Detail & Related papers (2022-01-05T14:03:26Z)
- Universal Adversarial Attacks with Natural Triggers for Text Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
- Adversarial Fooling Beyond "Flipping the Label" [54.23547006072598]
CNNs show near-human or better-than-human performance in many critical tasks, yet they remain vulnerable to adversarial attacks.
These attacks are potentially dangerous in real-life deployments.
We present a comprehensive analysis of several important adversarial attacks over a set of distinct CNN architectures.
arXiv Detail & Related papers (2020-04-27T13:21:03Z)
- Temporal Sparse Adversarial Attack on Sequence-based Gait Recognition [56.844587127848854]
We demonstrate that the state-of-the-art gait recognition model is vulnerable to such attacks.
We employ a generative adversarial network based architecture to semantically generate adversarial high-quality gait silhouettes or video frames.
The experimental results show that if only one-fortieth of the frames are attacked, the accuracy of the target model drops dramatically.
arXiv Detail & Related papers (2020-02-22T10:08:42Z)
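For the dual-tier one-class IDS entry above, here is a minimal sketch of how a two-tier pipeline could separate normal activity from attacks and then cluster unrecognized attacks for later retraining. It assumes scikit-learn's IsolationForest and KMeans with synthetic feature matrices purely for illustration; it is not the paper's implementation.

```python
# Illustrative two-tier one-class pipeline (not the paper's implementation).
# Tier 1 separates normal activity from attacks; Tier 2 checks whether a
# detected attack resembles known attacks; unknown attacks are clustered
# so they can later be labeled and folded back into training.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical feature matrices (rows are flow/feature vectors).
X_normal = rng.normal(0.0, 1.0, size=(500, 8))          # benign activity
X_known_attacks = rng.normal(4.0, 1.0, size=(200, 8))    # previously seen attacks
X_incoming = rng.normal(4.0, 2.0, size=(50, 8))          # new traffic to score

# Tier 1: one-class model trained only on normal activity.
tier1 = IsolationForest(random_state=0).fit(X_normal)
is_attack = tier1.predict(X_incoming) == -1               # -1 marks outliers

# Tier 2: one-class model trained only on known attacks.
tier2 = IsolationForest(random_state=0).fit(X_known_attacks)
attacks = X_incoming[is_attack]
is_unknown = tier2.predict(attacks) == -1                 # outlier to known attacks too

# Cluster the unknown attacks; each cluster can seed retraining data.
unknown = attacks[is_unknown]
if len(unknown) >= 2:
    labels = KMeans(n_clusters=min(3, len(unknown)), n_init=10,
                    random_state=0).fit_predict(unknown)
    print("unknown-attack cluster sizes:", np.bincount(labels))
```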
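For the Preserving Semantics in Textual Adversarial Attacks entry above, the following sketch shows the general idea of discarding adversarial candidates that drift away from the original sentence's meaning. It uses a generic sentence-transformers model and a hand-picked similarity threshold as stand-ins; SPE itself is a supervised encoder and is not reproduced here.

```python
# Sketch of semantics-preservation filtering for adversarial candidates.
# A generic sentence encoder stands in for SPE; the threshold is illustrative.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in, not SPE
SIMILARITY_THRESHOLD = 0.8                          # illustrative cutoff


def filter_semantic_candidates(original: str, candidates: list[str]) -> list[str]:
    """Keep only candidates whose embedding stays close to the original."""
    emb_orig = encoder.encode(original, convert_to_tensor=True)
    emb_cand = encoder.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(emb_orig, emb_cand)[0]
    return [c for c, s in zip(candidates, sims) if float(s) >= SIMILARITY_THRESHOLD]


# Example: drop perturbed sentences that no longer preserve the meaning.
kept = filter_semantic_candidates(
    "You are not welcome here.",
    ["You are not accepted here.", "The weather is nice today."],
)
print(kept)
```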