Antelope: Potent and Concealed Jailbreak Attack Strategy
- URL: http://arxiv.org/abs/2412.08156v1
- Date: Wed, 11 Dec 2024 07:22:51 GMT
- Title: Antelope: Potent and Concealed Jailbreak Attack Strategy
- Authors: Xin Zhao, Xiaojun Chen, Haoyu Gao,
- Abstract summary: Antelope is a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models.
We successfully exploit the transferability of model-based attacks to penetrate online black-box services.
- Score: 7.970002819722513
- License:
- Abstract: Due to the remarkable generative potential of diffusion-based models, numerous researches have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, facilitates searches in the semantically adjacent space of these related concepts and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. Besides, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.
Related papers
- Evaluating the Robustness of LiDAR Point Cloud Tracking Against Adversarial Attack [6.101494710781259]
We introduce a unified framework for conducting adversarial attacks within the context of 3D object tracking.
In addressing black-box attack scenarios, we introduce a novel transfer-based approach, the Target-aware Perturbation Generation (TAPG) algorithm.
Our experimental findings reveal a significant vulnerability in advanced tracking methods when subjected to both black-box and white-box attacks.
arXiv Detail & Related papers (2024-10-28T10:20:38Z) - You Know What I'm Saying: Jailbreak Attack via Implicit Reference [22.520950422702757]
This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR)
AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context.
Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models.
arXiv Detail & Related papers (2024-10-04T18:42:57Z) - The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models? [23.347349690954452]
Vision-Language Models (VLMs) have achieved remarkable performance on a variety of tasks, yet they remain vulnerable to jailbreak attacks.
We provide an information-theoretic framework for understanding the fundamental trade-off between the effectiveness of these attacks and their stealthiness.
We propose an efficient algorithm for detecting non-stealthy jailbreak attacks, offering significant improvements in model robustness.
arXiv Detail & Related papers (2024-10-02T11:40:49Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Principles of Designing Robust Remote Face Anti-Spoofing Systems [60.05766968805833]
This paper sheds light on the vulnerabilities of state-of-the-art face anti-spoofing methods against digital attacks.
It presents a comprehensive taxonomy of common threats encountered in face anti-spoofing systems.
arXiv Detail & Related papers (2024-06-06T02:05:35Z) - Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks [62.036798488144306]
Current defense mainly focuses on the known attacks, but the adversarial robustness to the unknown attacks is seriously overlooked.
We propose an attack-agnostic defense method named Meta Invariance Defense (MID)
We show that MID simultaneously achieves robustness to the imperceptible adversarial perturbations in high-level image classification and attack-suppression in low-level robust image regeneration.
arXiv Detail & Related papers (2024-04-04T10:10:38Z) - Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks [41.531913152661296]
We formulate the problem of targeted adversarial attack on Stable Diffusion and propose a framework to generate adversarial prompts.
Specifically, we design a gradient-based embedding optimization method to craft reliable adversarial prompts that guide stable diffusion to generate specific images.
After obtaining successful adversarial prompts, we reveal the mechanisms that cause the vulnerability of the model.
arXiv Detail & Related papers (2024-01-16T12:15:39Z) - Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme.
Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Temporal Sparse Adversarial Attack on Sequence-based Gait Recognition [56.844587127848854]
We demonstrate that the state-of-the-art gait recognition model is vulnerable to such attacks.
We employ a generative adversarial network based architecture to semantically generate adversarial high-quality gait silhouettes or video frames.
The experimental results show that if only one-fortieth of the frames are attacked, the accuracy of the target model drops dramatically.
arXiv Detail & Related papers (2020-02-22T10:08:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.