The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

URL: http://arxiv.org/abs/2602.02557v1
Date: Fri, 30 Jan 2026 14:23:50 GMT
Title: The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models
Authors: Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi
Abstract summary: Cross-modality transfer of jailbreak attacks from text to audio is underexplored. We show that text-transferred audio jailbreaks perform comparably to, and often better than, audio-based jailbreaks.
Score: 45.318255366335194
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in end-to-end trained omni-models have significantly improved multimodal understanding. At the same time, safety red-teaming has expanded beyond text to encompass audio-based jailbreak attacks. However, an important bridge between textual and audio jailbreaks remains underexplored. In this work, we study the cross-modality transfer of jailbreak attacks from text to audio, motivated by the semantic similarity between the two modalities and the maturity of textual jailbreak methods. We first analyze the connection between modality alignment and cross-modality jailbreak transfer, showing that strong alignment can inadvertently propagate textual vulnerabilities to the audio modality, which we term the alignment curse. Guided by this analysis, we conduct an empirical evaluation of textual jailbreaks, text-transferred audio jailbreaks, and existing audio-based jailbreaks on recent omni-models. Our results show that text-transferred audio jailbreaks perform comparably to, and often better than, audio-based jailbreaks, establishing them as simple yet powerful baselines for future audio red-teaming. We further demonstrate strong cross-model transferability and show that text-transferred audio attacks remain effective even under a stricter audio-only access threat model.

Related papers

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models [32.08069972778743]
Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. Multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. We propose Universal and Transferable Jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space.
arXiv Detail & Related papers (2026-02-01T05:18:47Z)
LLM Jailbreak Detection for (Almost) Free! [62.466970731998714]
Large language models (LLMs) are aligned to improve security as they become widely used, but they remain susceptible to jailbreak attacks. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. We propose Free Jailbreak Detection (FJD), which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts.
arXiv Detail & Related papers (2025-09-18T02:42:52Z)
Many-Turn Jailbreaking [65.04921693379944]
We propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than a single target query. We construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models.
arXiv Detail & Related papers (2025-08-09T00:02:39Z)
Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models [19.373533532464915]
We introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We use this dataset to evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs.
arXiv Detail & Related papers (2025-05-21T11:47:47Z)
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models [19.59499038333469]
Jailbreak attacks on large audio-language models (LALMs) have been studied recently, but they achieve suboptimal effectiveness, applicability, and practicability. We propose AudioJailbreak, a novel audio jailbreak attack featuring asynchrony, universality, stealthiness, and over-the-air robustness.
arXiv Detail & Related papers (2025-05-20T09:10:45Z)
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models [35.884976768636726]
Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. These advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks.
arXiv Detail & Related papers (2025-01-23T15:51:38Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models [4.547063832007314]
It is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other semantically-dissimilar classes. We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model's perception of prompt harmfulness.
arXiv Detail & Related papers (2024-06-13T16:26:47Z)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques does not adequately address. JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models. AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.