Related papers: "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

URL: http://arxiv.org/abs/2505.14226v3
Date: Sat, 11 Oct 2025 13:22:55 GMT
Title: "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs
Authors: Darpan Aswal, Siddharth D Jaiswal,
Abstract summary: We introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks.<n>Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts.<n>Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success.
Score: 1.2891210250935148
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently released LLMs have strong multilingual \& multimodal capabilities. Model vulnerabilities are exposed using audits and red-teaming efforts. Existing efforts have focused primarily on the English language; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially for multimodal contexts. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also present an extension to a current jailbreak-template-based strategy and propose a novel template, showing higher effectiveness than baselines. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. We achieve a 99\% Attack Success Rate for text generation and 78\% for image generation, with Attack Relevance Rate of 100\% for text generation and 96\% for image generation for the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words. \textit{\textbf{Warning: This paper contains examples of potentially harmful and offensive content.}}

Related papers

Say It Differently: Linguistic Styles as Jailbreak Vectors [0.763334557068953]
We study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models.<n>We construct style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles.<n>Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants.
arXiv Detail & Related papers (2025-11-13T17:24:38Z)
HAMSA: Hijacking Aligned Compact Models via Stealthy Automation [3.7898376145698744]
Large Language Models (LLMs) are susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts.<n>We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs.<n>We evaluate our method on benchmarks in English (In-The-Wild Jailbreak Prompts on LLMs), and a newly curated Arabic one derived from In-The-Wild Jailbreak Prompts on LLMs and annotated by native Arabic linguists.
arXiv Detail & Related papers (2025-08-22T15:57:57Z)
JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering [73.962469626788]
Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus.<n>We propose JPS, underlineJailbreak MLLMs with collaborative visual underlinePerturbation and textual underlineSteering.
arXiv Detail & Related papers (2025-08-07T07:14:01Z)
Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework [6.002582335323663]
We present an adversarial attack targeting the speech input of aligned Multimodal Large Language Models (MLLMs) in a white box scenario.<n>We introduce a novel token level attack that leverages access to the model's speech tokenization to generate adversarial token sequences.<n>Our approach achieves up to 89 percent attack success rate across multiple restricted tasks.
arXiv Detail & Related papers (2025-05-24T20:46:36Z)
Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models [20.99874786089634]
Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision.<n>We propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via at least significant bit steganography.<n>On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
arXiv Detail & Related papers (2025-05-22T09:34:47Z)
Multilingual and Multi-Accent Jailbreaking of Audio LLMs [19.5428160851918]
Multi-AudioJail is the first systematic framework to exploit multilingual and multi-accent audio jailbreaks.<n>We show how acoustic perturbations interact with cross-lingual phonetics to cause jailbreak success rates to surge.<n>We plan to release our dataset to spur research into cross-modal defenses.
arXiv Detail & Related papers (2025-04-01T18:12:23Z)
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models [18.06388944779541]
"jailbreaking" is the use of large language models to trigger unintended behaviors.<n>We propose a novel method to balance the jailbreak attack success rate with semantic coherence.<n>Our method is superior to state-of-the-art baselines in attack effectiveness.
arXiv Detail & Related papers (2025-02-17T02:49:26Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input. An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs)<n>Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.<n> Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z)
Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs) A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP) Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z)
Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs) We consider two potential risky scenarios: unintentional and intentional. We propose a novel textscSelf-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.