Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models
- URL: http://arxiv.org/abs/2410.23861v1
- Date: Thu, 31 Oct 2024 12:11:17 GMT
- Title: Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models
- Authors: Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari
- Abstract summary: We show that open-source audio LMMs suffer an average attack success rate of 69.14% on harmful audio questions.
Our speech-specific jailbreaks on Gemini-1.5-Pro achieve an attack success rate of 70.67% on the harmful query benchmark.
- Score: 50.89022445197919
- Abstract: Large Multimodal Models (LMMs) have demonstrated the ability to interact with humans under real-world conditions by combining Large Language Models (LLMs) and modality encoders to align multimodal information (visual and auditory) with text. However, such models raise new safety challenges of whether models that are safety-aligned on text also exhibit consistent safeguards for multimodal inputs. Despite recent safety-alignment research on vision LMMs, the safety of audio LMMs remains under-explored. In this work, we comprehensively red team the safety of five advanced audio LMMs under three settings: (i) harmful questions in both audio and text formats, (ii) harmful questions in text format accompanied by distracting non-speech audio, and (iii) speech-specific jailbreaks. Our results under these settings demonstrate that open-source audio LMMs suffer an average attack success rate of 69.14% on harmful audio questions, and exhibit safety vulnerabilities when distracted with non-speech audio noise. Our speech-specific jailbreaks on Gemini-1.5-Pro achieve an attack success rate of 70.67% on the harmful query benchmark. We provide insights on what could cause these reported safety-misalignments. Warning: this paper contains offensive examples.
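The attack success rates quoted above are simply the fraction of harmful queries for which the model under test returns a harmful (non-refusing) answer. The sketch below illustrates that evaluation loop for the audio setting; it is a minimal illustration, not the paper's released code, and `synthesize_speech`, `query_audio_lmm`, and `judge_is_harmful` are hypothetical placeholders for a text-to-speech step, the audio LMM under test, and a safety judge.

```python
from typing import Callable, Iterable


def attack_success_rate(
    harmful_queries: Iterable[str],
    synthesize_speech: Callable[[str], bytes],     # hypothetical TTS step: text -> audio bytes
    query_audio_lmm: Callable[[bytes], str],       # hypothetical call to the audio LMM under test
    judge_is_harmful: Callable[[str, str], bool],  # hypothetical safety judge: (query, response) -> bool
) -> float:
    """Fraction of harmful queries that elicit a harmful (non-refusing) response."""
    queries = list(harmful_queries)
    if not queries:
        return 0.0
    successes = 0
    for query in queries:
        audio = synthesize_speech(query)        # setting (i): deliver the harmful question as speech
        response = query_audio_lmm(audio)       # collect the model's answer
        if judge_is_harmful(query, response):   # count the attack as successful if the answer is harmful
            successes += 1
    return successes / len(queries)


# Example: a 69.14% ASR corresponds to roughly 69 harmful responses per 100 harmful audio questions.
```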
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights [50.89022445197919]
We propose a speech-specific risk taxonomy, covering 8 risk categories under hostility (malicious sarcasm and threats), malicious imitation (age, gender, ethnicity), and stereotypical biases (age, gender, ethnicity).
Based on the taxonomy, we create a small-scale dataset for evaluating current LMMs' capability in detecting these categories of risk.
arXiv Detail & Related papers (2024-06-25T10:08:45Z)
- SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models [34.557309967708406]
In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking.
We design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement.
Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics.
arXiv Detail & Related papers (2024-05-14T04:51:23Z)
- Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs).
We propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.
Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z)
- Identifying False Content and Hate Speech in Sinhala YouTube Videos by Analyzing the Audio [0.0]
This study proposes a solution to minimize the spread of violence and misinformation in Sinhala YouTube videos.
The approach involves developing a rating system that assesses whether a video contains false information by comparing the title and description with the audio content.
arXiv Detail & Related papers (2024-01-30T08:08:34Z)
- FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts [14.948652267916149]
We propose FigStep, a jailbreaking algorithm against large vision-language models (VLMs).
Instead of feeding textual harmful instructions directly, FigStep converts the harmful content into images through typography.
FigStep can achieve an average attack success rate of 82.50% on 500 harmful queries in 10 topics.
arXiv Detail & Related papers (2023-11-09T18:59:11Z)
- Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
- DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations [2.8038382295783943]
The choice of words and vocal cues in conversations presents an underexplored rich source of natural language data for personal safety and crime prevention.
We introduce a new information fusion approach that extracts and fuses multi-level features including verbal, vocal, and text as heterogeneous sources of information to detect the extent of violent behaviours in conversations.
arXiv Detail & Related papers (2022-06-23T16:45:50Z)