Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak
- URL: http://arxiv.org/abs/2501.13772v1
- Date: Thu, 23 Jan 2025 15:51:38 GMT
- Title: Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak
- Authors: Erjia Xiao, Hao Cheng, Jing Shao, Jinhao Duan, Kaidi Xu, Le Yang, Jindong Gu, Renjing Xu
- Abstract summary: This paper investigates how audio-specific edits influence the inference of Large Audio-Language Models (LALMs) with respect to jailbreak.
We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection.
We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits.
- Score: 35.62727804915181
- Abstract: Large Language Models (LLMs) demonstrate remarkable zero-shot performance across various natural language processing tasks. The integration of multimodal encoders extends their capabilities, enabling the development of Multimodal Large Language Models that process vision, audio, and text. However, these capabilities also raise significant security concerns, as these models can be manipulated to generate harmful or inappropriate content through jailbreak attacks. While extensive research has explored the impact of modality-specific input edits on text-based LLMs and Large Vision-Language Models in jailbreak settings, the effects of audio-specific edits on Large Audio-Language Models (LALMs) remain underexplored. This paper addresses that gap by investigating how audio-specific edits influence LALM inference with respect to jailbreak. We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection, and the Edited Audio Datasets (EADs), a comprehensive audio jailbreak benchmark. We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits. This work lays the groundwork for future exploration of audio-modality interactions in LALM security.
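As a rough illustration of the kinds of audio-modality edits the abstract attributes to the AET, the sketch below implements noise injection, a crude tone adjustment, and word emphasis using NumPy. It is a minimal sketch for intuition only, not the authors' AET implementation; the function names (`inject_noise`, `adjust_tone`, `emphasize_span`) and parameter choices are assumptions.

```python
import numpy as np

def inject_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def adjust_tone(audio: np.ndarray, semitones: float = 2.0) -> np.ndarray:
    """Crude tone/pitch shift by resampling; this also changes duration.
    A real toolbox would more likely use a phase-vocoder-based pitch shift."""
    rate = 2.0 ** (semitones / 12.0)
    old_idx = np.arange(len(audio))
    new_idx = np.arange(0.0, len(audio), rate)
    return np.interp(new_idx, old_idx, audio)

def emphasize_span(audio: np.ndarray, start: int, end: int,
                   gain_db: float = 6.0) -> np.ndarray:
    """Boost the samples covering a spoken word to simulate emphasis."""
    out = audio.copy()
    out[start:end] *= 10.0 ** (gain_db / 20.0)
    return np.clip(out, -1.0, 1.0)

if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0.0, 1.0, sr, endpoint=False)
    audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # 1 s tone stands in for speech
    edited = emphasize_span(adjust_tone(inject_noise(audio, snr_db=15.0)), 4000, 8000)
    print(edited.shape)
```

In an evaluation of the kind described above, spoken renderings of jailbreak prompts would be passed through edits like these at varying strengths before being fed to each LALM, and attack success rates compared across edit types.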
Related papers
- "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models [0.9480364746270077]
This paper explores audio jailbreaks targeting Audio-Language Models (ALMs).
We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples.
We analyze how ALMs interpret these audio adversarial examples and reveal them to encode imperceptible first-person toxic speech.
arXiv Detail & Related papers (2025-02-02T08:36:23Z)
- Investigating Decoder-only Large Language Models for Speech-to-text Translation [39.17113782374464]
Large language models (LLMs) are known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains.
We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation.
Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
arXiv Detail & Related papers (2024-07-03T14:42:49Z)
- Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models [49.87432626548563]
We introduce methods to assess the extent of object hallucination of publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
arXiv Detail & Related papers (2024-06-12T16:51:54Z)
- AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
arXiv Detail & Related papers (2024-05-17T21:08:58Z)
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts both audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z)
- AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs [27.122094554340194]
We extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities.
The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation.
arXiv Detail & Related papers (2023-11-12T06:56:14Z)
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Prompting Large Language Models with Speech Recognition Abilities [31.77576008965215]
We extend the capabilities of large language models by directly attaching a small audio encoder, allowing them to perform speech recognition.
Experiments on Multilingual LibriSpeech show that incorporating a conformer encoder into the open-sourced LLaMA-7B allows it to outperform monolingual baselines by 18%.
arXiv Detail & Related papers (2023-07-21T08:39:15Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)