ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
- URL: http://arxiv.org/abs/2505.11079v2
- Date: Tue, 08 Jul 2025 09:51:52 GMT
- Title: ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
- Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen,
- Abstract summary: Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. We propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: "Is this audio fake or real?" Experiments demonstrate that our ALLM-based method achieves superior performance in fake audio detection, particularly in data-scarce scenarios.
- Score: 57.29614630309265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a natural question arises: Can ALLMs be leveraged to solve ADD? In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness. To address this, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: "Is this audio fake or real?" We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems. Code is available at https://github.com/ucas-hao/qwen_audio_for_add.git
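The reformulation above casts ADD as audio question answering: the ALLM is prompted alongside the query audio and replies in free text, which then has to be mapped back to a binary label. A minimal sketch of that mapping, assuming simple keyword matching (the function names and matching rules are illustrative, not the paper's implementation):

```python
def build_add_prompt() -> str:
    # The question ALLM4ADD poses alongside the query audio.
    return "Is this audio fake or real?"

def parse_add_answer(answer: str) -> str:
    """Map an ALLM's free-text reply to a binary ADD label.

    Keyword matching is an illustrative assumption; a fine-tuned model
    is expected to answer with one of the two words directly.
    """
    text = answer.strip().lower()
    if "fake" in text:
        return "fake"
    if "real" in text:
        return "real"
    return "unknown"  # reply matched neither keyword
```

In practice the "unknown" branch signals that the model dodged the question, which is one symptom of the zero-shot ineffectiveness the paper reports before supervised fine-tuning.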
Related papers
- Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding [52.04807256534917]
Large Audio-Language Models (LALMs) can hallucinate what is presented in the audio. We introduce Audio-Aware Decoding (AAD) to mitigate the hallucination of LALMs. AAD uses contrastive decoding to compare the token prediction logits with and without the audio context.
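The contrastive-decoding step described above can be sketched as reweighting next-token logits by the difference between an audio-conditioned pass and an audio-free pass. The exact formula and the `alpha` weight below follow a common contrastive-decoding form and are assumptions for illustration, not AAD's published equation:

```python
def audio_aware_logits(logits_with_audio, logits_without_audio, alpha=1.0):
    """Boost tokens whose logits rise when the audio is present and
    suppress tokens favoured only by the text-only prior."""
    return [(1.0 + alpha) * lw - alpha * lo
            for lw, lo in zip(logits_with_audio, logits_without_audio)]
```

For example, a token the text prior prefers but the audio does not gets pushed down, so the decoded text stays grounded in what was actually heard.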
arXiv Detail & Related papers (2025-06-08T17:36:50Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. LALMs excel in general audio understanding, but are limited in temporal reasoning. This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z)
- Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio? [40.38305757279412]
Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs.
This paper investigates the effectiveness of current countermeasures (CMs) against ALM-based deepfake audio.
Our findings reveal that the most recently trained CM can effectively detect ALM-based audio, achieving a 0% equal error rate under most ALM test conditions.
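The 0% figure above refers to the equal error rate (EER), the standard ADD metric: the operating point where the false-acceptance rate (real audio flagged as fake) equals the false-rejection rate (fake audio passed as real). A minimal threshold-sweep sketch of the computation, written here for illustration (not the paper's evaluation code):

```python
def equal_error_rate(scores, labels):
    """EER via a sweep over candidate thresholds.

    scores: higher means more likely fake; labels: 1 = fake, 0 = real.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        # Predict "fake" whenever the score reaches the threshold t.
        far = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / n_real
        frr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t) / n_fake
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A perfectly separating detector (all fake scores above all real scores) yields an EER of 0, which is the result the paper reports for most ALM test conditions.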
arXiv Detail & Related papers (2024-08-20T13:45:34Z)
- The Codecfake Dataset and Countermeasures for the Universal Detection of Deepfake Audio [42.84634652376024]
ALM-based deepfake audio exhibits widespread, high deception, and type versatility. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method. We propose the CSAM strategy to learn a domain balanced and generalized minima.
arXiv Detail & Related papers (2024-05-08T08:28:40Z)
- Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion [70.99781219121803]
Audio Deepfake Detection (ADD) aims to detect the fake audio generated by text-to-speech (TTS), voice conversion (VC) and replay, etc.
We propose a novel ADD model, termed M2S-ADD, that attempts to discover audio authenticity cues during the mono-to-stereo conversion process.
arXiv Detail & Related papers (2023-05-25T02:54:29Z)
- ADD 2022: the First Audio Deep Synthesis Detection Challenge [92.41777858637556]
The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill this gap.
ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF), and audio fake game (FG).
arXiv Detail & Related papers (2022-02-17T03:29:20Z)