Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection
- URL: http://arxiv.org/abs/2601.23066v1
- Date: Fri, 30 Jan 2026 15:16:43 GMT
- Title: Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection
- Authors: Xiaoxuan Guo, Yuankun Xie, Haonan Cheng, Jiayi Zhou, Jian Liu, Hengyan Huang, Long Ye, Qin Zhang
- Abstract summary: Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods are often biased toward semantically correlated cues. We introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM).
- Score: 23.695892348165497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.
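To make the input-side idea concrete, here is a minimal sketch of pairing a raw waveform with a structured log-mel spectrogram, in the spirit of exposing time-frequency evidence as an explicit input view. This is an illustrative assumption, not the authors' implementation; the file name, sample rate, and STFT settings are placeholders.

```python
import librosa
import numpy as np

def build_dual_view(path: str, sr: int = 16000):
    # Pair the raw waveform with a structured time-frequency view so a
    # downstream audio LLM can attend to fine-grained spectral artifacts.
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=128
    )
    log_mel = librosa.power_to_db(mel, ref=np.max)  # shape (n_mels, n_frames)
    return {"waveform": wav, "spectrogram": log_mel}
```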
Related papers
- Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs [22.8529107367745]
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques. We propose PELM, the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task.
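As an illustration of casting detection and localization as audio question answering, a training sample might look like the snippet below. All field names and the answer template are assumptions for illustration, not PELM's actual data format.

```python
# Hypothetical audio-QA sample for joint editing detection and localization.
example = {
    "audio": "utterance_0001.wav",  # placeholder file name
    "question": "Has this speech been edited? If so, give the edited time span.",
    "answer": "Yes. The segment from 2.31 s to 3.05 s was edited.",
}
print(example["question"])
```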
arXiv Detail & Related papers (2026-01-29T09:39:28Z)
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. These models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
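A contrastive-style objective over paired present/absent examples could look like the sketch below; the margin formulation is an assumption chosen for illustration, not LISTEN's exact loss.

```python
import torch
import torch.nn.functional as F

def present_absent_margin_loss(score_present: torch.Tensor,
                               score_absent: torch.Tensor,
                               margin: float = 1.0) -> torch.Tensor:
    # Push the model's "this sound is present" score for real events
    # above its score for synthesized negatives (absent sounds).
    return F.relu(margin - (score_present - score_absent)).mean()

# Dummy usage: present scores should end up above absent scores.
loss = present_absent_margin_loss(torch.tensor([2.0]), torch.tensor([1.5]))
```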
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
- FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning [9.960675988638805]
We propose a novel framework called fake audio detection with evidential learning (FADEL). FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in out-of-distribution (OOD) scenarios. We demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
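The evidential-learning idea, replacing a softmax point estimate with Dirichlet evidence so uncertainty is explicit, is standard enough to sketch. The code below is the generic construction, not FADEL's specific architecture.

```python
import torch
import torch.nn.functional as F

def evidential_prediction(logits: torch.Tensor):
    evidence = F.softplus(logits)           # non-negative evidence per class
    alpha = evidence + 1.0                  # Dirichlet concentration parameters
    strength = alpha.sum(-1, keepdim=True)  # total Dirichlet strength
    prob = alpha / strength                 # expected class probabilities
    k = logits.shape[-1]
    uncertainty = k / strength              # high when total evidence is low
    return prob, uncertainty

# Dummy two-class (real vs. fake) logits:
p, u = evidential_prediction(torch.tensor([[0.2, 1.3]]))
```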
arXiv Detail & Related papers (2025-04-22T07:40:35Z)
- Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features [5.678610585849838]
Pre-trained deep learning embeddings have consistently shown superior performance over handcrafted acoustic features in speech emotion recognition.
Unlike acoustic features, which have clear physical meaning, these embeddings lack interpretability.
This paper proposes a modified probing approach to explain deep learning embeddings in the speech emotion space.
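A probing setup of this kind typically fits a simple regressor from frozen embeddings to each interpretable acoustic feature and reports goodness of fit. The sketch below uses ridge regression on synthetic stand-in data; the dimensions and target are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 256))     # frozen deep embeddings (dummy)
f0_mean = embeddings @ rng.normal(size=256)  # one interpretable target (dummy)

# Fit the probe on a train split, evaluate on a held-out split.
probe = Ridge(alpha=1.0).fit(embeddings[:400], f0_mean[:400])
print("probe R^2:", r2_score(f0_mean[400:], probe.predict(embeddings[400:])))
```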
arXiv Detail & Related papers (2024-09-14T19:18:56Z)
- Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features [0.353122873734926]
Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity.
Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with Expert Defined Linguistic Features (EDLFs).
Several deepfake detection algorithms have been shown to improve when the traditional and common features of audio data are augmented with EDLFs.
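Feature augmentation of this kind is, at its simplest, concatenation before classification. The sketch below is illustrative only, with dummy arrays standing in for real acoustic features and EDLF annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(200, 40))      # e.g., MFCC statistics (dummy)
edlf = rng.integers(0, 2, size=(200, 10))  # expert-defined labels (dummy)
y = rng.integers(0, 2, size=200)           # bona fide vs. spoofed (dummy)

# Augment the common acoustic features with EDLFs, then classify.
augmented = np.concatenate([acoustic, edlf], axis=1)
clf = LogisticRegression(max_iter=1000).fit(augmented, y)
```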
arXiv Detail & Related papers (2024-09-09T19:47:57Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a major issue for current audio deepfake detectors.
In this paper, we study the potential of large-scale pre-trained models for audio deepfake detection.
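One generic way to use pre-trained models without any detector training is to score a test clip by its distance to reference embeddings of genuine speech. This is an assumption-laden illustration of the training-free idea, not necessarily the paper's procedure; it presumes embeddings have already been extracted from a large pre-trained model.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fakeness_score(test_emb: np.ndarray, bona_fide_embs: np.ndarray, k: int = 5):
    # Larger mean distance to known genuine speech -> more suspicious.
    nn = NearestNeighbors(n_neighbors=k).fit(bona_fide_embs)
    dist, _ = nn.kneighbors(test_emb.reshape(1, -1))
    return float(dist.mean())
```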
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. GER, however, operates only on the decoded text and ignores the underlying acoustics. In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
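The late-fusion idea can be sketched as entropy-weighted mixing of the LLM's and the acoustic model's token distributions: lean on the acoustic branch when the LLM is uncertain. The weighting rule below is a simplification for illustration, not UADF's exact formulation.

```python
import torch

def uncertainty_aware_fusion(llm_logp: torch.Tensor, ac_logp: torch.Tensor):
    # Both inputs: log-probabilities over the token vocabulary, shape (V,).
    p = llm_logp.exp()
    entropy = -(p * llm_logp).sum()
    max_entropy = torch.log(torch.tensor(float(llm_logp.numel())))
    w = 1.0 - entropy / max_entropy  # trust the LLM less when it is uncertain
    # Heuristic fused scores for decoding (argmax); not renormalized.
    return w * llm_logp + (1.0 - w) * ac_logp
```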
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves performance similar to that of supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings [18.282632348274756]
Phonetic embeddings, extracted from ASR models trained with huge amounts of word-level annotations, can serve as a good representation of the content of input speech.
We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly to build a more powerful mispronunciation detection and diagnosis (MD&D) system.
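Joint use of the three embedding streams reduces, at its simplest, to concatenation followed by a classifier head. The sketch below illustrates that pattern; the dimensions and class count are arbitrary placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class APLClassifier(nn.Module):
    # Illustrative head over concatenated acoustic/phonetic/linguistic embeddings.
    def __init__(self, d_a: int = 256, d_p: int = 128, d_l: int = 768,
                 n_classes: int = 40):
        super().__init__()
        self.head = nn.Linear(d_a + d_p + d_l, n_classes)

    def forward(self, acoustic, phonetic, linguistic):
        return self.head(torch.cat([acoustic, phonetic, linguistic], dim=-1))
```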
arXiv Detail & Related papers (2021-10-14T11:25:02Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces CRSS-Forensics, an audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network on the VoxCeleb data, then fine-tune part of the high-level network layers with clean speech from CRSS-Forensics.
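Fine-tuning only the high-level layers amounts to freezing the earlier parameters before training resumes. The sketch below shows the generic PyTorch pattern; the choice of split point is an assumption, not the paper's exact layer boundary.

```python
import torch.nn as nn

def freeze_all_but_top(model: nn.Module, n_trainable: int = 2):
    # Freeze every child module except the last n_trainable ones,
    # so only the high-level layers are updated during fine-tuning.
    children = list(model.children())
    for module in children[:len(children) - n_trainable]:
        for param in module.parameters():
            param.requires_grad = False
```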
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.