Related papers: How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

URL: http://arxiv.org/abs/2602.16343v1
Date: Wed, 18 Feb 2026 10:29:07 GMT
Title: How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection
Authors: Yixuan Xiao, Florian Lux, Alejandro Pérez-González-de-Martos, Ngoc Thang Vu,
Abstract summary: Recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker.<n>We examine how different labeling choices affect detection performance and provide insights into labeling strategies.
Score: 60.88800374832363
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Since Text-to-Speech systems typically don't produce waveforms directly, recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are specifically designed for speech synthesis, neural audio codecs were originally developed for compressing audio for storage and transmission. However, their ability to discretize speech also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.

Related papers

A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on Schr"odinger Bridge. We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z)
Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [9.48476556434306]
This paper extends the channel augmentation to work with non-differentiable traditional audio codecs and neural audio codecs. Listening tests demonstrate collaborative watermarking incurs negligible perceptual degradation with high codecs or DAC at 8kbps.
arXiv Detail & Related papers (2024-09-20T10:33:17Z)
Learning Source Disentanglement in Neural Audio Codec [20.335701584949526]
We introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation.<n>By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations.<n> Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space.
arXiv Detail & Related papers (2024-09-17T14:21:02Z)
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [36.61105228468503]
X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization stage.<n>X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications.<n>Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
arXiv Detail & Related papers (2024-08-30T10:24:07Z)
Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz. We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio [53.134423013599914]
We propose a new problem for detecting vocoder fingerprints of fake audio. Experiments are conducted on the datasets synthesized by eight state-of-the-art vocoders.
arXiv Detail & Related papers (2022-08-20T09:23:21Z)
DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the input speech that emulates those defined in conventional vocoders. As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. To extract audio features, we use the log Mel energy features, VGGish embeddings, and a pretrained audio neural network (PANN) embeddings. Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
RawNet: Fast End-to-End Neural Vocoder [4.507860128918788]
RawNet is a complete end-to-end neural vocoder based on the auto-encoder structure for speaker-dependent and -independent speech synthesis. It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner.
arXiv Detail & Related papers (2019-04-10T10:25:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.