Audio Deepfake Detection Based on a Combination of F0 Information and
Real Plus Imaginary Spectrogram Features
- URL: http://arxiv.org/abs/2208.01214v1
- Date: Tue, 2 Aug 2022 02:46:16 GMT
- Title: Audio Deepfake Detection Based on a Combination of F0 Information and
Real Plus Imaginary Spectrogram Features
- Authors: Jun Xue, Cunhang Fan, Zhao Lv, Jianhua Tao, Jiangyan Yi, Chengshi
Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao
- Abstract summary: Experimental results on the ASVspoof 2019 LA dataset show that the proposed system is highly effective for audio deepfake detection, achieving an equal error rate (EER) of 0.43%, which surpasses almost all competing systems.
- Score: 51.924340387119415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, pioneering research has proposed a large number of acoustic
features (log power spectrogram, linear frequency cepstral coefficients,
constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining
good performance and showing that different subbands contribute differently
to audio deepfake detection. However, these works lack an explanation of the
specific information carried by each subband, and such features also discard
information such as phase. Motivated by the mechanism of speech synthesis, in
which fundamental frequency (F0) information is used to improve the quality of
synthetic speech, we observe that the F0 of synthetic speech remains overly
smooth and averaged, differing significantly from that of real speech. F0 is
therefore expected to serve as important information for discriminating between
bonafide and fake speech, yet it cannot be used directly because of its
irregular distribution. Instead, the frequency band containing most of the F0
energy is selected as the input feature. Meanwhile, to make full use of the
phase and full-band information, we also propose to use real and imaginary
spectrogram features as complementary inputs and to model the disjoint subbands
separately. Finally, the results of the F0, real, and imaginary spectrogram
subsystems are fused. Experimental results on the ASVspoof 2019 LA dataset show
that the proposed system is highly effective for the audio deepfake detection
task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all
competing systems.
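The abstract's observation that synthetic-speech F0 is overly smooth can be illustrated with a short pitch-tracking sketch. This is a minimal illustration, not the authors' pipeline: it assumes librosa is installed, the file names are placeholders, and the standard deviation of the voiced F0 contour is used only as a crude proxy for contour variability (the paper itself feeds a subband feature, not the raw contour).

```python
# Minimal sketch (not the authors' code): extract F0 contours with pYIN and
# compare their variability. The intuition is that synthetic speech tends to
# have an overly smooth, averaged F0 contour compared with real speech.
import librosa
import numpy as np

def f0_variability(path, fmin=65.0, fmax=400.0):
    """Return the standard deviation of the voiced F0 contour (Hz)."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]
    return float(np.std(voiced_f0)) if voiced_f0.size else 0.0

# Placeholder file names; any bona fide / spoofed pair would do.
print("bona fide F0 std:", f0_variability("bonafide.flac"))
print("spoofed   F0 std:", f0_variability("spoofed.flac"))
```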
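Because the raw F0 contour is irregularly distributed, the paper instead selects the low-frequency band containing most of the F0 energy, alongside real and imaginary spectrograms that preserve phase. The sketch below shows one plausible way to assemble such inputs from an STFT; the 400 Hz band edge and the STFT settings are illustrative assumptions, not the paper's exact configuration, and the separate modeling of disjoint subbands is omitted.

```python
# Sketch of the three input feature types described in the abstract:
# a low-frequency subband (where most F0 energy lives) plus real and
# imaginary spectrograms that retain phase. Parameters are assumptions.
import numpy as np
from scipy.signal import stft

def make_features(y, sr=16000, n_fft=512, f0_band_hz=400.0):
    # Complex STFT; Zxx has shape (n_fft // 2 + 1, n_frames).
    freqs, _, Zxx = stft(y, fs=sr, nperseg=n_fft)
    real_spec = Zxx.real          # full-band real part
    imag_spec = Zxx.imag          # full-band imaginary part
    # Keep only the magnitude bins below an assumed F0 cutoff.
    f0_band = np.abs(Zxx)[freqs <= f0_band_hz, :]
    return f0_band, real_spec, imag_spec

y = np.random.randn(16000)  # stand-in for one second of speech
f0_band, real_spec, imag_spec = make_features(y)
print(f0_band.shape, real_spec.shape, imag_spec.shape)
```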
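Finally, the subsystem results are fused and evaluated with the equal error rate. The sketch below shows generic score-level averaging and a standard EER computation from ROC points; the equal fusion weights and the toy scores are hypothetical, not taken from the paper.

```python
# Sketch: score-level fusion of three subsystem outputs and EER computation.
# The fusion weights are hypothetical, not taken from the paper.
import numpy as np
from sklearn.metrics import roc_curve

def fuse_scores(f0_scores, real_scores, imag_scores, w=(1/3, 1/3, 1/3)):
    return w[0] * f0_scores + w[1] * real_scores + w[2] * imag_scores

def equal_error_rate(labels, scores):
    """EER: the ROC operating point where FPR equals FNR (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example: 1 = bona fide, 0 = spoof; higher score = more bona fide.
labels = np.array([1, 1, 0, 0, 1, 0])
scores = fuse_scores(np.array([0.9, 0.8, 0.2, 0.3, 0.7, 0.1]),
                     np.array([0.8, 0.9, 0.1, 0.4, 0.6, 0.2]),
                     np.array([0.7, 0.9, 0.3, 0.2, 0.8, 0.1]))
print(f"EER: {equal_error_rate(labels, scores):.2%}")
```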
Related papers
- SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model [31.280358048556444]
This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism.
The proposed system also incorporates elements like a fundamental frequency (F0) predictor and a waveform generation decoder.
Experiments on the Opencpop dataset demonstrate the efficacy of the proposed model in intonation quality and accuracy.
arXiv Detail & Related papers (2024-10-16T13:18:45Z)
- Statistics-aware Audio-visual Deepfake Detector [11.671275975119089]
Methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features.
We propose a statistical feature loss to enhance the discrimination capability of the model.
Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
arXiv Detail & Related papers (2024-07-16T12:15:41Z)
- Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Learning [81.98675881423131]
This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images.
Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries.
We introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors.
arXiv Detail & Related papers (2024-03-12T01:28:00Z)
- Exploring Meta Information for Audio-based Zero-shot Bird Classification [113.17261694996051]
This study investigates how meta-information can improve zero-shot audio classification.
We use bird species as an example case study due to the availability of rich and diverse meta-data.
arXiv Detail & Related papers (2023-09-15T13:50:16Z)
- Comparative Analysis of the wav2vec 2.0 Feature Extractor [42.18541127866435]
We study the capability of the wav2vec 2.0 feature extractor (FE) to replace standard feature extraction methods in a connectionist temporal classification (CTC) ASR model.
We show that both are competitive with traditional FEs on the LibriSpeech benchmark and analyze the effect of the individual components.
arXiv Detail & Related papers (2023-08-08T14:29:35Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Adaptive re-calibration of channel-wise features for Adversarial Audio Classification [0.0]
We propose a recalibration of features using attention feature fusion for synthetic speech detection.
We compare its performance against different detection methods including End2End models and Resnet-based models.
We also demonstrate that combining linear frequency cepstral coefficients (LFCC) and mel-frequency cepstral coefficients (MFCC) via the attentional feature fusion technique creates better input feature representations.
arXiv Detail & Related papers (2022-10-21T04:21:56Z)
- Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection).
Spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of differentiable architecture search (DARTS), named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.