LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech
- URL: http://arxiv.org/abs/2507.16220v2
- Date: Thu, 24 Jul 2025 01:25:18 GMT
- Title: LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech
- Authors: Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi
- Abstract summary: LENS-DF is a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization. We conduct experiments based on a self-supervised learning front-end and a simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes.
- Score: 35.36044093564255
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audio from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and multiple speakers, in a controllable fashion. The corresponding detection and localization protocol evaluates models on this generated data. We conduct experiments based on a self-supervised learning front-end and a simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.
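A minimal sketch of the kind of generation step the abstract describes: concatenating short bona fide and spoofed utterances into one long recording, mixing in background noise at a chosen SNR, and keeping per-segment labels and boundaries for localization. This is an illustrative assumption, not the authors' actual LENS-DF recipe; the helper inputs (`bona_fide`, `spoofed`, `noise` as 16 kHz numpy arrays) are hypothetical.

```python
import numpy as np

SR = 16000  # assumed sample rate

def mix_long_form(bona_fide, spoofed, noise, n_segments=8, snr_db=10.0, seed=0):
    """Concatenate randomly chosen bona fide / spoofed utterances into one
    long recording, add background noise at a target SNR, and return
    per-segment labels (1 = spoofed) plus start/end times for localization."""
    rng = np.random.default_rng(seed)
    segments, labels, boundaries, t = [], [], [], 0.0
    for _ in range(n_segments):
        is_spoof = rng.random() < 0.5
        pool = spoofed if is_spoof else bona_fide
        utt = pool[rng.integers(len(pool))].astype(np.float32)
        segments.append(utt)
        labels.append(int(is_spoof))
        boundaries.append((t, t + len(utt) / SR))
        t += len(utt) / SR
    audio = np.concatenate(segments)

    # Tile the noise to the full length and scale it to the requested SNR.
    bg = np.tile(noise.astype(np.float32), int(np.ceil(len(audio) / len(noise))))[: len(audio)]
    p_sig = float(np.mean(audio ** 2)) + 1e-12
    p_bg = float(np.mean(bg ** 2)) + 1e-12
    bg = bg * np.sqrt(p_sig / (p_bg * 10 ** (snr_db / 10.0)))
    return audio + bg, labels, boundaries
```

Varying `n_segments`, `snr_db`, and the utterance pools corresponds to the controllable duration, noise, and speaker factors mentioned above; the returned labels and boundaries are what a localization protocol would be scored against.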
Related papers
- Exploring the Frontiers of kNN Noisy Feature Detection and Recovery for Self-Driving Labs [0.49478969093606673]
This study develops an automated workflow to detect noisy features, determine sample-feature pairings that can be corrected, and finally recover the correct feature values. A systematic study is then performed to examine how dataset size, noise intensity, and feature value distribution affect both the detectability and recoverability of noisy features.
arXiv Detail & Related papers (2025-07-15T03:35:56Z)
- Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies [11.671275975119089]
We propose the introduction of fine-grained mechanisms for detecting subtle artifacts in both spatial and temporal domains.
First, we introduce a local audio-visual model capable of capturing small spatial regions that are prone to inconsistencies with audio.
Second, we introduce a temporally-local pseudo-fake augmentation to include samples incorporating subtle temporal inconsistencies in our training set.
arXiv Detail & Related papers (2024-08-13T09:19:59Z)
- Statistics-aware Audio-visual Deepfake Detector [11.671275975119089]
Methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features.
We propose a statistical feature loss to enhance the discrimination capability of the model.
Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
arXiv Detail & Related papers (2024-07-16T12:15:41Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a major issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Learning [81.98675881423131]
This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images.
Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries.
We introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors.
arXiv Detail & Related papers (2024-03-12T01:28:00Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Unsupervised Acoustic Scene Mapping Based on Acoustic Features and Dimensionality Reduction [18.641610823584433]
We introduce an unsupervised data-driven approach that exploits the natural structure of the data.
Our method builds upon local conformal autoencoders (LOCA) - an offline deep learning scheme for learning standardized data coordinates from measurements.
arXiv Detail & Related papers (2023-01-01T17:46:09Z)
- Improving the Robustness of Summarization Models by Detecting and Removing Input Noise [50.27105057899601]
We present a large empirical study quantifying the sometimes severe loss in performance from different types of input noise for a range of datasets and model sizes.
We propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any training, auxiliary models, or even prior knowledge of the type of noise.
arXiv Detail & Related papers (2022-12-20T00:33:11Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space; a rough sketch of this idea follows the list below.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results under noisy conditions.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
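The last entry above mentions embedding both high-quality (clean) and low-quality (noisy) samples into a similar vector space. As a rough illustration only, and not the authors' method, the sketch below pairs each clean utterance with a noisy version and adds an alignment term that pulls the two sentence vectors together; the `Encoder`, the class count, and the paired token-ID tensors are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy sentence encoder: embedding + GRU, mean-pooled to one vector."""
    def __init__(self, vocab_size=10000, dim=128, n_classes=7):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)    # e.g. intent classes (assumed)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        h, _ = self.rnn(self.emb(token_ids))    # (batch, seq_len, dim)
        return h.mean(dim=1)                    # (batch, dim) sentence vector

def adaptation_loss(model, clean_ids, noisy_ids, labels, alpha=0.5):
    """Classification loss on both views plus an alignment term that pulls
    the noisy sentence vector toward the clean one ('similar vector space')."""
    z_clean = model(clean_ids)
    z_noisy = model(noisy_ids)
    ce = F.cross_entropy(model.cls(z_clean), labels) + \
         F.cross_entropy(model.cls(z_noisy), labels)
    align = F.mse_loss(z_noisy, z_clean.detach())
    return ce + alpha * align
```

Training on this combined loss keeps the classifier accurate while making noisy inputs land near their clean counterparts, which is one plausible reading of the domain-adaptation claim above.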