An Efficient Temporary Deepfake Location Approach Based Embeddings for
Partially Spoofed Audio Detection
- URL: http://arxiv.org/abs/2309.03036v2
- Date: Tue, 21 Nov 2023 10:09:46 GMT
- Title: An Efficient Temporary Deepfake Location Approach Based Embeddings for
Partially Spoofed Audio Detection
- Authors: Yuankun Xie, Haonan Cheng, Yutian Wang, Long Ye
- Abstract summary: We propose a fine-grained partially spoofed audio detection method, namely Temporal Deepfake Location (TDL).
Our approach involves two novel parts: an embedding similarity module and a temporal convolution operation.
Our method outperforms baseline models on the ASVspoof 2019 Partial Spoof dataset and demonstrates superior performance even in the cross-dataset scenario.
- Score: 4.055489363682199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Partially spoofed audio detection is a challenging task because it requires
accurately determining the authenticity of audio at the frame level. To address this
issue, we propose a fine-grained partially spoofed audio detection method,
namely Temporal Deepfake Location (TDL), which can effectively capture
information of both features and locations. Specifically, our approach involves
two novel parts: an embedding similarity module and a temporal convolution
operation. To enhance the discrimination between real and fake features,
the embedding similarity module is designed to generate an embedding space that
can separate the real frames from fake frames. To effectively concentrate on
the position information, the temporal convolution operation is proposed to
calculate the frame-specific similarities among neighboring frames and
dynamically select informative neighbors for convolution. Extensive experiments
show that our method outperforms baseline models on the ASVspoof2019 Partial Spoof
dataset and demonstrates superior performance even in the cross-dataset scenario.
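The idea of computing frame-specific similarities among neighboring frames and selecting informative neighbors can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cosine`, `aggregate_similar_neighbors`), the parameters `window` and `top_k`, the use of cosine similarity, and the plain averaging step are all assumptions chosen for illustration. It only shows why similarity-based neighbor selection may keep a real/fake boundary sharper than uniform smoothing would.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def aggregate_similar_neighbors(frames, window=1, top_k=1):
    """Average each frame with its top_k most similar neighbors within +/- window frames.

    Hypothetical stand-in for neighbor selection; a learned convolution
    would replace the plain average used here.
    """
    smoothed = []
    for t, frame in enumerate(frames):
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        neighbors = [j for j in range(lo, hi) if j != t]
        # Rank the temporal neighbors by similarity to the current frame.
        neighbors.sort(key=lambda j: cosine(frame, frames[j]), reverse=True)
        chosen = [frame] + [frames[j] for j in neighbors[:top_k]]
        # Element-wise mean over the current frame and its selected neighbors.
        smoothed.append([sum(col) / len(chosen) for col in zip(*chosen)])
    return smoothed


# Toy sequence: two "real-like" frames followed by two "fake-like" frames.
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
result = aggregate_similar_neighbors(toy_frames)
```

In this toy case each frame is averaged only with the neighbor on its own side of the boundary, so the embeddings at the transition are left unmixed, whereas a uniform average over the window would blend the two groups.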
Related papers
- Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies [11.671275975119089]
We propose the introduction of fine-grained mechanisms for detecting subtle artifacts in both spatial and temporal domains.
First, we introduce a local audio-visual model capable of capturing small spatial regions that are prone to inconsistencies with audio.
Second, we introduce a temporally-local pseudo-fake augmentation to include samples incorporating subtle temporal inconsistencies in our training set.
arXiv Detail & Related papers (2024-08-13T09:19:59Z)
- Enhancing Partially Spoofed Audio Localization with Boundary-aware Attention Mechanism [17.468808107791265]
We propose a novel method called Boundary-aware Attention Mechanism (BAM).
BAM consists of two core modules: Boundary Enhancement and Boundary Frame-wise Attention.
Experimental results on the PartialSpoof database demonstrate that our proposed method achieves the best performance.
arXiv Detail & Related papers (2024-07-31T13:49:17Z)
- Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization [60.899082019130766]
We introduce a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization.
FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions.
PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN.
arXiv Detail & Related papers (2024-07-23T15:07:52Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- MomentDiff: Generative Video Moment Retrieval from Random to Real [71.40038773943638]
We provide a generative diffusion-based framework called MomentDiff.
MomentDiff simulates a typical human retrieval process from random browsing to gradual localization.
We show that MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks.
arXiv Detail & Related papers (2023-07-06T09:12:13Z)
- Learning Sequence Descriptor based on Spatio-Temporal Attention for
Visual Place Recognition [16.380948630155476]
Visual Place Recognition (VPR) aims to retrieve frames from a geotagged database that are located at the same place as the query frame.
To improve the robustness of VPR in perceptually aliasing scenarios, sequence-based VPR methods are proposed.
We use a sliding window to control the temporal range of attention and use relative positional encoding to construct sequential relationships between different features.
arXiv Detail & Related papers (2023-05-19T06:39:10Z)
- Glitch in the Matrix: A Large Scale Benchmark for Content Driven
Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z)
- Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset
and Multimodal Method for Temporal Forgery Localization [19.490174583625862]
We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF).
Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video.
Our extensive quantitative and qualitative analysis demonstrates the proposed method's strong performance for temporal forgery localization and deepfake detection tasks.
arXiv Detail & Related papers (2022-04-13T08:02:11Z)
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw
Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between the images and their neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)