Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset
and Multimodal Method for Temporal Forgery Localization
- URL: http://arxiv.org/abs/2204.06228v2
- Date: Thu, 4 May 2023 00:41:33 GMT
- Authors: Zhixi Cai, Kalin Stefanov, Abhinav Dhall, Munawar Hayat
- Abstract summary: We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF).
Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video.
Our extensive quantitative and qualitative analysis demonstrates the proposed method's strong performance for temporal forgery localization and deepfake detection tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to its high societal impact, deepfake detection is getting active
attention in the computer vision community. Most deepfake detection methods
rely on identity, facial attributes, and adversarial perturbation-based
spatio-temporal modifications at the whole video or random locations while
keeping the meaning of the content intact. However, a sophisticated deepfake
may contain only a small segment of video/audio manipulation, through which the
meaning of the content can be, for example, completely inverted from a
sentiment perspective. We introduce a content-driven audio-visual deepfake
dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed
for the task of learning temporal forgery localization. Specifically, the
content-driven audio-visual manipulations are performed strategically to change
the sentiment polarity of the whole video. Our baseline method for benchmarking
the proposed dataset is a 3D CNN model, termed Boundary Aware Temporal
Forgery Detection (BA-TFD), which is guided via contrastive, boundary matching,
and frame classification loss functions. Our extensive quantitative and
qualitative analysis demonstrates the proposed method's strong performance for
temporal forgery localization and deepfake detection tasks.
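The abstract describes BA-TFD as guided by three loss terms: contrastive, boundary matching, and frame classification. The sketch below illustrates how such a multi-task objective could be combined; the specific formulations and the weights `lambda_c` and `lambda_b` are illustrative assumptions, not the paper's actual implementation.

```python
import math

# Hypothetical sketch of the multi-loss guidance described for BA-TFD.
# Loss names follow the abstract; formulations and weights are assumptions.

def frame_bce(preds, labels):
    """Frame classification loss: binary cross-entropy over per-frame
    fake/real probabilities."""
    eps = 1e-7
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for p, y in zip(preds, labels)
    ) / len(preds)

def contrastive(dist, same_label, margin=1.0):
    """Contrastive loss on an embedding distance: pull matched (genuine)
    audio-visual pairs together, push mismatched pairs apart."""
    if same_label:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

def boundary_match(pred_seg, true_seg):
    """Boundary matching term: 1 - temporal IoU between a predicted
    fake segment (start, end) and the ground-truth segment."""
    inter = max(0.0, min(pred_seg[1], true_seg[1]) - max(pred_seg[0], true_seg[0]))
    union = (pred_seg[1] - pred_seg[0]) + (true_seg[1] - true_seg[0]) - inter
    return 1.0 - (inter / union if union > 0 else 0.0)

def total_loss(preds, labels, dist, same_label, pred_seg, true_seg,
               lambda_c=0.1, lambda_b=1.0):
    """Weighted sum of the three terms (weights are hypothetical)."""
    return (frame_bce(preds, labels)
            + lambda_c * contrastive(dist, same_label)
            + lambda_b * boundary_match(pred_seg, true_seg))

# Example: per-frame fake probabilities vs. labels, plus one fake segment.
loss = total_loss(
    preds=[0.9, 0.8, 0.2, 0.1], labels=[1, 1, 0, 0],
    dist=0.3, same_label=False,          # mismatched pair, pushed apart
    pred_seg=(1.0, 2.5), true_seg=(1.0, 2.0),
)
```

A perfectly localized segment drives the boundary term to zero, while residual frame-level and contrastive errors keep the total loss positive, so all three signals contribute to training.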
Related papers
- DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization
We present a novel audio-visual deepfake detection framework based on the assumption that, in real samples, in contrast to deepfakes, the visual and audio signals coincide in terms of information.
We use features from deep networks that specialize in video and audio speech recognition to spot frame-level cross-modal incongruities.
arXiv Detail & Related papers (2024-11-15T13:47:33Z)
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts
This paper introduces a novel method that learns temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
- Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity.
Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat.
We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z)
- AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
We propose the AV-Deepfake1M dataset for the detection and localization of deepfake audio-visual content.
The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos.
arXiv Detail & Related papers (2023-11-26T14:17:51Z)
- An Efficient Temporary Deepfake Location Approach Based Embeddings for Partially Spoofed Audio Detection
We propose a fine-grained partially spoofed audio detection method, namely Temporal Deepfake Location (TDL).
Our approach involves two novel parts: embedding similarity module and temporal convolution operation.
Our method outperforms baseline models on the ASVspoof 2019 Partial Spoof dataset and demonstrates superior performance even in the cross-dataset scenario.
arXiv Detail & Related papers (2023-09-06T14:29:29Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z)
- Audio-Visual Person-of-Interest DeepFake Detection
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Content-Based Detection of Temporal Metadata Manipulation
We propose an end-to-end approach to verify whether the purported time of capture of an image is consistent with its content and geographic location.
The central idea is the use of supervised consistency verification, in which we predict the probability that the image content, capture time, and geographical location are consistent.
Our approach improves upon previous work on a large benchmark dataset, increasing the classification accuracy from 59.03% to 81.07%.
arXiv Detail & Related papers (2021-03-08T13:16:19Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
- SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor
We propose a SceneEncoder module to impose a scene-aware guidance to enhance the effect of global information.
The module predicts a scene descriptor, which learns to represent the categories of objects existing in the scene.
We also design a region similarity loss to propagate distinguishing features to their own neighboring points with the same label.
arXiv Detail & Related papers (2020-01-24T16:53:30Z)