Related papers: AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

URL: http://arxiv.org/abs/2311.15308v2
Date: Mon, 29 Jul 2024 06:24:07 GMT
Title: AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
Authors: Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, Kalin Stefanov
Abstract summary: We propose the AV-Deepfake1M dataset for the detection and localization of deepfake audio-visual content. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos.
Score: 21.90332221144928
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M .

Related papers

ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection [49.14187862877009]
We present ERF-BA-TFD+, a novel deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. We evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips.
arXiv Detail & Related papers (2025-08-24T10:03:46Z)
AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations [15.420752640434513]
This paper includes the description of data generation strategies along with benchmarking of AV-Deepfake1M++. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge.
arXiv Detail & Related papers (2025-07-28T07:27:42Z)
MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark [108.46287432944392]
We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages. For each language, the fake videos are generated with seven distinct deepfake generation models.
arXiv Detail & Related papers (2025-05-16T10:42:30Z)
1M-Deepfakes Detection Challenge [31.994908331728958]
The 1M-Deepfakes Detection Challenge is designed to engage the research community in developing advanced methods for detecting and localizing deepfake manipulations. The participants can access the AV-Deepfake1M dataset and are required to submit their inference results for evaluation. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection and localization systems.
arXiv Detail & Related papers (2024-09-11T03:43:53Z)
Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity. Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat. We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z)
AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos utilizes only the visual modality or the audio modality. We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We employ an LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF). LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z)
PVDD: A Practical Video Denoising Dataset with Real-World Dynamic Scenes [56.4361151691284]
The "Practical Video Denoising Dataset" (PVDD) contains 200 noisy-clean dynamic video pairs in both sRGB and RAW format. Compared with existing datasets consisting of limited motion information, PVDD covers dynamic scenes with varying natural motion.
arXiv Detail & Related papers (2022-07-04T12:30:22Z)
Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization [19.490174583625862]
We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF). Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video. Our extensive quantitative and qualitative analysis demonstrates the proposed method's strong performance for temporal forgery localization and deepfake detection tasks.
arXiv Detail & Related papers (2022-04-13T08:02:11Z)
Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos. We propose to perform the deepfake detection from an unexplored voice-face matching view. Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content. We extract and analyze the similarity between the two audio and visual modalities from within the same video. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.