MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark
- URL: http://arxiv.org/abs/2505.11109v1
- Date: Fri, 16 May 2025 10:42:30 GMT
- Title: MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark
- Authors: Florinel-Alin Croitoru, Vlad Hondru, Marius Popescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
- Abstract summary: We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages. For each language, the fake videos are generated with seven distinct deepfake generation models.
- Score: 108.46287432944392
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of the data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages is available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.
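Since the data is hosted on the Hugging Face Hub, it can presumably be loaded with the standard `datasets` library. The sketch below shows one way to do this; the configuration-free call, the `test` split name, and the binary `label` field are assumptions for illustration, so consult the dataset card for the actual schema.

```python
# Minimal sketch: loading MAVOS-DD via the Hugging Face `datasets` library.
# The repository id comes from the abstract; the split name "test" and the
# binary "label" field are assumptions for illustration -- see the dataset
# card at https://huggingface.co/datasets/unibuc-cs/MAVOS-DD for the actual
# configurations and features.
from datasets import load_dataset

ds = load_dataset("unibuc-cs/MAVOS-DD")
print(ds)  # lists the available splits and their features

test = ds["test"]  # assumed split name
# Assumed binary labeling: 1 = fake, 0 = real.
fake_fraction = sum(int(ex["label"]) for ex in test) / len(test)
print(f"fake fraction: {fake_fraction:.2%}")
```

For the open-set setups described in the abstract, one would additionally filter by generator and language metadata, with field names again depending on the released schema.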
Related papers
- Multilingual Source Tracing of Speech Deepfakes: A First Benchmark [19.578741954970738]
This paper introduces the first benchmark for multilingual speech deepfake source tracing. We comparatively investigate DSP- and SSL-based modeling and examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ.
arXiv Detail & Related papers (2025-08-06T07:11:36Z) - Tell me Habibi, is it Real or Fake? [15.344187517040508]
Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. We introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection.
arXiv Detail & Related papers (2025-05-28T16:54:36Z) - AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset [21.90332221144928]
We propose the AV-Deepfake1M dataset for the detection and localization of deepfake audio-visual content.
The dataset contains content-driven manipulations of three types, namely (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations, covering more than 2K subjects and resulting in a total of more than 1M videos.
arXiv Detail & Related papers (2023-11-26T14:17:51Z) - AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection [49.81915942821647]
This study introduces the audio-visual transformer-based ensemble network (AVTENet) to detect deepfake videos. For evaluation, we use the recently released multimodal audio-video benchmark dataset FakeAVCeleb. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to building an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF, a text-to-feature diffusion method, obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z) - Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture; a toy sketch of this style of model is given after this list.
arXiv Detail & Related papers (2023-05-03T08:48:45Z) - OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset [14.619865864254924]
The Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset is the largest among publicly available audio-visual speech datasets.
The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations.
arXiv Detail & Related papers (2023-01-16T11:40:50Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
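To make the architecture style mentioned in the BA-TFD entry above concrete, here is a toy 3D-convolutional real/fake video classifier in PyTorch. It is a minimal sketch, not the published BA-TFD model: the layer widths, clip shape, and binary classification head are all illustrative assumptions.

```python
# A toy 3D-convolutional real/fake video classifier, sketched only to
# illustrate the kind of architecture BA-TFD builds on. All hyperparameters
# (channels, kernel sizes, clip shape) are illustrative assumptions and do
# not reproduce the published model.
import torch
import torch.nn as nn

class Tiny3DConvDetector(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),  # halve the temporal and spatial resolution
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global spatio-temporal pooling
        )
        self.head = nn.Linear(32, 1)  # single real/fake logit

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        x = self.features(clip).flatten(1)
        return self.head(x)

# Usage on a random 16-frame 112x112 clip:
model = Tiny3DConvDetector()
logit = model(torch.randn(2, 3, 16, 112, 112))
print(logit.shape)  # torch.Size([2, 1])
```

The 3D convolutions mix information across frames as well as pixels, which is what lets such models pick up temporal inconsistencies that frame-level 2D classifiers miss.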