Related papers: Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection

Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection

URL: http://arxiv.org/abs/2406.13384v1
Date: Wed, 19 Jun 2024 09:26:22 GMT
Title: Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection
Authors: Aravinda Reddy PN, Raghavendra Ramachandra, Krothapalli Sreenivasa Rao, Pabitra Mitra, Vinod Rathod,
Abstract summary: multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting. In this paper, we introduce the Straight-through Gumbel-Softmax framework, offering a comprehensive approach to search multimodal fusion model architectures. Experiments on the FakeAVCeleb and SWAN-DF datasets demonstrated an impressive AUC value 94.4% achieved with minimal model parameters.
Score: 6.367999777464464
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deepfakes are a major security risk for biometric authentication. This technology creates realistic fake videos that can impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, offering a comprehensive approach to search multimodal fusion model architectures. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Initially, crucial features were efficiently identified from backbone networks, whereas within the cell structure, a weighted fusion operation integrated information from various sources. An architecture that maximizes the classification performance is derived by varying parameters such as temperature and sampling time. The experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrated an impressive AUC value 94.4\% achieved with minimal model parameters.

Related papers

TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents [0.4881924950569192]
TwoHead-SwinFPN is a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents.<n>Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation.<n>Experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization.
arXiv Detail & Related papers (2026-01-19T09:50:51Z)
AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection [7.76090543025328]
Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection.<n>A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models.<n>We address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques.
arXiv Detail & Related papers (2025-12-19T16:06:03Z)
Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation [18.402668470092294]
Synthetic video generation can produce very realistic high-resolution videos that are virtually indistinguishable from real ones.<n>Several video forensic detectors have been recently proposed, but they often exhibit poor generalization.<n>We introduce a novel data augmentation strategy based on the wavelet decomposition and replace specific frequency-related bands to drive the model to exploit more relevant forensic cues.<n>Our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models.
arXiv Detail & Related papers (2025-06-20T07:36:59Z)
FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes [9.462613446025001]
Face-fake Deepfake videos pose growing risks to digital security, privacy, and media integrity.<n>FAME is a framework designed to capture subtle artifacts specific to different face-generative models.<n>Results show that FAME consistently outperforms existing methods in both accuracy and runtime.
arXiv Detail & Related papers (2025-06-13T05:47:09Z)
DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models [43.86847047796023]
Current deepfake detection methods often depend on datasets with limited generation models and content diversity.<n>We present textbfDFBench, a large-scale DeepFake Benchmark featuring 540,000 images across real, AI-edited, and AI-generated content.<n>We propose textbfMoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs.
arXiv Detail & Related papers (2025-06-03T15:45:41Z)
Deepfake Detection with Optimized Hybrid Model: EAR Biometric Descriptor via Improved RCNN [1.1356542363919058]
We introduce robust detection of subtle ear movements and shape changes to generate ear descriptors. We also propose a novel optimized hybrid deepfake detection model that considers the ear biometric descriptors via enhanced RCNN. Our proposed method outperforms traditional models such as CNN (Convolution Neural Network), SqueezeNet, LeNet, LinkNet, LSTM (Long Short-Term Memory), DFP (Deepfake Predictor), and ResNext+CNN+LSTM.
arXiv Detail & Related papers (2025-03-16T07:01:29Z)
HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection [4.908389661988192]
HFMF is a comprehensive two-stage deepfake detection framework. It integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks.
arXiv Detail & Related papers (2025-01-10T00:20:29Z)
Gumbel Rao Monte Carlo based Bi-Modal Neural Architecture Search for Audio-Visual Deepfake Detection [2.711788614039839]
Deepfakes pose a critical threat to biometric authentication systems by generating highly realistic synthetic media. Existing multimodal deepfake detectors often struggle to adapt to diverse data and rely on simple fusion methods. We propose a novel architecture search framework that employs Gumbel-Rao Monte Carlo sampling to optimize multimodal fusion.
arXiv Detail & Related papers (2024-10-09T04:37:35Z)
Leveraging Mixture of Experts for Improved Speech Deepfake Detection [53.69740463004446]
Speech deepfakes pose a significant threat to personal security and content authenticity. We introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture.
arXiv Detail & Related papers (2024-09-24T13:24:03Z)
EM-DARTS: Hierarchical Differentiable Architecture Search for Eye Movement Recognition [54.99121380536659]
Eye movement biometrics have received increasing attention thanks to its high secure identification. Deep learning (DL) models have been recently successfully applied for eye movement recognition. DL architecture still is determined by human prior knowledge. We propose EM-DARTS, a hierarchical differentiable architecture search algorithm to automatically design the DL architecture for eye movement recognition.
arXiv Detail & Related papers (2024-09-22T13:11:08Z)
A Noise and Edge extraction-based dual-branch method for Shallowfake and Deepfake Localization [15.647035299476894]
We develop a dual-branch model that integrates manually designed feature noise with conventional CNN features. The model is superior in comparison and easily outperforms the existing state-of-the-art (SoTA) models.
arXiv Detail & Related papers (2024-09-02T02:18:34Z)
Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture [58.60915132222421]
We introduce an approach that is both general and parameter-efficient for face forgery detection. We design a forgery-style mixture formulation that augments the diversity of forgery source domains. We show that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters.
arXiv Detail & Related papers (2024-08-23T01:53:36Z)
A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
A popular similarity-based feature upsampling pipeline has been proposed, which utilizes a high-resolution feature as guidance. We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives. We develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
CapST: An Enhanced and Lightweight Model Attribution Approach for Synthetic Videos [9.209808258321559]
This paper investigates the model attribution problem of Deepfake videos from a recently proposed dataset, Deepfakes from Different Models (DFDM) The dataset comprises 6,450 Deepfake videos generated by five distinct models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio. Experimental results on the deepfake benchmark dataset (DFDM) demonstrate the efficacy of our proposed method, achieving up to a 4% improvement in accurately categorizing deepfake videos.
arXiv Detail & Related papers (2023-11-07T08:05:09Z)
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. We pioneer a systematic study on deepfake detection generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
Domain Generalization via Ensemble Stacking for Face Presentation Attack Detection [4.61143637299349]
Face Presentation Attack Detection (PAD) plays a pivotal role in securing face recognition systems against spoofing attacks. This work proposes a comprehensive solution that combines synthetic data generation and deep ensemble learning. Experimental results on four datasets demonstrate low half total error rates (HTERs) on three benchmark datasets.
arXiv Detail & Related papers (2023-01-05T16:44:36Z)
Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-toend fake audio detection method. We first use wav2vec pre-trained model to obtain a high-level representation of the speech. For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
ASFD: Automatic and Scalable Face Detector [129.82350993748258]
We propose a novel Automatic and Scalable Face Detector (ASFD) ASFD is based on a combination of neural architecture search techniques as well as a new loss design. Our ASFD-D6 outperforms the prior strong competitors, and our lightweight ASFD-D0 runs at more than 120 FPS with Mobilenet for VGA-resolution images.
arXiv Detail & Related papers (2020-03-25T06:00:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.