Related papers: A Multimodal Framework for Deepfake Detection

A Multimodal Framework for Deepfake Detection

URL: http://arxiv.org/abs/2410.03487v1
Date: Fri, 4 Oct 2024 14:59:10 GMT
Title: A Multimodal Framework for Deepfake Detection
Authors: Kashish Gandhi, Prutha Kulkarni, Taran Shah, Piyush Chaudhari, Meera Narvekar, Kranti Ghag
Abstract summary: Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. Our research addresses the critical issue of deepfakes through an innovative multimodal approach. Our framework combines visual and auditory analyses, yielding an accuracy of 94%.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of deepfake technology poses a significant threat to digital media integrity. Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. This creates risks of misinformation, fraud, and severe implications for personal privacy and security. Our research addresses the critical issue of deepfakes through an innovative multimodal approach, targeting both visual and auditory elements. This comprehensive strategy recognizes that human perception integrates multiple sensory inputs, particularly visual and auditory information, to form a complete understanding of media content. For visual analysis, a model that employs advanced feature extraction techniques was developed, extracting nine distinct facial characteristics and then applying various machine learning and deep learning models. For auditory analysis, our model leverages mel-spectrogram analysis for feature extraction and then applies various machine learning and deep learning models. To achieve a combined analysis, real and deepfake audio in the original dataset were swapped for testing purposes, and balanced samples were ensured. Using our proposed models for video and audio classification, i.e., Artificial Neural Network and VGG19, the overall sample is classified as deepfake if either component is identified as such. Our multimodal framework combines visual and auditory analyses, yielding an accuracy of 94%.
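
The decision rule stated in the abstract (a sample is labeled deepfake if either the video or the audio component is flagged) can be sketched as a simple OR-style late fusion. This is a minimal illustration, not the authors' implementation: the `ModalityScores` container, the `fuse_or` function, the probability-style outputs, and the 0.5 threshold are all assumptions made for the example and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ModalityScores:
    """Hypothetical per-sample outputs of two independent classifiers."""
    video_fake_prob: float  # assumed: visual model's probability that the video is fake
    audio_fake_prob: float  # assumed: audio model's probability that the audio is fake


def fuse_or(scores: ModalityScores, threshold: float = 0.5) -> str:
    """Label the sample 'deepfake' if either modality is flagged, else 'real'.

    The threshold is an illustrative assumption; the abstract does not state one.
    """
    video_fake = scores.video_fake_prob >= threshold
    audio_fake = scores.audio_fake_prob >= threshold
    return "deepfake" if (video_fake or audio_fake) else "real"


if __name__ == "__main__":
    examples = [
        ModalityScores(video_fake_prob=0.9, audio_fake_prob=0.1),  # fake video only
        ModalityScores(video_fake_prob=0.2, audio_fake_prob=0.8),  # fake audio only
        ModalityScores(video_fake_prob=0.1, audio_fake_prob=0.2),  # both look real
    ]
    for example in examples:
        print(example, fuse_or(example))
```

An OR rule of this kind favors catching manipulations in either stream, at the cost that a false positive from one classifier alone is enough to flag the whole sample.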

Related papers

Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline [56.790045049514326]
Two major forms of deception dominate: human-crafted misinformation and AI-generated content. We propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines.
arXiv Detail & Related papers (2025-09-30T09:26:32Z)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z)
Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework [19.53717894228692]
Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. We propose a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Our method is significantly lightweight with only 0.48M parameters, yet it achieves superiority in both uni-modal and multi-modal deepfakes.
arXiv Detail & Related papers (2025-06-09T02:13:04Z)
Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data. We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception [30.295294657519165]
Multimodal deepfakes involving audiovisual manipulations are a growing threat because they are difficult to detect with the naked eye or using unimodal deep learning-based forgery detection methods. In this study, we examine the detection capabilities of a large language model (LLM) to identify and account for any possible visual and auditory artifacts and manipulations in audiovisual deepfake content.
arXiv Detail & Related papers (2024-11-14T08:07:02Z)
Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights [49.81915942821647]
Deep Learning has been successfully applied in diverse fields, and its impact on deepfake detection is no exception. Deepfakes are fake yet realistic synthetic content that can be used deceitfully for political impersonation, phishing, slandering, or spreading misinformation. This paper aims to improve the effectiveness of deepfake detection strategies and guide future research in cybersecurity and media integrity.
arXiv Detail & Related papers (2024-11-12T09:02:11Z)
SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity. Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat. We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z)
Advancing Automated Deception Detection: A Multimodal Approach to Feature Extraction and Analysis [0.0]
This research focuses on the extraction and combination of various features to enhance the accuracy of deception detection models. By systematically extracting features from visual, audio, and text data, and experimenting with different combinations, we developed a robust model that achieved an impressive 99% accuracy.
arXiv Detail & Related papers (2024-07-08T14:59:10Z)
AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos only utilize visual modality or audio modality. We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection [4.659427498118277]
A novel kind of deepfakes has emerged with either audio or visual modalities manipulated. Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video. In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
arXiv Detail & Related papers (2023-10-03T17:43:24Z)
NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. We conduct a series of analyses to gain deeper insights into this task. We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z)
Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity. Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content. We extract and analyze the similarity between the two audio and visual modalities from within the same video. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.