MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection
- URL: http://arxiv.org/abs/2211.10996v1
- Date: Sun, 20 Nov 2022 15:17:24 GMT
- Title: MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection
- Authors: Davide Alessandro Coccomini, Giorgos Kordopatis-Zilos, Giuseppe Amato,
Roberto Caldelli, Fabrizio Falchi, Symeon Papadopoulos, Claudio Gennaro
- Abstract summary: We introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes.
It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people.
- Score: 17.74528571088335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce MINTIME, a video deepfake detection approach that
captures spatial and temporal anomalies and handles instances of multiple
people in the same video and variations in face sizes. Previous approaches
disregard such information either by applying simple a-posteriori aggregation
schemes (e.g., an average or max operation) or by using only one identity for
inference (i.e., the largest one). In contrast, the proposed approach builds
on a Spatio-Temporal TimeSformer combined with a Convolutional Neural Network
backbone to capture spatio-temporal anomalies from the face sequences of
multiple identities depicted in a video. This is achieved through an
Identity-aware Attention mechanism that attends to each face sequence
independently based on a masking operation and facilitates video-level
aggregation. In addition, two novel embeddings are employed: (i) the Temporal
Coherent Positional Embedding that encodes each face sequence's temporal
information and (ii) the Size Embedding that encodes the size of the faces as a
ratio to the video frame size. These extensions allow our system to adapt
particularly well in the wild by learning how to aggregate information from
multiple identities, which other methods in the literature usually disregard.
It achieves state-of-the-art results on the ForgeryNet dataset with
an improvement of up to 14% AUC in videos containing multiple people and
demonstrates ample generalization capabilities in cross-forgery and
cross-dataset settings. The code is publicly available at
https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection
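To make the mechanisms described above concrete, here is a minimal PyTorch sketch (illustrative only, not the authors' released code: the module name, tensor shapes, and the max_frames and num_size_bins parameters are assumptions). It shows (i) a block mask that lets each token attend only within its own face sequence, (ii) a temporal positional embedding indexed by the frame each face was taken from, and (iii) a size embedding derived from the face-to-frame area ratio.

```python
import torch
import torch.nn as nn

class MintimeStyleEmbeddings(nn.Module):
    """Illustrative sketch of MINTIME-style token embeddings (assumed API)."""

    def __init__(self, dim=512, max_frames=32, num_size_bins=10):
        super().__init__()
        # Temporal Coherent Positional Embedding: indexed by the source frame,
        # so each identity's face sequence keeps its own temporal ordering.
        self.temporal_embed = nn.Embedding(max_frames, dim)
        # Size Embedding: the face-area / frame-area ratio, quantized into bins.
        self.size_embed = nn.Embedding(num_size_bins, dim)
        self.num_size_bins = num_size_bins

    def forward(self, tokens, frame_idx, size_ratio):
        # tokens:     (B, N, dim) face-sequence tokens from the CNN backbone
        # frame_idx:  (B, N)      frame index of each token
        # size_ratio: (B, N)      face area as a fraction of the frame area
        size_bin = (size_ratio.clamp(0, 1) * (self.num_size_bins - 1)).long()
        return tokens + self.temporal_embed(frame_idx) + self.size_embed(size_bin)

def identity_attention_mask(identity_ids):
    """Identity-aware attention mask: True where attention is allowed.

    identity_ids: (B, N) integer identity label of each token.
    Returns a boolean (B, N, N) block mask so that tokens attend only to
    tokens belonging to the same face sequence.
    """
    return identity_ids.unsqueeze(2) == identity_ids.unsqueeze(1)
```

In a full model, the boolean mask would be converted to additive form (zero where allowed, a large negative value elsewhere) before the attention softmax, and the per-identity representations would then be aggregated into a single video-level prediction.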
Related papers
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior.
We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information.
We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z)
- SITAR: Semi-supervised Image Transformer for Action Recognition [20.609596080624662]
This paper addresses video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos.
We capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images.
Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition.
arXiv Detail & Related papers (2024-09-04T17:49:54Z)
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection [19.432851794777754]
We present a novel approach for the detection of deepfake videos using a pair of vision transformers pre-trained by a self-supervised masked autoencoding setup.
Our method consists of two distinct components, one of which focuses on learning spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields generated from consecutive frames.
arXiv Detail & Related papers (2023-06-12T05:49:23Z)
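For readers unfamiliar with the pre-training scheme used in the entry above, the following is a generic masked-autoencoding objective in PyTorch (a textbook MAE-style sketch with assumed encoder/decoder callables, not this paper's implementation); the same loss can be applied to RGB patches for the spatial stream or to optical-flow patches for the temporal stream.

```python
import torch

def masked_autoencoding_loss(patches, encoder, decoder, mask_ratio=0.75):
    """Generic MAE-style objective: hide most patches, reconstruct them.

    patches: (B, N, D) flattened patch values; encoder and decoder are
    assumed callables supplied by the caller (illustrative signatures).
    """
    B, N, D = patches.shape
    num_keep = max(1, int(N * (1 - mask_ratio)))
    # Choose a random subset of visible patches per sample.
    keep_idx = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :num_keep]
    visible = patches.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    hidden = torch.ones(B, N, dtype=torch.bool, device=patches.device)
    hidden.scatter_(1, keep_idx, False)  # True where a patch was masked out
    # Encode only the visible patches; decode predictions for all positions.
    recon = decoder(encoder(visible, keep_idx))  # (B, N, D)
    # Score the reconstruction only on patches the encoder never saw.
    return ((recon - patches) ** 2).mean(dim=-1)[hidden].mean()
```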
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible part.
We propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge.
Under this condition, the occluded representation can be spontaneously well aligned in a selected subspace.
arXiv Detail & Related papers (2022-07-19T03:31:13Z)
- Seq-Masks: Bridging the gap between appearance and gait modeling for video-based person re-identification [10.490428828061292]
Video-based person re-identification (Re-ID) aims to match person images in video sequences captured by disjoint surveillance cameras.
Traditional video-based person Re-ID methods focus on exploring appearance information and are thus vulnerable to illumination changes, scene noise, camera parameters, and especially clothing/carrying variations.
We propose a framework that utilizes the sequence masks (SeqMasks) in the video to integrate appearance information and gait modeling in a close fashion.
arXiv Detail & Related papers (2021-12-10T16:00:20Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- Self-attention aggregation network for video face representation and recognition [0.0]
We propose a new model architecture for video face representation and recognition based on a self-attention mechanism.
Our approach can be used for videos with single and multiple identities.
arXiv Detail & Related papers (2020-10-11T20:57:46Z)
- Sharp Multiple Instance Learning for DeepFake Video Detection [54.12548421282696]
We introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated.
A sharp MIL (S-MIL) is proposed which builds direct mapping from instance embeddings to bag prediction.
Experiments on FFPMS and widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection.
arXiv Detail & Related papers (2020-08-11T08:52:17Z)
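To illustrate the "direct mapping from instance embeddings to bag prediction" idea in the entry above, here is a generic sharp-MIL head in PyTorch (an illustrative sketch with an assumed sharpness parameter, not the exact S-MIL formulation):

```python
import torch
import torch.nn as nn

class SharpMILHead(nn.Module):
    """Generic sharp multiple-instance-learning head (illustrative sketch).

    Maps per-face instance embeddings directly to a video-level (bag)
    fake probability; sharper pooling approximates a max, so a few
    strongly manipulated faces can dominate the video-level score.
    """

    def __init__(self, dim=512, sharpness=5.0):
        super().__init__()
        self.instance_scorer = nn.Linear(dim, 1)
        self.sharpness = sharpness

    def forward(self, instances):
        # instances: (B, N, dim) embeddings of the N faces in one video (bag)
        scores = torch.sigmoid(self.instance_scorer(instances)).squeeze(-1)  # (B, N)
        # Sharp pooling: softmax weights concentrate on the highest scores.
        weights = torch.softmax(self.sharpness * scores, dim=1)
        return (weights * scores).sum(dim=1)  # (B,) bag-level fake probability
```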
- Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [51.110453988705395]
Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL)
To achieve a complete model of video-based person Re-ID, a multi-task framework with Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
arXiv Detail & Related papers (2020-06-13T09:15:38Z)
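As background for the metric-learning component in the entry above, a standard batch-hard triplet loss looks like the sketch below (the generic formulation that attribute-aware variants such as AITL build on; this is not AITL itself):

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss sketch (generic, not the AITL formulation).

    For each anchor, use the hardest positive (same identity, farthest)
    and the hardest negative (different identity, closest).
    """
    dist = torch.cdist(embeddings, embeddings)         # (B, B) pairwise L2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) identity match
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    # Anchors without a valid positive/negative contribute zero loss.
    return torch.relu(pos - neg + margin).mean()
```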
This list is automatically generated from the titles and abstracts of the papers on this site.