Spatiotemporal Inconsistency Learning for DeepFake Video Detection
- URL: http://arxiv.org/abs/2109.01860v2
- Date: Tue, 7 Sep 2021 09:05:29 GMT
- Title: Spatiotemporal Inconsistency Learning for DeepFake Video Detection
- Authors: Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue
Huang, Lizhuang Ma
- Abstract summary: We present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both the horizontal and vertical directions.
The ISM simultaneously utilizes the spatial information from SIM and the temporal information from TIM to establish a more comprehensive spatial-temporal representation.
- Score: 51.747219106855624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of facial manipulation techniques has aroused public
concerns in recent years. Following the success of deep learning, existing
methods always formulate DeepFake video detection as a binary classification
problem and develop frame-based and video-based solutions. However, little
attention has been paid to capturing the spatial-temporal inconsistency in
forged videos. To address this issue, we term this task as a Spatial-Temporal
Inconsistency Learning (STIL) process and instantiate it into a novel STIL
block, which consists of a Spatial Inconsistency Module (SIM), a Temporal
Inconsistency Module (TIM), and an Information Supplement Module (ISM).
Specifically, we present a novel temporal modeling paradigm in TIM by
exploiting the temporal difference over adjacent frames along both the
horizontal and vertical directions. The ISM simultaneously utilizes the
spatial information from SIM and the temporal information from TIM to establish
a more comprehensive spatial-temporal representation. Moreover, our STIL block
is flexible and can be plugged into existing 2D CNNs. Extensive experiments and
visualizations are presented to demonstrate the effectiveness of our method
against the state-of-the-art competitors.
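As a rough illustration of the design described in the abstract, below is a minimal PyTorch-style sketch of a STIL-like block acting on (batch, time, channel, height, width) features: a TIM-style path takes temporal differences over adjacent frames and processes them along the horizontal and vertical directions, a SIM-style path models per-frame spatial cues, and an ISM-style fusion combines the two with a residual connection so the block can be dropped into a 2D CNN stage. All layer choices, channel sizes, and class names here are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only; layer choices and fusion scheme are assumptions,
# not the implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalInconsistencyModule(nn.Module):
    """TIM-style path: temporal differences over adjacent frames, processed
    along the horizontal and vertical directions separately."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_v = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        diff = x[:, 1:] - x[:, :-1]                   # difference over adjacent frames
        diff = F.pad(diff, (0, 0, 0, 0, 0, 0, 0, 1))  # pad time axis back to length t
        d = diff.reshape(b * t, c, h, w)
        out = self.conv_h(d) + self.conv_v(d)         # horizontal + vertical directions
        return out.reshape(b, t, c, h, w)


class SpatialInconsistencyModule(nn.Module):
    """SIM-style path: per-frame spatial features (a plain 3x3 conv stands in)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        y = self.conv(x.reshape(b * t, c, h, w))
        return y.reshape(b, t, c, h, w)


class STILBlock(nn.Module):
    """Fuses the spatial and temporal paths (the ISM role) with a residual
    connection, so the block can be inserted into an existing 2D CNN stage."""

    def __init__(self, channels: int):
        super().__init__()
        self.sim = SpatialInconsistencyModule(channels)
        self.tim = TemporalInconsistencyModule(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        s = self.sim(x)
        d = self.tim(x)
        fused = self.fuse(torch.cat([s, d], dim=2).reshape(b * t, 2 * c, h, w))
        return x + fused.reshape(b, t, c, h, w)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64, 56, 56)   # 2 clips, 8 frames, 64-channel features
    print(STILBlock(64)(clip).shape)       # torch.Size([2, 8, 64, 56, 56])
```

Because the block preserves the input shape and adds a residual connection, it can in principle replace or wrap a convolutional block of a 2D backbone without changing the surrounding architecture.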
Related papers
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - ASF-Net: Robust Video Deraining via Temporal Alignment and Online
Adaptive Learning [47.10392889695035]
We propose a new computational paradigm, Alignment-Shift-Fusion Network (ASF-Net), which incorporates a temporal shift module.
We construct a LArge-scale RAiny video dataset (LARA) which supports the development of this community.
Our proposed approach exhibits superior performance in three benchmarks and compelling visual quality in real-world scenarios.
arXiv Detail & Related papers (2023-09-02T14:50:13Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL)
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter- snippet Temporal Contrastive Graphs (TCG)
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Deep Video Matting via Spatio-Temporal Alignment and Aggregation [63.6870051909004]
We propose a deep learning-based video matting framework which employs a novel spatio-temporal feature aggregation module (STFAM).
To eliminate frame-by-frame trimap annotations, a lightweight interactive trimap propagation network is also introduced.
Our framework significantly outperforms conventional video matting and deep image matting methods.
arXiv Detail & Related papers (2021-04-22T17:42:08Z) - Learning Self-Similarity in Space and Time as Generalized Motion for
Action Recognition [42.175450800733785]
We propose a rich motion representation based on spatio-temporal self-similarity (STSS).
We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it.
The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
arXiv Detail & Related papers (2021-02-14T07:32:55Z) - Fast Video Salient Object Detection via Spatiotemporal Knowledge
Distillation [20.196945571479002]
We present a lightweight network tailored for video salient object detection.
Specifically, we combine a saliency guidance embedding structure and spatial knowledge distillation to refine the spatial features.
In the temporal aspect, we propose a temporal knowledge distillation strategy, which allows the network to learn the robust temporal features.
arXiv Detail & Related papers (2020-10-20T04:48:36Z) - CTM: Collaborative Temporal Modeling for Action Recognition [11.467061749436356]
We propose a Collaborative Temporal Modeling (CTM) block to learn temporal information for action recognition.
CTM includes two collaborative paths: a spatial-aware temporal modeling path, and a spatial-unaware temporal modeling path.
Experiments on several popular action recognition datasets demonstrate that CTM blocks bring performance improvements over 2D CNN baselines.
arXiv Detail & Related papers (2020-02-08T12:14:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.