Exploring Temporal Coherence for More General Video Face Forgery Detection
- URL: http://arxiv.org/abs/2108.06693v1
- Date: Sun, 15 Aug 2021 08:45:37 GMT
- Title: Exploring Temporal Coherence for More General Video Face Forgery Detection
- Authors: Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, Fang Wen
- Abstract summary: We propose a novel end-to-end framework that consists of two major stages.
The first stage is a fully temporal convolution network (FTCN), which reduces the spatial convolution kernel size to 1 while keeping the temporal convolution kernel size unchanged.
The second stage is a Temporal Transformer network, which aims to explore long-term temporal coherence.
- Score: 22.003901822221227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although current face manipulation techniques achieve impressive performance
in terms of quality and controllability, they struggle to generate temporally
coherent face videos. In this work, we explore taking full advantage of
temporal coherence for video face forgery detection. To achieve this, we
propose a novel end-to-end framework that consists of two major stages. The
first stage is a fully temporal convolution network (FTCN). The key insight of
FTCN is to reduce the spatial convolution kernel size to 1 while keeping the
temporal convolution kernel size unchanged. We surprisingly find that this
special design both helps the model extract temporal features and improves its
generalization capability. The second stage is a Temporal Transformer network,
which aims to explore long-term temporal coherence. The proposed framework is
general and flexible: it can be trained directly from scratch without any
pre-trained models or external datasets. Extensive experiments show that our
framework outperforms existing methods and remains effective when applied to
detect new types of face forgery videos.
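To make the two-stage design concrete, below is a minimal PyTorch-style sketch of the idea the abstract describes: 3D convolutions whose spatial kernel size is reduced to 1 (so they aggregate information only along time), followed by a small Transformer encoder over the resulting per-frame tokens. This is a hedged illustration only; the channel widths, depths, pooling scheme, and class names are assumptions and not the authors' actual architecture.

```python
# Illustrative sketch of the two-stage idea (not the authors' code).
# Assumed input: video clips shaped (B, C, T, H, W); all sizes are made up.
import torch
import torch.nn as nn

class FullyTemporalConvBlock(nn.Module):
    """Conv3d with spatial kernel size 1: mixes information only along time."""
    def __init__(self, in_ch, out_ch, t_kernel=3):
        super().__init__()
        self.conv = nn.Conv3d(
            in_ch, out_ch,
            kernel_size=(t_kernel, 1, 1),      # temporal kernel kept, spatial kernel = 1
            padding=(t_kernel // 2, 0, 0))
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C, T, H, W)
        return self.act(self.bn(self.conv(x)))

class TwoStageSketch(nn.Module):
    def __init__(self, dim=64, depth_ftcn=4, n_heads=4, depth_tt=2):
        super().__init__()
        layers, ch = [], 3
        for _ in range(depth_ftcn):              # stage 1: stack of fully temporal convolutions
            layers.append(FullyTemporalConvBlock(ch, dim))
            ch = dim
        self.ftcn = nn.Sequential(*layers)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(enc, num_layers=depth_tt)  # stage 2
        self.head = nn.Linear(dim, 2)             # real / fake logits

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        feat = self.ftcn(clip)                    # (B, dim, T, H, W)
        feat = feat.mean(dim=(3, 4))              # pool out space -> (B, dim, T)
        tokens = feat.transpose(1, 2)             # (B, T, dim): one token per frame
        tokens = self.temporal_transformer(tokens)
        return self.head(tokens.mean(dim=1))      # clip-level prediction

# Usage sketch:
# model = TwoStageSketch()
# logits = model(torch.randn(2, 3, 16, 112, 112))   # -> (2, 2)
```

Spatial pooling before the Transformer is one simple way to hand the second stage a purely temporal token sequence; the real FTCN is a deeper 3D backbone, so treat this only as a reading aid for the kernel-size-1 idea.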
Related papers
- UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition [60.84084172829169]
Adapting large pre-trained image models to few-shot action recognition has proven to be an effective strategy for learning robust feature extractors.
We present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), a novel tuning framework well-suited for few-shot action recognition.
arXiv Detail & Related papers (2023-12-03T15:40:10Z)
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
- Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection [22.536129731902783]
We propose a Latent Spatiotemporal Adaptation (LAST) approach to facilitate generalized face forgery video detection.
We first model the spatiotemporal patterns of face videos by incorporating a lightweight CNN to extract local spatial features of each frame.
Then we learn the long-term spatiotemporal representations of videos in latent space, which should contain more clues than those in pixel space.
arXiv Detail & Related papers (2023-09-09T13:40:44Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce TSViT, a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder; a toy sketch of this kind of space-time tokenization appears after this list.
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
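The TSViT entry above describes splitting a satellite image time series into non-overlapping space-time patches that are then tokenized for a factorized encoder. Below is a toy, assumption-laden sketch of that tokenization step only; the patch sizes, tensor layout, and function name are illustrative and this is not the TSViT implementation.

```python
# Toy sketch of non-overlapping space-time patch tokenization (illustration only).
import torch

def spacetime_tokenize(x, patch_t=2, patch_hw=4):
    # x: (B, T, C, H, W) image time series; T, H, W assumed divisible by the patch sizes
    B, T, C, H, W = x.shape
    x = x.reshape(B, T // patch_t, patch_t, C,
                  H // patch_hw, patch_hw, W // patch_hw, patch_hw)
    # gather each (patch_t x patch_hw x patch_hw) cube into a single flat token
    x = x.permute(0, 1, 4, 6, 2, 3, 5, 7).reshape(
        B, -1, patch_t * C * patch_hw * patch_hw)
    return x   # (B, num_tokens, token_dim)

# Usage sketch:
# tokens = spacetime_tokenize(torch.randn(1, 12, 4, 32, 32))   # -> (1, 384, 128)
```

A factorized temporo-spatial encoder would then attend over these tokens along the temporal and spatial axes separately, but that part is omitted here.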