TALL: Thumbnail Layout for Deepfake Video Detection
- URL: http://arxiv.org/abs/2307.07494v3
- Date: Sun, 18 Feb 2024 01:58:02 GMT
- Title: TALL: Thumbnail Layout for Deepfake Video Detection
- Authors: Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, Ran He
- Abstract summary: This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL).
TALL transforms a video clip into a pre-defined layout that preserves spatial and temporal dependencies.
Inspired by the success of vision transformers, we incorporate TALL into the Swin Transformer, forming an efficient and effective method, TALL-Swin.
- Score: 84.12790488801264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing threats of deepfakes to society and cybersecurity have raised
enormous public concerns, and increasing efforts have been devoted to this
critical topic of deepfake video detection. Existing video methods achieve good
performance but are computationally intensive. This paper introduces a simple
yet effective strategy named Thumbnail Layout (TALL), which transforms a video
clip into a pre-defined layout that preserves spatial and temporal
dependencies. Specifically, a fixed region is masked in each of the consecutive
frames to improve generalization; the frames are then resized into sub-images
and rearranged into a pre-defined layout to form the thumbnail. TALL is
model-agnostic and extremely simple, requiring only a few lines of code to be
changed. Inspired by the success of vision transformers, we incorporate TALL
into the Swin Transformer, forming an efficient and effective method,
TALL-Swin. Extensive intra-dataset and cross-dataset experiments demonstrate
the effectiveness and superiority of TALL and the state-of-the-art TALL-Swin.
TALL-Swin achieves 90.79% AUC on the challenging cross-dataset task,
FaceForensics++ → Celeb-DF. The code is available at
https://github.com/rainy-xu/TALL4Deepfake.
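As a rough sketch of the transformation described above (not the authors' implementation; see the repository for that), the following PyTorch snippet builds a thumbnail from a short clip. The grid shape, sub-image size, mask ratio, and the function name thumbnail_layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def thumbnail_layout(clip, grid=(2, 2), sub_size=112, mask_ratio=0.25):
    """Collapse a clip of consecutive frames into one thumbnail image.

    clip: (T, C, H, W) float tensor; T must equal grid[0] * grid[1].
    Returns a (C, grid[0] * sub_size, grid[1] * sub_size) tensor.
    """
    T, C, H, W = clip.shape
    rows, cols = grid
    assert T == rows * cols, "clip length must fill the layout"

    # Mask the same region in every frame: the position is sampled once
    # per clip, then shared across frames (the abstract's fixed-position
    # masking, intended to improve generalization).
    mh, mw = int(H * mask_ratio), int(W * mask_ratio)
    top = torch.randint(0, H - mh + 1, (1,)).item()
    left = torch.randint(0, W - mw + 1, (1,)).item()
    clip = clip.clone()
    clip[:, :, top:top + mh, left:left + mw] = 0.0

    # Resize each frame into a sub-image.
    subs = F.interpolate(clip, size=(sub_size, sub_size),
                         mode="bilinear", align_corners=False)

    # Rearrange the sub-images into the pre-defined grid layout.
    subs = subs.view(rows, cols, C, sub_size, sub_size)
    return subs.permute(2, 0, 3, 1, 4).reshape(
        C, rows * sub_size, cols * sub_size)
```

With a 2x2 grid of 112x112 sub-images, a four-frame clip collapses into a single 224x224 thumbnail, so an unmodified image backbone such as a Swin Transformer can consume it directly; this is what makes the strategy model-agnostic.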
Related papers
- Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection [41.35861722481721]
Deepfake threats to society and cybersecurity have provoked significant public apprehension.
This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL).
TALL transforms a video clip into a pre-defined layout that preserves spatial and temporal dependencies.
arXiv Detail & Related papers (2024-03-15T12:48:44Z)
- LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing.
We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts.
Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15T02:39:25Z)
- Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z)
- Deepfake Video Detection with Spatiotemporal Dropout Transformer [32.577096083927884]
This paper proposes a simple yet effective patch-level approach to facilitate deepfake video detection via a dropout transformer.
The approach reorganizes each input video into a bag of patches, which is then fed into a vision transformer to achieve robust representation.
arXiv Detail & Related papers (2022-07-14T02:04:42Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
- Unsupervised Visual Representation Learning by Tracking Patches in Video [88.56860674483752]
We propose to use tracking as a proxy task for a computer vision system to learn visual representations.
Modeled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations.
arXiv Detail & Related papers (2021-05-06T09:46:42Z)
- Sharp Multiple Instance Learning for DeepFake Video Detection [54.12548421282696]
We introduce a new problem of partial face attacks in DeepFake videos, where only video-level labels are provided and not all faces in the fake videos are manipulated.
A sharp MIL (S-MIL) is proposed, which builds a direct mapping from instance embeddings to bag predictions.
Experiments on FFPMS and the widely used DFDC dataset verify that S-MIL is superior to its counterparts for partially attacked DeepFake video detection.
arXiv Detail & Related papers (2020-08-11T08:52:17Z)