Spatio-temporal Co-attention Fusion Network for Video Splicing
Localization
- URL: http://arxiv.org/abs/2309.09482v1
- Date: Mon, 18 Sep 2023 04:46:30 GMT
- Title: Spatio-temporal Co-attention Fusion Network for Video Splicing
Localization
- Authors: Man Lin, Gang Cao, Zijie Lou
- Abstract summary: A three-stream network is used as an encoder to capture manipulation traces across multiple frames.
A lightweight multilayer perceptron (MLP) decoder is adopted to yield a pixel-level tampering localization map.
A new large-scale video splicing dataset is created for training the SCFNet.
- Score: 2.3838507844983248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Digital video splicing has become easy and ubiquitous. Malicious users copy
regions of one video and paste them into another to create realistic forgeries.
Blindly detecting such forged regions in videos is therefore important.
In this paper, a spatio-temporal co-attention fusion network (SCFNet) is
proposed for video splicing localization. Specifically, a three-stream network
is used as an encoder to capture manipulation traces across multiple frames.
The deep interaction and fusion of spatio-temporal forensic features are
achieved by the novel parallel and cross co-attention fusion modules. A
lightweight multilayer perceptron (MLP) decoder is adopted to yield a
pixel-level tampering localization map. A new large-scale video splicing
dataset is created for training the SCFNet. Extensive tests on benchmark
datasets show that the localization and generalization performances of our
SCFNet outperform the state-of-the-art. Code and datasets will be available at
https://github.com/multimediaFor/SCFNet.
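As a reading aid, the sketch below shows one plausible realization of the pipeline the abstract describes: three encoder streams, co-attention fusion of their features, and a lightweight MLP decoder producing a pixel-level localization map. It is a minimal PyTorch sketch under assumed design choices; the stream types (RGB, noise residual, frame difference), module names, and dimensions are hypothetical and are not taken from the authors' repository linked above.

```python
# Minimal, hypothetical PyTorch sketch of an SCFNet-style pipeline:
# three encoder streams -> co-attention fusion -> lightweight MLP decoder.
# Module names, stream choices, and sizes are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionFusion(nn.Module):
    """Fuse two token streams with cross-attention in both directions (assumed design)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, tokens, dim); each stream queries the other.
        a2b, _ = self.attn_ab(a, b, b)
        b2a, _ = self.attn_ba(b, a, a)
        return self.proj(torch.cat([a2b, b2a], dim=-1))


class SpliceLocalizer(nn.Module):
    """Three feature streams -> co-attention fusion -> per-pixel tampering map."""

    def __init__(self, in_ch: int = 3, dim: int = 64, num_streams: int = 3):
        super().__init__()
        # One small conv encoder per stream (e.g. RGB, noise residual, frame difference).
        self.encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, dim, 3, stride=4, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(dim, dim, 3, stride=2, padding=1),
            )
            for _ in range(num_streams)
        ])
        self.fuse_ab = CoAttentionFusion(dim)
        self.fuse_c = CoAttentionFusion(dim)
        # Lightweight MLP decoder: per-token scores, reshaped into a coarse mask.
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, streams):
        feats = [enc(x) for enc, x in zip(self.encoders, streams)]
        b, c, h, w = feats[0].shape
        tokens = [f.flatten(2).transpose(1, 2) for f in feats]   # (b, h*w, c)
        fused = self.fuse_ab(tokens[0], tokens[1])
        fused = self.fuse_c(fused, tokens[2])
        logits = self.decoder(fused).transpose(1, 2).reshape(b, 1, h, w)
        # Upsample back to input resolution for a pixel-level localization map.
        return torch.sigmoid(F.interpolate(logits, scale_factor=8,
                                           mode="bilinear", align_corners=False))


if __name__ == "__main__":
    frames = [torch.randn(2, 3, 64, 64) for _ in range(3)]  # three input streams
    print(SpliceLocalizer()(frames).shape)                   # torch.Size([2, 1, 64, 64])
```

The CoAttentionFusion block here simply runs cross-attention in both directions and projects the concatenation back to the feature width; it is only a stand-in for the paper's parallel and cross co-attention fusion modules.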
Related papers
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Collect-and-Distribute Transformer for 3D Point Cloud Analysis [82.03517861433849]
We propose a new transformer network equipped with a collect-and-distribute mechanism to communicate short- and long-range contexts of point clouds.
Results show the effectiveness of the proposed CDFormer, delivering several new state-of-the-art performances on point cloud classification and segmentation tasks.
arXiv Detail & Related papers (2023-06-02T03:48:45Z)
- Adjacent Context Coordination Network for Salient Object Detection in Optical Remote Sensing Images [102.75699068451166]
We propose a novel Adjacent Context Coordination Network (ACCoNet) to explore the coordination of adjacent features in an encoder-decoder architecture for optical RSI-SOD.
The proposed ACCoNet outperforms 22 state-of-the-art methods under nine evaluation metrics, and runs up to 81 fps on a single NVIDIA Titan X GPU.
arXiv Detail & Related papers (2022-03-25T14:14:55Z)
- PINs: Progressive Implicit Networks for Multi-Scale Neural Representations [68.73195473089324]
We propose a progressive positional encoding, exposing a hierarchical structure to incremental sets of frequency encodings.
Our model accurately reconstructs scenes with wide frequency bands and learns a scene representation at progressive levels of detail.
Experiments on several 2D and 3D datasets show improvements in reconstruction accuracy, representational capacity and training speed compared to baselines.
arXiv Detail & Related papers (2022-02-09T20:33:37Z)
- Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- MSCFNet: A Lightweight Network With Multi-Scale Context Fusion for Real-Time Semantic Segmentation [27.232578592161673]
We devise a novel lightweight network using a multi-scale context fusion scheme (MSCFNet).
The proposed MSCFNet contains only 1.15M parameters, achieves 71.9% Mean IoU and can run at over 50 FPS on a single Titan XP GPU configuration.
arXiv Detail & Related papers (2021-03-24T08:28:26Z)
- GCF-Net: Gated Clip Fusion Network for Video Action Recognition [11.945392734711056]
We introduce the Gated Clip Fusion Network (GCF-Net) for video action recognition.
GCF-Net explicitly models the inter-dependencies between video clips to strengthen the receptive field of local clip descriptors.
On a large benchmark dataset (Kinetics-600), the proposed GCF-Net elevates the accuracy of existing action classifiers by 11.49%.
arXiv Detail & Related papers (2021-02-02T03:51:55Z)
- Deep Video Inpainting Detection [95.36819088529622]
Video inpainting detection localizes an inpainted region in a video both spatially and temporally.
VIDNet, the Video Inpainting Detection Network, contains a two-stream encoder-decoder architecture with an attention module.
arXiv Detail & Related papers (2021-01-26T20:53:49Z)
- Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425]
TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation.
We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
arXiv Detail & Related papers (2020-04-03T22:43:32Z)
- CRVOS: Clue Refining Network for Video Object Segmentation [5.947279761429668]
We propose a real-time network, Clue Refining Network for Video Object Segmentation (CRVOS), that does not have any intermediate networks, allowing it to handle real-time scenarios efficiently.
Our proposed method runs at the fastest fps among existing methods while maintaining competitive accuracy.
On DAVIS 2016, our method achieves 63.5 fps and a J&F score of 81.6%.
arXiv Detail & Related papers (2020-02-10T10:55:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.