ViStripformer: A Token-Efficient Transformer for Versatile Video
Restoration
- URL: http://arxiv.org/abs/2312.14502v1
- Date: Fri, 22 Dec 2023 08:05:38 GMT
- Title: ViStripformer: A Token-Efficient Transformer for Versatile Video
Restoration
- Authors: Fu-Jen Tsai, Yan-Tsung Peng, Chen-Yu Chang, Chan-Yu Li, Yen-Yu Lin,
Chung-Chi Tsai, and Chia-Wen Lin
- Abstract summary: ViStripformer is an effective and efficient transformer architecture with much lower memory usage than the vanilla transformer.
It decomposes video frames into strip-shaped features in horizontal and vertical directions for Intra-SA and Inter-SA to address degradation patterns with various orientations and magnitudes.
- Score: 42.356013390749204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video restoration is a low-level vision task that seeks to restore clean,
sharp videos from quality-degraded frames. Exploiting temporal information from
adjacent frames is key to successful video restoration.
Recently, the success of the Transformer has drawn considerable attention in the
computer-vision community. However, its self-attention mechanism consumes
substantial memory, making it unsuitable for high-resolution vision tasks like video
restoration. In this paper, we propose ViStripformer (Video Stripformer), which
utilizes spatio-temporal strip attention to capture long-range data correlations,
consisting of intra-frame strip attention (Intra-SA) and inter-frame strip
attention (Inter-SA) for extracting spatial and temporal information. It
decomposes video frames into strip-shaped features in horizontal and vertical
directions for Intra-SA and Inter-SA to address degradation patterns with
various orientations and magnitudes. Moreover, ViStripformer is an effective and
efficient transformer architecture with much lower memory usage than the
vanilla transformer. Extensive experiments show that the proposed model
achieves superior results with fast inference time on video restoration tasks,
including video deblurring, demoireing, and deraining.
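To make the strip-attention idea above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the pooling used to form inter-frame tokens, the shared attention module, and the head count are illustrative assumptions made here. It only shows how attending along horizontal and vertical strips keeps each attention call over W, H, or T tokens instead of the full T x H x W token set, which is where the memory saving over vanilla self-attention comes from.

```python
# Minimal sketch of intra-/inter-frame strip attention (not the authors' code).
# Projection layout, head count, pooling, and module sharing are assumptions.
import torch
import torch.nn as nn


def strip_tokens_h(x):
    """Treat each row of each frame as a horizontal strip of W tokens.
    x: (B, T, C, H, W) -> (B*T*H, W, C)"""
    b, t, c, h, w = x.shape
    return x.permute(0, 1, 3, 4, 2).reshape(b * t * h, w, c)


def strip_tokens_v(x):
    """Treat each column of each frame as a vertical strip of H tokens.
    x: (B, T, C, H, W) -> (B*T*W, H, C)"""
    b, t, c, h, w = x.shape
    return x.permute(0, 1, 4, 3, 2).reshape(b * t * w, h, c)


class IntraStripAttention(nn.Module):
    """Intra-frame strip attention: tokens attend along a strip inside one frame."""

    def __init__(self, dim, heads=4):
        super().__init__()
        # A single attention module is shared across both directions for brevity.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        # Horizontal strips: W tokens per attention call instead of H*W per frame.
        h_tok = strip_tokens_h(x)
        h_out, _ = self.attn(h_tok, h_tok, h_tok)
        out = h_out.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)
        # Vertical strips applied on the horizontally attended features.
        v_tok = strip_tokens_v(out)
        v_out, _ = self.attn(v_tok, v_tok, v_tok)
        return v_out.reshape(b, t, w, h, c).permute(0, 1, 4, 3, 2)


class InterStripAttention(nn.Module):
    """Inter-frame strip attention: the same strip location attends across T frames.
    Pooling each strip to one token is a simplification made for this sketch."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        tok = x.mean(dim=4)                     # pool each horizontal strip: (B, T, C, H)
        tok = tok.permute(0, 3, 1, 2).reshape(b * h, t, c)
        out, _ = self.attn(tok, tok, tok)       # attention over the T temporal positions
        out = out.reshape(b, h, t, c).permute(0, 2, 3, 1).unsqueeze(-1)
        return x + out                          # broadcast over W as a residual


if __name__ == "__main__":
    feats = torch.randn(1, 5, 32, 64, 64)       # (B, T, C, H, W) toy feature maps
    feats = IntraStripAttention(32)(feats)
    feats = InterStripAttention(32)(feats)
    print(feats.shape)                          # torch.Size([1, 5, 32, 64, 64])
```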
Related papers
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
- SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens (a toy sketch of both sparsity forms appears after this list).
arXiv Detail & Related papers (2023-04-18T08:17:58Z)
- Video Event Restoration Based on Keyframes for Video Anomaly Detection [9.18057851239942]
Existing deep neural network based video anomaly detection (VAD) methods mostly follow the route of frame reconstruction or frame prediction.
We introduce a brand-new VAD paradigm to break through these limitations.
We propose a novel U-shaped Swin Transformer Network with Dual Skip Connections (USTN-DSC) for video event restoration.
arXiv Detail & Related papers (2023-04-11T10:13:19Z)
- Recurrent Video Restoration Transformer with Guided Deformable Attention [116.1684355529431]
We propose RVRT, which processes local neighboring frames in parallel within a globally recurrent framework.
RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime.
arXiv Detail & Related papers (2022-06-05T10:36:09Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
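The SViTT summary above distinguishes edge sparsity from node sparsity. The toy sketch below is not the SViTT implementation; the local attention window and the token-norm saliency heuristic are stand-ins chosen here only to illustrate the two ideas: restricting which query-key pairs may attend to each other, and dropping uninformative tokens before attention.

```python
# Hypothetical sketch of the two sparsity forms summarized for SViTT above
# (not the authors' code). Window size and keep ratio are made-up values.
import torch
import torch.nn.functional as F


def local_edge_mask(n_tokens, window=4):
    """Boolean mask letting each query attend only to keys within a local window."""
    idx = torch.arange(n_tokens)
    return (idx[None, :] - idx[:, None]).abs() <= window    # (N, N), True = keep


def sparse_attention(q, k, v, window=4):
    """Scaled dot-product attention restricted by the edge mask ("edge sparsity")."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N, N)
    mask = local_edge_mask(q.shape[1], window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def prune_tokens(x, keep_ratio=0.5):
    """"Node sparsity": keep tokens with the largest L2 norm as a crude saliency proxy."""
    n_keep = max(1, int(x.shape[1] * keep_ratio))
    saliency = x.norm(dim=-1)                                # (B, N)
    keep = saliency.topk(n_keep, dim=1).indices              # (B, n_keep)
    keep = keep.sort(dim=1).values                           # preserve token order
    return torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, x.shape[-1]))


if __name__ == "__main__":
    x = torch.randn(2, 64, 32)                               # (batch, tokens, dim)
    x = prune_tokens(x, keep_ratio=0.5)                      # 64 -> 32 tokens
    y = sparse_attention(x, x, x, window=4)
    print(y.shape)                                           # torch.Size([2, 32, 32])
```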