Recurrent Video Restoration Transformer with Guided Deformable Attention
- URL: http://arxiv.org/abs/2206.02146v1
- Date: Sun, 5 Jun 2022 10:36:09 GMT
- Title: Recurrent Video Restoration Transformer with Guided Deformable Attention
- Authors: Jingyun Liang and Yuchen Fan and Xiaoyu Xiang and Rakesh Ranjan and
Eddy Ilg and Simon Green and Jiezhang Cao and Kai Zhang and Radu Timofte and
Luc Van Gool
- Abstract summary: We propose RVRT, which processes local neighboring frames in parallel within a globally recurrent framework.
RVRT achieves state-of-the-art performance on benchmark datasets with a balanced model size, testing memory, and runtime.
- Score: 116.1684355529431
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video restoration aims at restoring multiple high-quality frames from
multiple low-quality frames. Existing video restoration methods generally fall
into two extreme cases: they either restore all frames in parallel or restore
the video frame by frame in a recurrent way, each with its own merits and
drawbacks. Typically, the former benefits from temporal information fusion but
suffers from a large model size and intensive memory consumption; the latter
has a relatively small model size, as it shares parameters across frames, but
lacks long-range dependency modeling ability and parallelizability. In this
paper, we attempt to integrate the advantages of both cases by proposing a
recurrent video restoration transformer, namely RVRT. RVRT processes local
neighboring frames in parallel within a globally recurrent framework, achieving
a good trade-off between model size, effectiveness, and efficiency.
Specifically, RVRT divides the video into multiple clips and uses the
previously inferred clip feature to estimate the subsequent clip feature.
Within each clip, different frame features are jointly updated with implicit
feature aggregation. Across different clips, guided deformable attention is
designed for clip-to-clip alignment: it predicts multiple relevant locations
from the whole inferred clip and aggregates their features with the attention
mechanism. Extensive experiments on video super-resolution, deblurring, and
denoising show that the proposed RVRT achieves state-of-the-art performance on
benchmark datasets with a balanced model size, testing memory, and runtime.
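The two ideas in the abstract, a globally recurrent loop over clips and guided deformable attention (GDA) for clip-to-clip alignment, can be summarized in code. Below is a minimal PyTorch sketch, not the authors' implementation (their official code is released at https://github.com/JingyunLiang/RVRT): the module names, layer shapes, number of sampling points, and the way current-clip frames are paired with previous-clip frames are all illustrative assumptions.

```python
# Minimal sketch of RVRT's two key ideas (all names/shapes are assumptions):
# (1) GDA: predict several sampling locations in the previous clip's features,
#     bilinearly sample them, and fuse the samples with attention weights;
# (2) a globally recurrent, locally parallel loop over video clips.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GuidedDeformableAttention(nn.Module):
    def __init__(self, dim, n_points=9):
        super().__init__()
        self.n_points = n_points
        # one (dx, dy) offset per sampling point, predicted per pixel
        self.offset_head = nn.Conv2d(2 * dim, 2 * n_points, 3, padding=1)
        self.q_proj = nn.Conv2d(dim, dim, 1)
        self.k_proj = nn.Conv2d(dim, dim, 1)
        self.v_proj = nn.Conv2d(dim, dim, 1)

    def forward(self, query_feat, prev_feat):
        B, C, H, W = query_feat.shape
        offsets = self.offset_head(torch.cat([query_feat, prev_feat], 1))
        offsets = offsets.view(B, self.n_points, 2, H, W)

        # identity sampling grid in normalized [-1, 1] coordinates, (x, y) order
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=query_feat.device),
            torch.linspace(-1, 1, W, device=query_feat.device),
            indexing="ij")
        base = torch.stack((xs, ys), -1)                 # (H, W, 2)

        q = self.q_proj(query_feat)                      # (B, C, H, W)
        k, v = self.k_proj(prev_feat), self.v_proj(prev_feat)
        ks, vs = [], []
        for p in range(self.n_points):
            off = offsets[:, p].permute(0, 2, 3, 1)      # (B, H, W, 2)
            off = off / off.new_tensor([W / 2, H / 2])   # pixels -> grid units
            grid = (base + off).clamp(-1, 1)
            ks.append(F.grid_sample(k, grid, align_corners=True))
            vs.append(F.grid_sample(v, grid, align_corners=True))
        k = torch.stack(ks, 1)                           # (B, P, C, H, W)
        v = torch.stack(vs, 1)
        # attention over the P sampled locations at every pixel
        attn = (q.unsqueeze(1) * k).sum(2, keepdim=True) / C ** 0.5
        return (attn.softmax(1) * v).sum(1)              # (B, C, H, W)


def rvrt_like_forward(frames, clip_len, shallow, gda, clip_block, recon):
    """frames: (B, T, C, H, W). shallow/clip_block/recon are placeholder
    feature-extraction, joint-refinement, and reconstruction modules."""
    feats = [shallow(f) for f in frames.unbind(1)]
    prev, out = None, []
    for t in range(0, len(feats), clip_len):
        clip = torch.stack(feats[t:t + clip_len], 1)     # (B, N, C, H, W)
        if prev is not None:
            # simplification: pair frame i with one previous-clip frame;
            # the paper aggregates locations from the whole previous clip
            clip = clip + torch.stack(
                [gda(clip[:, i], prev[:, i % prev.size(1)])
                 for i in range(clip.size(1))], 1)
        clip = clip_block(clip)   # joint (parallel) update within the clip
        prev = clip
        out.append(clip)
    return recon(torch.cat(out, 1))
```

The point of this design is the trade-off the abstract describes: recurrence across clips keeps parameters shared and memory roughly constant in video length, while the joint update inside each clip retains some of the parallelism and temporal fusion of fully parallel methods.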
Related papers
- RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter [77.0205013713008]
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries.
To date, most state-of-the-art TVR methods rely on image-to-video transfer learning from large-scale pre-trained vision models.
We propose a sparse-and-correlated AdaPter (RAP) to fine-tune the pre-trained model with a few parameterized layers.
arXiv Detail & Related papers (2024-05-29T19:23:53Z)
- Temporal Consistency Learning of inter-frames for Video Super-Resolution [38.26035126565062]
Video super-resolution (VSR) is a task that aims to reconstruct high-resolution (HR) frames from the low-resolution (LR) reference frame and multiple neighboring frames.
Existing methods generally explore information propagation and frame alignment to improve the performance of VSR.
We propose a Temporal Consistency learning Network (TCNet) for VSR in an end-to-end manner to enhance the consistency of the reconstructed videos.
arXiv Detail & Related papers (2022-11-03T08:23:57Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z)
- Revisiting Temporal Alignment for Video Restoration [39.05100686559188]
Long-range temporal alignment is critical yet challenging for video restoration tasks.
We present a novel, generic iterative alignment module which employs a gradual refinement scheme for sub-alignments.
Our model achieves state-of-the-art performance on multiple benchmarks across a range of video restoration tasks.
arXiv Detail & Related papers (2021-11-30T11:08:52Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)