VDTR: Video Deblurring with Transformer
- URL: http://arxiv.org/abs/2204.08023v1
- Date: Sun, 17 Apr 2022 14:22:14 GMT
- Title: VDTR: Video Deblurring with Transformer
- Authors: Mingdeng Cao, Yanbo Fan, Yong Zhang, Jue Wang, Yujiu Yang
- Abstract summary: Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt Transformer for video deblurring.
- Score: 24.20183395758706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video deblurring is still an unsolved problem due to the challenging
spatio-temporal modeling process, and existing convolutional neural
network-based methods show only a limited capacity for effective spatial and
temporal modeling. This paper presents VDTR, an effective
Transformer-based model that makes the first attempt to adapt Transformer for
video deblurring. VDTR exploits the superior long-range and relation modeling
capabilities of Transformer for both spatial and temporal modeling. However, it
is challenging to design an appropriate Transformer-based model for video
deblurring due to the complicated non-uniform blurs, misalignment across
multiple frames and the high computational costs for high-resolution spatial
modeling. To address these problems, VDTR advocates performing attention within
non-overlapping windows and exploiting the hierarchical structure for
long-range dependencies modeling. For frame-level spatial modeling, we propose
an encoder-decoder Transformer that utilizes multi-scale features for
deblurring. For multi-frame temporal modeling, we adapt Transformer to fuse
multiple spatial features efficiently. Compared with CNN-based methods, the
proposed method achieves highly competitive results on both synthetic and
real-world video deblurring benchmarks, including DVD, GOPRO, REDS and BSD. We
hope such a Transformer-based architecture can serve as a powerful alternative
baseline for video deblurring and other video restoration tasks. The source
code will be available at \url{https://github.com/ljzycmd/VDTR}.
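To make the windowed-attention idea from the abstract concrete, the following is a minimal PyTorch sketch of multi-head self-attention restricted to non-overlapping windows. It is an illustrative approximation under assumed names (`WindowAttention`, `window_size`), not the authors' implementation; the official code at https://github.com/ljzycmd/VDTR is the reference.

```python
# Minimal sketch of attention within non-overlapping windows, the core idea
# the abstract describes for keeping high-resolution spatial modeling
# tractable. Illustrative only; all names here are hypothetical.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping windows.

    Full attention over an H x W feature map costs O((H*W)^2); windowed
    attention reduces this to O(H*W * window_size^2).
    Assumes dim is divisible by num_heads, and H, W by window_size.
    """

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.num_heads = num_heads
        self.window_size = window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition the feature map into (H//ws * W//ws) windows of ws*ws tokens.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Standard multi-head self-attention inside each window.
        qkv = self.qkv(x).reshape(x.shape[0], ws * ws, 3, self.num_heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(x.shape[0], ws * ws, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```

Windowing cuts the attention cost from quadratic in the number of pixels to linear in pixels (quadratic only in the fixed window size); the hierarchical multi-scale structure described in the abstract then recovers long-range dependencies by applying the same attention at coarser resolutions, with the temporal Transformer fusing the resulting per-frame features.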
Related papers
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE).
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z)
- Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatio-temporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z)
- Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
- Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
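As an aside on the last entry: one way to obtain complexity linear in the number of frames is to keep attention purely spatial within each frame while borrowing a fraction of key/value channels from adjacent frames. The sketch below illustrates only that channel-mixing idea; the function name and mixing ratio are invented for exposition and this is not claimed to be the paper's exact formulation.

```python
# Rough sketch of space-time mixing: attention stays spatial (per frame),
# but some key/value channels are shifted in from neighboring frames so
# temporal information flows at a cost linear in the number of frames.
# Illustrative only; names and ratio are assumptions.
import torch


def mix_adjacent_frames(kv: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    """kv: (B, T, N, C) keys or values for T frames of N spatial tokens.

    Replaces the first `ratio` fraction of channels with those of the
    previous frame and the last fraction with those of the next frame
    (clamped at the sequence boundaries).
    """
    B, T, N, C = kv.shape
    c = int(C * ratio)
    prev = torch.roll(kv, shifts=1, dims=1)
    nxt = torch.roll(kv, shifts=-1, dims=1)
    prev[:, 0] = kv[:, 0]    # no frame before the first one
    nxt[:, -1] = kv[:, -1]   # no frame after the last one
    mixed = kv.clone()
    mixed[..., :c] = prev[..., :c]         # channels borrowed from frame t-1
    mixed[..., C - c:] = nxt[..., C - c:]  # channels borrowed from frame t+1
    return mixed
```

Since each frame then attends only over its own N spatial tokens, the total cost is O(T * N^2) for T frames, rather than the O((T * N)^2) of full space-time attention.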