Video Super-Resolution Transformer
- URL: http://arxiv.org/abs/2106.06847v3
- Date: Tue, 4 Jul 2023 15:30:58 GMT
- Title: Video Super-Resolution Transformer
- Authors: Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool
- Abstract summary: Video super-resolution (VSR), which aims to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, the Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer, with a theoretical justification, that exploits locality information.
- Score: 85.11270760456826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video super-resolution (VSR), which aims to restore a high-resolution video
from its corresponding low-resolution version, is a spatial-temporal sequence
prediction problem. Recently, the Transformer has been gaining popularity due to
its parallel computing ability for sequence-to-sequence modeling. Thus, it
seems straightforward to apply the vision Transformer to VSR. However, the
typical Transformer block design, with a fully connected self-attention layer
and a token-wise feed-forward layer, does not fit VSR well, for two reasons.
First, the fully connected self-attention layer fails to exploit data locality,
because it relies on linear layers to compute the attention maps. Second, the
token-wise feed-forward layer lacks the feature alignment that is important for
VSR, since it processes each input token embedding independently, with no
interaction among them. In this paper, we make the first attempt to adapt the
Transformer for VSR. Specifically, to tackle the first issue, we present a
spatial-temporal convolutional self-attention layer, with a theoretical
justification, that exploits locality information. For the second issue, we
design a bidirectional optical flow-based feed-forward layer that discovers
correlations across different video frames and aligns their features. Extensive
experiments on several benchmark datasets demonstrate the effectiveness of the
proposed method. The code will be available at
https://github.com/caojiezhang/VSR-Transformer.
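To make the two proposed layers concrete, below is a minimal PyTorch sketch of each. It is not the authors' implementation (see the repository above for that): the 3x3 kernels, tensor shapes, use of full global attention, and the assumption that optical flow is supplied precomputed are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTemporalConvAttention(nn.Module):
    """Self-attention whose queries/keys/values come from 3x3 convolutions
    rather than per-token linear maps, so every token keeps local spatial
    context (addresses the first issue above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_k = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_v = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)

        def tokens(f: torch.Tensor) -> torch.Tensor:
            # (b*t, c, h, w) -> (b, t*h*w, c): one token per pixel per frame.
            return f.reshape(b, t, c, h * w).permute(0, 1, 3, 2).reshape(b, t * h * w, c)

        q = tokens(self.to_q(frames))
        k = tokens(self.to_k(frames))
        v = tokens(self.to_v(frames))
        # Joint attention over all frames and positions (spatial-temporal).
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        out = (attn @ v).reshape(b, t, h * w, c).permute(0, 1, 3, 2)
        return out.reshape(b, t, c, h, w) + x  # residual connection


def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp features (n, c, h, w) using a dense flow field (n, 2, h, w)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1) + flow.permute(0, 2, 3, 1)
    gx = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0  # normalize to [-1, 1]
    gy = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)


class FlowGuidedFeedForward(nn.Module):
    """Feed-forward stage that warps the neighbouring frames onto the current
    one before mixing, so token embeddings from different frames interact and
    align (addresses the second issue above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x, flow_fwd, flow_bwd):
        # x: (b, t, c, h, w); flow_fwd/flow_bwd: (b, t, 2, h, w) align frame
        # t-1 / t+1 onto frame t. torch.roll wraps at the clip edges; a real
        # model would pad or duplicate the boundary frames instead.
        b, t, c, h, w = x.shape
        cur = x.reshape(b * t, c, h, w)
        prev = torch.roll(x, 1, dims=1).reshape(b * t, c, h, w)
        nxt = torch.roll(x, -1, dims=1).reshape(b * t, c, h, w)
        fused = torch.cat(
            (cur,
             flow_warp(prev, flow_fwd.reshape(b * t, 2, h, w)),
             flow_warp(nxt, flow_bwd.reshape(b * t, 2, h, w))),
            dim=1,
        )
        return self.mix(fused).reshape(b, t, c, h, w) + x  # residual connection
```

In a full model the flows would come from a flow estimator (e.g., a network such as SpyNet), and attention would typically be restricted to local windows, since attending over all t*h*w tokens at once is quadratic in cost.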
Related papers
- FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction [11.146015814220858]
FiRST is an algorithm that reduces inference latency by using layer-specific routers to adaptively select a subset of transformer layers for each input sequence.
Our approach reveals that input adaptivity is critical: different task-specific middle layers play a crucial role in evolving the hidden representations, depending on the task.
arXiv Detail & Related papers (2024-10-16T12:45:35Z) - Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection [41.38587746899477]
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community.
Existing SAR change detection methods are mainly based on convolutional neural networks (CNNs).
We propose a convolution and attention mixer (CAMixer) to incorporate global attention.
arXiv Detail & Related papers (2023-09-21T12:28:23Z) - Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, Dual Aggregation Transformer, for image SR.
Our DAT aggregates features across spatial and channel dimensions, in an inter-block and intra-block dual manner.
Experiments show that DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z) - Characterization of anomalous diffusion through convolutional transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - Is Attention All NeRF Needs? [103.51023982774599]
Generalizable NeRF Transformer (GNT) is a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) on the fly from source views.
GNT achieves generalizable neural scene representation and rendering, by encapsulating two transformer-based stages.
arXiv Detail & Related papers (2022-07-27T05:09:54Z) - VDTR: Video Deblurring with Transformer [24.20183395758706]
Video deblurring is still an unsolved problem due to the challenging spatial-temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt the Transformer for video deblurring.
arXiv Detail & Related papers (2022-04-17T14:22:14Z) - EDTER: Edge Detection with Transformer [71.83960813880843]
We propose a novel transformer-based edge detector, Edge Detection TransformER (EDTER), to extract clear and crisp object boundaries and meaningful edges.
EDTER exploits the full image context information and detailed local cues simultaneously.
Experiments on BSDS500, NYUDv2, and Multicue demonstrate the superiority of EDTER in comparison with state-of-the-art methods.
arXiv Detail & Related papers (2022-03-16T11:55:55Z) - VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - TransVOS: Video Object Segmentation with Transformers [13.311777431243296]
We propose a vision transformer to fully exploit and model both the temporal and spatial relationships.
To slim down the popular two-encoder pipeline, we design a single two-path feature extractor.
Experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets.
arXiv Detail & Related papers (2021-06-01T15:56:10Z)
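As a companion to the TransVOS summary above, here is a loose sketch of the "single two-path feature extractor" idea: one shared backbone with two light input heads instead of two full encoders, whose tokens then pass through one transformer. Every concrete choice below (the conv heads, channel sizes, token layout, and the generic nn.TransformerEncoder) is an assumption for illustration, not the TransVOS implementation.

```python
import torch
import torch.nn as nn


class TwoPathExtractor(nn.Module):
    """One shared backbone with two light input heads: one path for the query
    frame (RGB) and one for reference frames with their masks (RGB + mask),
    replacing the usual pipeline of two separate encoders."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.query_head = nn.Conv2d(3, 64, 7, stride=4, padding=3)
        self.ref_head = nn.Conv2d(4, 64, 7, stride=4, padding=3)  # RGB + mask
        self.shared = nn.Sequential(  # weights shared by both paths
            nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
        )

    def forward(self, query_rgb, ref_rgb, ref_mask):
        q = self.shared(self.query_head(query_rgb))
        r = self.shared(self.ref_head(torch.cat((ref_rgb, ref_mask), dim=1)))
        return q, r


# Tokens from both paths go through a single transformer, which can model
# spatial relations within frames and temporal relations across them.
extractor = TwoPathExtractor()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)
q_feat, r_feat = extractor(
    torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64)
)
tokens = torch.cat(
    (q_feat.flatten(2).transpose(1, 2), r_feat.flatten(2).transpose(1, 2)), dim=1
)
out = encoder(tokens)  # joint spatial-temporal reasoning over both paths
```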