Learning Trajectory-Aware Transformer for Video Super-Resolution
- URL: http://arxiv.org/abs/2204.04216v1
- Date: Fri, 8 Apr 2022 03:37:39 GMT
- Title: Learning Trajectory-Aware Transformer for Video Super-Resolution
- Authors: Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian
- Abstract summary: Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
- Score: 50.49396123016185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video super-resolution (VSR) aims to restore a sequence of high-resolution
(HR) frames from their low-resolution (LR) counterparts. Although some progress
has been made, it remains a grand challenge to effectively utilize temporal
dependencies across entire video sequences. Existing approaches usually align and
aggregate video frames from a limited number of adjacent frames (e.g., 5 or 7 frames),
which prevents them from producing satisfactory results. In this paper, we
take one step further to enable effective spatio-temporal learning in videos.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution
(TTVSR). In particular, we formulate video frames into several pre-aligned
trajectories which consist of continuous visual tokens. For a query token,
self-attention is only learned on relevant visual tokens along spatio-temporal
trajectories. Compared with vanilla vision Transformers, such a design
significantly reduces the computational cost and enables Transformers to model
long-range features. We further propose a cross-scale feature tokenization
module to overcome scale-changing problems that often occur in long-range
videos. Experimental results demonstrate the superiority of the proposed TTVSR
over state-of-the-art models, through extensive quantitative and qualitative
evaluations on four widely-used video super-resolution benchmarks. Both code
and pre-trained models can be downloaded at
https://github.com/researchmm/TTVSR.
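To make the trajectory mechanism described above more concrete, here is a minimal, hypothetical sketch of attention restricted to tokens along a pre-computed trajectory. It is not the authors' implementation (see the repository above for that): the function name `trajectory_attention`, the tensor shapes, and the assumption that a trajectory is given as one token index per frame are all simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def trajectory_attention(query, tokens, trajectory):
    """Attend only to tokens lying on the query's spatio-temporal trajectory.

    query:      (C,) feature of one query token in the current frame.
    tokens:     (T, N, C) visual tokens for T frames, N tokens per frame.
    trajectory: (T,) pre-aligned trajectory for this query, given as the index
                of the token contributed by each frame (assumed precomputed,
                e.g. from motion).
    Returns a (C,) feature aggregated from the T trajectory tokens only,
    rather than from all T * N tokens as in a vanilla vision Transformer.
    """
    T, N, C = tokens.shape
    traj_tokens = tokens[torch.arange(T), trajectory]  # (T, C) tokens on the trajectory
    scores = traj_tokens @ query / C ** 0.5            # (T,) scaled dot-product scores
    weights = F.softmax(scores, dim=0)                 # (T,) attention weights
    return weights @ traj_tokens                       # (C,) aggregated feature

# Toy usage: 20 frames, 64 tokens per frame, 32-dimensional features.
tokens = torch.randn(20, 64, 32)
query = torch.randn(32)
trajectory = torch.randint(0, 64, (20,))  # stand-in for a motion-derived trajectory
print(trajectory_attention(query, tokens, trajectory).shape)  # torch.Size([32])
```

Compared with attending over all T x N tokens, the query here attends to only T tokens, which is where the reduction in computational cost mentioned in the abstract comes from; the cross-scale feature tokenization module is not sketched here.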
Related papers
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Retargeting video with an end-to-end framework [14.270721529264929]
We present an end-to-end RETVI method to retarget videos to arbitrary aspect ratios.
Our system outperforms previous work in quality and running time.
arXiv Detail & Related papers (2023-11-08T04:56:41Z)
- SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens.
arXiv Detail & Related papers (2023-04-18T08:17:58Z)
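As a rough, hypothetical illustration of those two sparsity patterns (not SViTT's actual code), the toy function below first drops low-saliency tokens (node sparsity, using a simple feature-norm score as a stand-in) and then masks disallowed query-key pairs before the softmax (edge sparsity):

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, edge_mask, keep_ratio=0.75):
    """Toy attention with the two forms of sparsity described above.

    q, k, v:    (N, C) token features.
    edge_mask:  (N, N) boolean; True where a query may attend to a key (edge sparsity).
    keep_ratio: fraction of tokens kept after dropping "uninformative" ones
                (node sparsity; the feature norm below is only a stand-in score).
    """
    N, C = q.shape
    keep = v.norm(dim=-1).topk(max(1, int(N * keep_ratio))).indices  # node sparsity
    k, v, edge_mask = k[keep], v[keep], edge_mask[:, keep]
    scores = (q @ k.t()) / C ** 0.5
    scores = scores.masked_fill(~edge_mask, float("-inf"))           # edge sparsity
    return F.softmax(scores, dim=-1) @ v

# Usage: 16 tokens, edge mask allowing attention only within a +/-4 neighbourhood.
N, C = 16, 8
q = k = v = torch.randn(N, C)
idx = torch.arange(N)
edge_mask = (idx[:, None] - idx[None, :]).abs() <= 4
print(sparse_attention(q, k, v, edge_mask).shape)  # torch.Size([16, 8])
```

Both forms shrink the attention computation: node sparsity reduces the number of keys and values, while edge sparsity removes the remaining disallowed query-key pairs.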
- VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution [75.79379734567604]
We show that Video Implicit Neural Representation (VideoINR) can be decoded to videos of arbitrary spatial resolution and frame rate.
VideoINR achieves competitive performance with state-of-the-art STVSR methods on common up-sampling scales.
arXiv Detail & Related papers (2022-06-09T17:45:49Z)
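The sketch below illustrates the general implicit-representation idea from that summary under assumed, simplified components (it is not VideoINR's actual architecture): features from a hypothetical low-resolution encoder are sampled at continuous (x, y, t) coordinates and decoded by a small MLP, so any output resolution and frame rate can be queried.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitVideoDecoder(nn.Module):
    """Toy implicit decoder: (feature at a coordinate, coordinate) -> RGB."""

    def __init__(self, feat_dim=16, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, feat_video, coords):
        """
        feat_video: (1, C, T, H, W) feature volume from an assumed LR encoder.
        coords:     (M, 3) query coordinates (x, y, t), each in [-1, 1].
        Returns (M, 3) RGB predictions at the queried space-time points.
        """
        grid = coords.view(1, -1, 1, 1, 3)                             # (1, M, 1, 1, 3)
        sampled = F.grid_sample(feat_video, grid, align_corners=True)  # trilinear sampling
        sampled = sampled.view(feat_video.shape[1], -1).t()            # (M, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))

# Usage: query a 2x-upscaled, 2x-frame-rate grid from a tiny feature volume.
feats = torch.randn(1, 16, 4, 8, 8)                           # T=4 frames of 8x8 features
t, y, x = torch.meshgrid(torch.linspace(-1, 1, 8),
                         torch.linspace(-1, 1, 16),
                         torch.linspace(-1, 1, 16), indexing="ij")
coords = torch.stack([x, y, t], dim=-1).reshape(-1, 3)        # 8 frames of 16x16 queries
print(ImplicitVideoDecoder()(feats, coords).shape)            # torch.Size([2048, 3])
```

Because the decoder is queried per coordinate, the same encoded features can be sampled at any space-time resolution, which is what enables continuous space-time super-resolution.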
- VDTR: Video Deblurring with Transformer [24.20183395758706]
Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt the Transformer architecture for video deblurring.
arXiv Detail & Related papers (2022-04-17T14:22:14Z)
- VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z)
- Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z)
- Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which represent the input video with multiple views at different resolutions.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.