RSTT: Real-time Spatial Temporal Transformer for Space-Time Video
Super-Resolution
- URL: http://arxiv.org/abs/2203.14186v1
- Date: Sun, 27 Mar 2022 02:16:26 GMT
- Title: RSTT: Real-time Spatial Temporal Transformer for Space-Time Video
Super-Resolution
- Authors: Zhicheng Geng, Luming Liang, Tianyu Ding, Ilya Zharkov
- Abstract summary: Space-time video super-resolution (STVSR) is the task of interpolating videos with both Low Frame Rate (LFR) and Low Resolution (LR) to produce High-Frame-Rate (HFR) and also High-Resolution (HR) counterparts.
We propose using a spatial-temporal transformer that naturally incorporates the spatial and temporal super resolution modules into a single model.
- Score: 13.089535703790425
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Space-time video super-resolution (STVSR) is the task of interpolating videos
with both Low Frame Rate (LFR) and Low Resolution (LR) to produce
High-Frame-Rate (HFR) and also High-Resolution (HR) counterparts. Existing
methods based on Convolutional Neural Networks (CNNs) succeed in achieving
visually satisfying results but suffer from slow inference speed due to their
heavy architectures. We propose to resolve this issue by using a
spatial-temporal transformer that naturally incorporates the spatial and
temporal super resolution modules into a single model. Unlike CNN-based
methods, we do not explicitly use separated building blocks for temporal
interpolations and spatial super-resolutions; instead, we only use a single
end-to-end transformer architecture. Specifically, a reusable dictionary is
built by encoders based on the input LFR and LR frames, which is then utilized
in the decoder part to synthesize the HFR and HR frames. Compared with the
state-of-the-art TMNet (Xu et al., 2021), our network is 60% smaller
(4.5M vs. 12.3M parameters) and 80% faster (26.2 fps vs. 14.3 fps on
720×576 frames) without sacrificing much performance. The source code is
available at https://github.com/llmpass/RSTT.
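As a rough, hypothetical sketch of the encoder/decoder split described in the abstract (module names, sizes, and the time-conditioning trick are invented for illustration, and no actual spatial upscaling is performed; see the repository above for the real model), an encoder can turn the LR input frames into a feature dictionary that is built once and then queried by a decoder for every requested output time:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Builds a reusable feature dictionary from the input LR/LFR frames."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, 3, padding=1)

    def forward(self, frames):            # frames: (T, 3, H, W)
        feats = self.conv(frames)         # (T, C, H, W)
        # Flatten spatial positions of all input frames into one token "dictionary".
        T, C, H, W = feats.shape
        return feats.permute(0, 2, 3, 1).reshape(T * H * W, C)

class TinyDecoder(nn.Module):
    """Queries the dictionary with per-output-frame queries via cross-attention."""
    def __init__(self, channels=32, out_hw=(8, 8)):
        super().__init__()
        self.out_hw = out_hw
        self.query = nn.Parameter(torch.randn(out_hw[0] * out_hw[1], channels))
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.to_rgb = nn.Linear(channels, 3)

    def forward(self, dictionary, t):      # t in [0, 1]: target time of the output frame
        q = self.query.unsqueeze(0) + t    # crude time conditioning for the sketch
        kv = dictionary.unsqueeze(0)
        out, _ = self.attn(q, kv, kv)      # the same dictionary is reused for every t
        H, W = self.out_hw
        return self.to_rgb(out).reshape(H, W, 3).permute(2, 0, 1)

# Toy usage: 4 LR input frames -> one synthesized frame at an intermediate time.
enc, dec = TinyEncoder(), TinyDecoder()
lr_frames = torch.rand(4, 3, 8, 8)
dictionary = enc(lr_frames)                # built once by the encoder
frame_half = dec(dictionary, t=0.5)        # queried per output frame by the decoder
print(frame_half.shape)                    # torch.Size([3, 8, 8])
```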
Related papers
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
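A minimal sketch of the two-stream input idea under assumed tensor shapes (the actual SlowFast-LLaVA token layout and pooling factors may differ): a slow stream keeps a few frames at full spatial detail, a fast stream keeps every sampled frame but pools it spatially, and the two token sets are concatenated before being passed to the LLM.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats, slow_stride=4, fast_pool=4):
    """frame_feats: (T, H, W, C) per-frame features from a frozen visual encoder.

    Returns one token sequence combining:
      - slow stream: every `slow_stride`-th frame, full spatial resolution
      - fast stream: all frames, spatially pooled by `fast_pool`
    """
    T, H, W, C = frame_feats.shape

    # Slow stream: few frames, all spatial tokens.
    slow = frame_feats[::slow_stride]                        # (T//stride, H, W, C)
    slow_tokens = slow.reshape(-1, C)

    # Fast stream: all frames, aggressively pooled spatial tokens.
    fast = frame_feats.permute(0, 3, 1, 2)                   # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)         # (T, C, H/p, W/p)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Concatenate both streams into the visual token sequence for the LLM.
    return torch.cat([slow_tokens, fast_tokens], dim=0)

feats = torch.rand(16, 8, 8, 256)           # 16 sampled frames, 8x8 patches, 256-d
tokens = slowfast_tokens(feats)
print(tokens.shape)                          # torch.Size([320, 256])
```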
arXiv Detail & Related papers (2024-07-22T17:58:04Z) - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient
Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
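The temporal-shift idea can be illustrated with a parameter-free channel shift across the frame dimension (a generic TSM-style sketch; where and how often Latent-Shift applies the shift inside the pretrained U-Net is not reproduced here).

```python
import torch

def temporal_shift(latents, shift_fraction=8):
    """Parameter-free temporal shift over a stack of per-frame latents.

    latents: (T, C, H, W) latent features of T video frames.
    A small slice of channels is shifted forward in time and another slice
    backward, so each frame's features mix with its neighbours at zero extra cost.
    """
    T, C, H, W = latents.shape
    fold = C // shift_fraction
    out = torch.zeros_like(latents)
    out[1:, :fold] = latents[:-1, :fold]                    # shift first chunk forward in time
    out[:-1, fold:2 * fold] = latents[1:, fold:2 * fold]    # shift second chunk backward
    out[:, 2 * fold:] = latents[:, 2 * fold:]               # remaining channels untouched
    return out

lat = torch.rand(8, 64, 32, 32)    # 8 frames of 64-channel latents
print(temporal_shift(lat).shape)   # torch.Size([8, 64, 32, 32])
```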
arXiv Detail & Related papers (2023-04-17T17:57:06Z) - VideoINR: Learning Video Implicit Neural Representation for Continuous
Space-Time Super-Resolution [75.79379734567604]
We show that Video Implicit Neural Representation (VideoINR) can be decoded to videos of arbitrary spatial resolution and frame rate.
We show that VideoINR achieves competitive performances with state-of-the-art STVSR methods on common up-sampling scales.
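The continuous decoding idea can be sketched as an MLP queried at arbitrary (x, y, t) coordinates together with a conditioning feature (a toy stand-in; the real VideoINR decodes features produced by a dedicated encoder and uses separate spatial and temporal implicit functions).

```python
import torch
import torch.nn as nn

class TinyVideoINR(nn.Module):
    """Maps a continuous space-time coordinate (x, y, t) plus a conditioning
    feature vector to an RGB value, so any resolution / frame rate can be
    sampled simply by choosing the query grid."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords, feat):       # coords: (N, 3) in [0,1], feat: (feat_dim,)
        feat = feat.expand(coords.shape[0], -1)
        return self.mlp(torch.cat([coords, feat], dim=-1))  # (N, 3) RGB values

# Query a 64x64 frame at an intermediate time t = 0.25.
model = TinyVideoINR()
H, W, t = 64, 64, 0.25
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
coords = torch.stack([xs.flatten(), ys.flatten(), torch.full((H * W,), t)], dim=-1)
rgb = model(coords, torch.rand(32)).reshape(H, W, 3)
print(rgb.shape)                           # torch.Size([64, 64, 3])
```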
arXiv Detail & Related papers (2022-06-09T17:45:49Z) - STDAN: Deformable Attention Network for Space-Time Video
Super-Resolution [39.18399652834573]
We propose a deformable attention network called STDAN for STVSR.
First, we devise a long-short term feature interpolation (LSTFI) module, which is capable of extracting abundant content from more neighboring input frames.
Second, we put forward a spatial-temporal deformable feature aggregation (STDFA) module, in which spatial and temporal contexts are adaptively captured and aggregated.
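A heavily simplified sketch of deformable feature aggregation between a current and a neighbouring frame (offsets predicted per pixel, neighbour features sampled at the offset positions, then fused; the real STDFA module additionally aggregates across multiple frames with attention weights).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDeformableAggregation(nn.Module):
    """Toy deformable aggregation: predict per-pixel offsets from the paired
    current/neighbour features, sample the neighbour feature map at the offset
    locations, and fuse the result with the current features."""
    def __init__(self, channels=16):
        super().__init__()
        self.offset_pred = nn.Conv2d(2 * channels, 2, 3, padding=1)  # (dx, dy) per pixel
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, cur, nbr):                        # both: (1, C, H, W)
        _, _, H, W = cur.shape
        offsets = self.offset_pred(torch.cat([cur, nbr], dim=1))      # (1, 2, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                                indexing="ij")
        base = torch.stack([xs, ys], dim=-1).unsqueeze(0)              # (1, H, W, 2)
        grid = base + offsets.permute(0, 2, 3, 1)                      # offset the grid

        aligned = F.grid_sample(nbr, grid, align_corners=True)         # warp neighbour
        return self.fuse(torch.cat([cur, aligned], dim=1))             # aggregate

m = TinyDeformableAggregation()
cur, nbr = torch.rand(1, 16, 32, 32), torch.rand(1, 16, 32, 32)
print(m(cur, nbr).shape)   # torch.Size([1, 16, 32, 32])
```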
arXiv Detail & Related papers (2022-03-14T03:40:35Z) - Temporal Modulation Network for Controllable Space-Time Video
Super-Resolution [66.06549492893947]
Space-time video super-resolution aims to increase the spatial and temporal resolutions of low-resolution and low-frame-rate videos.
Deformable convolution based methods have achieved promising STVSR performance, but they could only infer the intermediate frame pre-defined in the training stage.
We propose a Temporal Modulation Network (TMNet) to interpolate arbitrary intermediate frame(s) with accurate high-resolution reconstruction.
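A toy illustration of temporal modulation, where a continuous time parameter t conditions the interpolation so that arbitrary intermediate moments can be requested at inference (in TMNet the modulation acts on deformable-convolution offsets; a simple feature-wise scale/shift stands in for that here).

```python
import torch
import torch.nn as nn

class TinyTemporalModulation(nn.Module):
    """Conditions the fusion of two key-frame features on a time parameter
    t in [0, 1] via a learned feature-wise scale and shift."""
    def __init__(self, channels=16):
        super().__init__()
        self.to_scale_shift = nn.Linear(1, 2 * channels)
        self.blend = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, feat0, feat1, t):                 # features of the two key frames
        x = self.blend(torch.cat([feat0, feat1], dim=1))              # (1, C, H, W)
        scale, shift = self.to_scale_shift(torch.tensor([[t]])).chunk(2, dim=-1)
        return x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)

m = TinyTemporalModulation()
f0, f1 = torch.rand(1, 16, 32, 32), torch.rand(1, 16, 32, 32)
for t in (0.25, 0.5, 0.75):                             # any intermediate time
    print(t, m(f0, f1, t).shape)                        # torch.Size([1, 16, 32, 32])
```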
arXiv Detail & Related papers (2021-04-21T17:10:53Z) - Zooming SlowMo: An Efficient One-Stage Framework for Space-Time Video
Super-Resolution [100.11355888909102]
Space-time video super-resolution aims at generating a high-resolution (HR) slow-motion video from a low-resolution (LR) and low frame rate (LFR) video sequence.
We present a one-stage space-time video super-resolution framework, which can directly reconstruct an HR slow-motion video sequence from an input LR and LFR video.
arXiv Detail & Related papers (2021-04-15T17:59:23Z) - Efficient Space-time Video Super Resolution using Low-Resolution Flow
and Mask Upsampling [12.856102293479486]
This paper aims to generate High-resolution Slow-motion videos from Low Resolution and Low Frame rate videos.
A simplistic solution is the sequential running of Video Super Resolution and Video Frame Interpolation models.
Our model is lightweight and performs better than current state-of-the-art models on the REDS STSR validation set.
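The flow-and-mask idea can be sketched as backward-warping the two input frames with low-resolution flows, blending them with a soft mask, and upsampling the result (the networks that predict the flows and masks are omitted; names and shapes below are illustrative only).

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (1, C, H, W) by a dense pixel flow field (1, 2, H, W)."""
    _, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0)                # (1, H, W, 2)
    # Convert pixel flow to normalized coordinates before sampling.
    norm_flow = torch.stack([flow[:, 0] / ((W - 1) / 2),
                             flow[:, 1] / ((H - 1) / 2)], dim=-1)
    return F.grid_sample(frame, base + norm_flow, align_corners=True)

def interpolate_and_upscale(f0, f1, flow_t0, flow_t1, mask, scale=4):
    """Blend two flow-warped LR frames with a soft occlusion mask, then upsample."""
    warped0, warped1 = warp(f0, flow_t0), warp(f1, flow_t1)
    lr_mid = mask * warped0 + (1 - mask) * warped1                   # occlusion-aware blend
    return F.interpolate(lr_mid, scale_factor=scale, mode="bilinear",
                         align_corners=False)

f0, f1 = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
flow = torch.zeros(1, 2, 32, 32)              # dummy flows for the example
mask = torch.full((1, 1, 32, 32), 0.5)
print(interpolate_and_upscale(f0, f1, flow, flow, mask).shape)  # (1, 3, 128, 128)
```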
arXiv Detail & Related papers (2021-04-12T19:11:57Z) - Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video
Super-Resolution [95.26202278535543]
A simple solution is to split it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR).
However, temporal interpolation and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)