Video Frame Interpolation Transformer
- URL: http://arxiv.org/abs/2111.13817v1
- Date: Sat, 27 Nov 2021 05:35:10 GMT
- Title: Video Frame Interpolation Transformer
- Authors: Zhihao Shi, Xiangyu Xu, Xiaohong Liu, Jun Chen, Ming-Hsuan Yang
- Abstract summary: We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
- Score: 86.20646863821908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for video interpolation heavily rely on deep convolution
neural networks, and thus suffer from their intrinsic limitations, such as
content-agnostic kernel weights and restricted receptive field. To address
these issues, we propose a Transformer-based video interpolation framework that
allows content-aware aggregation weights and considers long-range dependencies
with the self-attention operations. To avoid the high computational cost of
global self-attention, we introduce the concept of local attention into video
interpolation and extend it to the spatial-temporal domain. Furthermore, we
propose a space-time separation strategy to save memory usage, which also
improves performance. In addition, we develop a multi-scale frame synthesis
scheme to fully realize the potential of Transformers. Extensive experiments
demonstrate the proposed model performs favorably against the state-of-the-art
methods both quantitatively and qualitatively on a variety of benchmark
datasets.
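
As a rough illustration of the local spatial-temporal attention and space-time separation strategy described above, the following PyTorch sketch applies window attention within each frame and then attention across frames. The shapes, window size, and module layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SeparableSpaceTimeAttention(nn.Module):
    def __init__(self, dim, window=4, heads=4):
        super().__init__()
        self.window = window
        # Attention within a local spatial window of each frame.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Attention across frames at each spatial position.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) features of the input frames; H, W divisible by window.
        B, T, H, W, C = x.shape
        w = self.window
        # Spatial local attention: partition each frame into w x w windows.
        xs = x.reshape(B * T, H // w, w, W // w, w, C)
        xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        xs, _ = self.spatial_attn(xs, xs, xs)
        xs = xs.reshape(B * T, H // w, W // w, w, w, C)
        xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(B, T, H, W, C)
        # Temporal attention: attend across the T frames at every pixel.
        xt = xs.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

frames = torch.randn(1, 2, 64, 64, 32)              # two input frames, 32-dim features
out = SeparableSpaceTimeAttention(dim=32)(frames)   # (1, 2, 64, 64, 32)
```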
Related papers
- Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements [83.5820690348833]
We present a framework for low-level vision tasks that does not require any external training data corpus.
Our approach learns the weights of neural modules by optimizing over the corrupted test sequence, leveraging its spatio-temporal coherence and internal statistics.
arXiv Detail & Related papers (2023-12-13T01:57:11Z)
- Video Frame Interpolation with Flow Transformer [31.371987879960287]
Video frame interpolation has been actively studied with the development of convolutional neural networks.
We propose Video Frame Interpolation Flow Transformer to incorporate motion dynamics from optical flows into the self-attention mechanism.
Our framework is suitable for interpolating frames with large motion while maintaining reasonably low complexity.
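
To make the idea of injecting motion dynamics from optical flow into self-attention concrete, here is a hedged sketch in which features from a neighboring frame are backward-warped along the flow before attention, so queries attend to motion-aligned content. The warping and attention details are assumptions, not the paper's actual design.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp feat (B, C, H, W) with flow (B, 2, H, W) given in pixels."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalize sampling locations to [-1, 1] for grid_sample.
    grid[:, 0] = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(feat, grid.permute(0, 2, 3, 1), align_corners=True)

def flow_guided_attention(q_feat, k_feat, flow):
    """q_feat, k_feat: (B, C, H, W); flow: (B, 2, H, W) from the query frame to the key frame."""
    kv_feat = warp(k_feat, flow)                          # motion-aligned keys/values
    B, C, H, W = q_feat.shape
    q = q_feat.flatten(2).transpose(1, 2)                 # (B, HW, C)
    kv = kv_feat.flatten(2).transpose(1, 2)
    attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)
    return (attn @ kv).transpose(1, 2).reshape(B, C, H, W)

f0, f1 = torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)
flow = torch.zeros(1, 2, 32, 32)              # zero flow: keys stay in place
out = flow_guided_attention(f0, f1, flow)     # (1, 16, 32, 32)
```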
arXiv Detail & Related papers (2023-07-30T06:44:37Z)
- Efficient Convolution and Transformer-Based Network for Video Frame Interpolation [11.036815066639473]
A novel method integrating a transformer encoder and convolutional features is proposed.
This network reduces the memory burden by close to 50% and runs up to four times faster at inference time.
A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with that of the transformer in modelling long-range dependencies.
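
A minimal sketch of such a dual-encoder layout, assuming a convolutional branch for local correlations and a patch-embedded transformer branch for long-range dependencies, fused by concatenation; the real network's dimensions and fusion scheme may differ.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.embed = nn.Conv2d(in_ch, dim, 8, stride=8)   # 8x8 patch embedding
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):
        local = self.conv_branch(x)                       # (B, dim, H, W)
        tokens = self.embed(x)                            # (B, dim, H/8, W/8)
        B, C, h, w = tokens.shape
        glob = self.transformer(tokens.flatten(2).transpose(1, 2))
        glob = glob.transpose(1, 2).reshape(B, C, h, w)
        glob = nn.functional.interpolate(glob, size=local.shape[-2:],
                                         mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([local, glob], dim=1))

y = DualEncoder()(torch.randn(1, 3, 64, 64))              # (1, 64, 64, 64)
```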
arXiv Detail & Related papers (2023-07-12T20:14:06Z)
- Continuous Space-Time Video Super-Resolution Utilizing Long-Range Temporal Information [48.20843501171717]
We propose a continuous ST-VSR (CSTVSR) method that can convert the given video to any frame rate and spatial resolution.
We show that the proposed algorithm has good flexibility and achieves better performance on various datasets.
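
For intuition about what "any frame rate and spatial resolution" means as an interface, the toy snippet below resamples a clip with plain trilinear interpolation; CSTVSR learns this mapping, so this is only a stand-in for the input/output contract, not the proposed method.

```python
import torch
import torch.nn.functional as F

def resample_video(clip, out_frames, out_h, out_w):
    """clip: (B, C, T, H, W) -> (B, C, out_frames, out_h, out_w)."""
    return F.interpolate(clip, size=(out_frames, out_h, out_w),
                         mode="trilinear", align_corners=False)

clip = torch.randn(1, 3, 8, 90, 160)            # 8 frames at 160x90
slowmo_hd = resample_video(clip, 32, 360, 640)  # 4x frame rate, 4x resolution
```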
arXiv Detail & Related papers (2023-02-26T08:02:39Z)
- Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
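
A loose sketch of cross-scale window attention, assuming queries from fine-resolution windows attend to keys/values taken from the matching windows of a downsampled copy of the features; the actual interaction pattern in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleWindowAttention(nn.Module):
    def __init__(self, dim=32, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W); H, W divisible by window.
        B, C, H, W = x.shape
        w = self.window
        coarse = F.avg_pool2d(x, kernel_size=2)            # half-resolution copy

        def to_windows(t, win):
            B_, C_, H_, W_ = t.shape
            t = t.reshape(B_, C_, H_ // win, win, W_ // win, win)
            return t.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, C_)

        q = to_windows(x, w)                               # fine-scale windows
        kv = to_windows(coarse, w // 2)                    # same regions, coarse scale
        out, _ = self.attn(q, kv, kv)
        out = out.reshape(B, H // w, W // w, w, w, C)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

y = CrossScaleWindowAttention()(torch.randn(1, 32, 64, 64))   # (1, 32, 64, 64)
```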
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by the transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
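
To illustrate the adaptive sampling and recovery split in general terms, a tiny end-to-end compressive sensing sketch follows, using a strided convolution as a learned block-wise sampling operator and a small decoder for recovery; CSformer's actual modules are different.

```python
import torch
import torch.nn as nn

class LearnedSamplingCS(nn.Module):
    def __init__(self, block=32, ratio=0.1):
        super().__init__()
        m = max(1, int(ratio * block * block))        # measurements per block
        self.sample = nn.Conv2d(1, m, block, stride=block, bias=False)
        self.init_recover = nn.ConvTranspose2d(m, 1, block, stride=block, bias=False)
        self.refine = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, img):                           # img: (B, 1, H, W)
        y = self.sample(img)                          # compressed measurements
        x0 = self.init_recover(y)                     # initial reconstruction
        return x0 + self.refine(x0)                   # refined output

rec = LearnedSamplingCS()(torch.randn(1, 1, 256, 256))   # (1, 1, 256, 256)
```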
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
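
The core operation behind generating adaptive filter kernels can be sketched as dynamic local filtering, where a network predicts a per-pixel kernel and applies it to the features; the memory component and the paper's exact design are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLocalFilter(nn.Module):
    def __init__(self, in_ch=16, k=3):
        super().__init__()
        self.k = k
        self.kernel_pred = nn.Conv2d(in_ch, k * k, 3, padding=1)

    def forward(self, feat):
        # feat: (B, C, H, W); predict one k*k kernel per spatial location.
        B, C, H, W = feat.shape
        k = self.k
        kernels = torch.softmax(self.kernel_pred(feat), dim=1)   # (B, k*k, H, W)
        patches = F.unfold(feat, k, padding=k // 2)              # (B, C*k*k, H*W)
        patches = patches.reshape(B, C, k * k, H, W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)       # (B, C, H, W)

out = DynamicLocalFilter()(torch.randn(1, 16, 32, 32))           # (1, 16, 32, 32)
```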
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation [24.884078497381633]
We introduce a Transformer-based approach to video object segmentation (VOS).
Our attention-based approach allows a model to learn to attend over a history of features from multiple frames.
Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness compared with the state of the art.
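
A simplified sketch of attending over a history of features from multiple frames: current-frame tokens act as queries and stacked past-frame tokens as keys/values. The sparsity and the rest of the SSTVOS design are omitted; only the basic memory attention is shown.

```python
import torch
import torch.nn as nn

dim, heads = 32, 4
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

cur = torch.randn(1, 64 * 64, dim)            # current-frame tokens (H*W, dim)
history = torch.randn(1, 3 * 64 * 64, dim)    # tokens from 3 past frames
out, _ = attn(cur, history, history)          # (1, 64*64, dim)
```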
arXiv Detail & Related papers (2021-01-21T20:06:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.