Video Frame Interpolation with Transformer
- URL: http://arxiv.org/abs/2205.07230v1
- Date: Sun, 15 May 2022 09:30:28 GMT
- Title: Video Frame Interpolation with Transformer
- Authors: Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, Jiaya Jia
- Abstract summary: We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
- Score: 55.12620857638253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video frame interpolation (VFI), which aims to synthesize intermediate frames
of a video, has made remarkable progress with the development of deep convolutional
networks over the past years. Existing methods built upon convolutional networks
generally face challenges in handling large motion due to the locality of
convolution operations. To overcome this limitation, we introduce a novel
framework, which takes advantage of Transformer to model long-range pixel
correlation among video frames. Further, our network is equipped with a novel
cross-scale window-based attention mechanism, where cross-scale windows
interact with each other. This design effectively enlarges the receptive field
and aggregates multi-scale information. Extensive quantitative and qualitative
experiments demonstrate that our method achieves new state-of-the-art results
on various benchmarks.
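To make the design concrete, below is a minimal PyTorch sketch of cross-scale window attention as the abstract describes it: queries from fine-scale windows attend jointly to their own window tokens and to pooled tokens from a coarser map that covers a larger area. The module name, pooling choice, and sizes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleWindowAttention(nn.Module):
    """Queries from fine w x w windows attend to keys/values drawn from both
    the same fine window and a pooled coarse window covering 2w x 2w."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        assert window % 4 == 0, "window must be divisible by 4"
        self.w, self.heads = window, heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, C, H, W), H/W divisible by window
        B, C, H, W = x.shape
        w, h = self.w, self.heads
        # Fine tokens: non-overlapping w x w windows.
        fine = (x.unfold(2, w, w).unfold(3, w, w)           # (B, C, nH, nW, w, w)
                 .permute(0, 2, 3, 4, 5, 1)
                 .reshape(-1, w * w, C))
        # Coarse tokens: on a 2x average-pooled map, overlapping w x w windows
        # (stride w/2, padding w/4) align one-to-one with the fine windows
        # while spanning a 2w x 2w region of the original resolution.
        coarse = (F.unfold(F.avg_pool2d(x, 2), w, stride=w // 2, padding=w // 4)
                    .view(B, C, w * w, -1)
                    .permute(0, 3, 2, 1)
                    .reshape(-1, w * w, C))
        q = self.q(fine)
        k, v = self.kv(torch.cat([fine, coarse], dim=1)).chunk(2, dim=-1)

        def heads_first(t):                                 # (N, L, C) -> (N, h, L, C/h)
            return t.view(t.shape[0], -1, h, C // h).transpose(1, 2)

        q, k, v = heads_first(q), heads_first(k), heads_first(v)
        attn = (q @ k.transpose(-2, -1)) / (C // h) ** 0.5  # (N, h, w*w, 2*w*w)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # Merge the windows back into the (B, C, H, W) layout.
        return (out.view(B, H // w, W // w, w, w, C)
                   .permute(0, 5, 1, 3, 2, 4)
                   .reshape(B, C, H, W))
```

With `window=8`, every query sees its 8x8 fine window plus 64 pooled tokens spanning a 16x16 neighborhood, which is how this kind of design can enlarge the receptive field while keeping attention windowed.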
Related papers
- Motion-aware Latent Diffusion Models for Video Frame Interpolation [51.78737270917301]
Motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity.
We propose a novel diffusion framework, motion-aware latent diffusion models (MADiff).
Our method achieves state-of-the-art performance, significantly outperforming existing approaches.
arXiv Detail & Related papers (2024-04-21T05:09:56Z)
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z)
- Video Frame Interpolation with Flow Transformer [31.371987879960287]
Video frame interpolation has been actively studied with the development of convolutional neural networks.
We propose Video Frame Interpolation Flow Transformer to incorporate motion dynamics from optical flows into the self-attention mechanism.
Our framework is suitable for interpolating frames with large motion while maintaining reasonably low complexity.
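As a rough illustration of folding optical flow into attention (a sketch under our own assumptions, not the paper's released architecture): source-frame features are backward-warped by a given flow toward the target time, and each target query attends over the warped candidates at its own pixel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Warp feat (B, C, H, W) by a pixel-space flow (B, 2, H, W)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    gx = (xs + flow[:, 0]) / (W - 1) * 2 - 1        # normalize to [-1, 1]
    gy = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)            # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class FlowGuidedAttention(nn.Module):
    """Each target-time query attends over flow-warped source features."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def forward(self, query_feat, src_feats, flows):
        # src_feats / flows: one entry per input frame (e.g. two neighbors).
        q = self.q(query_feat)                                   # (B, C, H, W)
        warped = [backward_warp(f, fl) for f, fl in zip(src_feats, flows)]
        k = torch.stack([self.k(x) for x in warped], dim=-1)     # (B, C, H, W, N)
        v = torch.stack([self.v(x) for x in warped], dim=-1)
        logits = (q.unsqueeze(-1) * k).sum(1, keepdim=True)      # (B, 1, H, W, N)
        attn = (logits / q.shape[1] ** 0.5).softmax(dim=-1)
        return (attn * v).sum(-1)                                # (B, C, H, W)
```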
arXiv Detail & Related papers (2023-07-30T06:44:37Z)
- Efficient Convolution and Transformer-Based Network for Video Frame Interpolation [11.036815066639473]
A novel method integrating a transformer encoder and convolutional features is proposed.
This network reduces the memory burden by close to 50% and runs up to four times faster during inference.
A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with those of the transformer for long-range dependencies.
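A hedged sketch of one such dual-encoder layout (the layer sizes and the concatenation-based fusion are our assumptions, not the paper's): a convolutional branch models local correlations while a transformer encoder runs on downsampled tokens, one plausible source of the reported memory and speed savings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Conv branch for local detail + transformer branch (on downsampled
    tokens, to keep attention cheap) for long-range dependencies."""
    def __init__(self, dim=64, down=4, heads=4, layers=2):
        super().__init__()
        self.down = down
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.fuse = nn.Conv2d(dim * 2, dim, 1)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        local = self.conv(x)
        # Token count shrinks by down**2, so self-attention cost drops by
        # roughly down**4 versus attending at full resolution.
        t = F.avg_pool2d(x, self.down)
        tokens = self.transformer(t.flatten(2).transpose(1, 2))   # (B, L, C)
        t = tokens.transpose(1, 2).view(B, C, H // self.down, W // self.down)
        glob = F.interpolate(t, size=(H, W), mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([local, glob], dim=1))
```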
arXiv Detail & Related papers (2023-07-12T20:14:06Z)
- Progressive Motion Context Refine Network for Efficient Video Frame Interpolation [10.369068266836154]
Flow-based frame interpolation methods have achieved great success by first modeling the optical flow between target and input frames, and then building a synthesis network for target frame generation.
We propose a novel Progressive Motion Context Refine Network (PMCRNet) to predict motion fields and image context jointly for higher efficiency.
Experiments on multiple benchmarks show that the proposed approach not only achieves favorable quantitative results but also significantly reduces model size and running time.
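One way to read the progressive joint prediction, sketched below with illustrative names (warping and synthesis details are simplified away, so this is not the released PMCRNet): a single shared block applies residual updates to the motion field and a context feature over a few steps.

```python
import torch
import torch.nn as nn

class RefineBlock(nn.Module):
    """One refinement step: residual updates to flow and context together."""
    def __init__(self, ctx=32):
        super().__init__()
        # Inputs: two RGB frames (6) + current flow (2) + current context.
        self.net = nn.Sequential(
            nn.Conv2d(6 + 2 + ctx, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2 + ctx, 3, padding=1),
        )

    def forward(self, frames, flow, context):
        out = self.net(torch.cat([frames, flow, context], dim=1))
        return flow + out[:, :2], context + out[:, 2:]

class ProgressiveRefiner(nn.Module):
    def __init__(self, steps=3, ctx=32):
        super().__init__()
        self.ctx, self.steps = ctx, steps
        self.block = RefineBlock(ctx)           # one shared block reused per step
        self.head = nn.Conv2d(ctx, 3, 3, padding=1)

    def forward(self, frame0, frame1):
        B, _, H, W = frame0.shape
        frames = torch.cat([frame0, frame1], dim=1)
        flow = frames.new_zeros(B, 2, H, W)
        context = frames.new_zeros(B, self.ctx, H, W)
        for _ in range(self.steps):
            flow, context = self.block(frames, flow, context)
        return self.head(context)               # synthesized intermediate frame
```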
arXiv Detail & Related papers (2022-11-11T06:29:03Z)
- Spatio-Temporal Multi-Flow Network for Video Frame Interpolation [3.6053802212032995]
Video frame interpolation (VFI) is a very active research topic, with applications spanning computer vision, post-production and video encoding.
We present a novel deep learning based VFI method, ST-MFNet, based on a Spatio-Temporal Multi-Flow architecture.
arXiv Detail & Related papers (2021-11-30T15:18:46Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework that integrates two complementary attributes, global context and local motion, to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
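The adaptive-kernel idea can be sketched as per-pixel dynamic filtering (the memory read/write itself is omitted and the names are ours): a small head predicts a k x k kernel at every location, which is then applied to the local neighborhood gathered by unfold.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveLocalFilter(nn.Module):
    """Apply a predicted, per-pixel k x k kernel to local neighborhoods."""
    def __init__(self, dim, k=5):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Conv2d(dim, k * k, 3, padding=1)

    def forward(self, feat, guide):
        # feat: features to filter; guide: conditioning features (e.g. a
        # motion/memory readout); both (B, C, H, W).
        B, C, H, W = feat.shape
        k = self.k
        kernels = self.kernel_head(guide).softmax(dim=1)      # (B, k*k, H, W)
        patches = F.unfold(feat, k, padding=k // 2)           # (B, C*k*k, H*W)
        patches = patches.view(B, C, k * k, H, W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)    # (B, C, H, W)
```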
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical multimodal transformer (HMT) is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
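A rough sketch of the two-stream encoding with fusion, assuming the two streams are visual and audio features (the summary leaves this implicit) and flattening the frame-shot-video hierarchy for brevity:

```python
import torch
import torch.nn as nn

def encoder(dim, layers=2, heads=4):
    layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class TwoStreamFusion(nn.Module):
    """Encode each stream separately, then fuse with a joint encoder."""
    def __init__(self, vis_dim=512, aud_dim=128, dim=256):
        super().__init__()
        self.vis_in = nn.Linear(vis_dim, dim)
        self.aud_in = nn.Linear(aud_dim, dim)
        self.vis_enc = encoder(dim)
        self.aud_enc = encoder(dim)
        self.fusion = encoder(dim)              # joint encoder over both streams
        self.score = nn.Linear(dim, 1)          # frame-importance for summarization

    def forward(self, vis, aud):
        # vis: (B, T, vis_dim) frame features; aud: (B, T, aud_dim) audio features.
        v = self.vis_enc(self.vis_in(vis))
        a = self.aud_enc(self.aud_in(aud))
        fused = self.fusion(torch.cat([v, a], dim=1))   # concat along sequence
        T = vis.shape[1]
        # Read importance scores off the visual positions of the fused sequence.
        return self.score(fused[:, :T]).squeeze(-1)     # (B, T)
```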
arXiv Detail & Related papers (2021-09-22T07:38:59Z)