Video Frame Interpolation with Transformer
- URL: http://arxiv.org/abs/2205.07230v1
- Date: Sun, 15 May 2022 09:30:28 GMT
- Title: Video Frame Interpolation with Transformer
- Authors: Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, Jiaya Jia
- Abstract summary: We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
- Score: 55.12620857638253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video frame interpolation (VFI), which aims to synthesize intermediate frames
of a video, has made remarkable progress with the development of deep convolutional
networks over the past years. Existing methods built upon convolutional networks
generally face challenges in handling large motion due to the locality of
convolution operations. To overcome this limitation, we introduce a novel
framework, which takes advantage of Transformer to model long-range pixel
correlation among video frames. Further, our network is equipped with a novel
cross-scale window-based attention mechanism, where cross-scale windows
interact with each other. This design effectively enlarges the receptive field
and aggregates multi-scale information. Extensive quantitative and qualitative
experiments demonstrate that our method achieves new state-of-the-art results
on various benchmarks.
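To make the design concrete, below is a minimal PyTorch sketch of cross-scale window attention as the abstract describes it: queries from fine-scale windows attend jointly to their own window tokens and to pooled tokens from a coarser map that covers a larger area. The module name, pooling choice, and sizes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleWindowAttention(nn.Module):
    """Queries from fine w x w windows attend to keys/values drawn from both
    the same fine window and a pooled coarse window covering 2w x 2w."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        assert window % 4 == 0, "window must be divisible by 4"
        self.w, self.heads = window, heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, C, H, W), H/W divisible by window
        B, C, H, W = x.shape
        w, h = self.w, self.heads
        # Fine tokens: non-overlapping w x w windows.
        fine = (x.unfold(2, w, w).unfold(3, w, w)           # (B, C, nH, nW, w, w)
                 .permute(0, 2, 3, 4, 5, 1)
                 .reshape(-1, w * w, C))
        # Coarse tokens: on a 2x average-pooled map, overlapping w x w windows
        # (stride w/2, padding w/4) align one-to-one with the fine windows
        # while spanning a 2w x 2w region of the original resolution.
        coarse = (F.unfold(F.avg_pool2d(x, 2), w, stride=w // 2, padding=w // 4)
                    .view(B, C, w * w, -1)
                    .permute(0, 3, 2, 1)
                    .reshape(-1, w * w, C))
        q = self.q(fine)
        k, v = self.kv(torch.cat([fine, coarse], dim=1)).chunk(2, dim=-1)

        def heads_first(t):                                 # (N, L, C) -> (N, h, L, C/h)
            return t.view(t.shape[0], -1, h, C // h).transpose(1, 2)

        q, k, v = heads_first(q), heads_first(k), heads_first(v)
        attn = (q @ k.transpose(-2, -1)) / (C // h) ** 0.5  # (N, h, w*w, 2*w*w)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # Merge the windows back into the (B, C, H, W) layout.
        return (out.view(B, H // w, W // w, w, w, C)
                   .permute(0, 5, 1, 3, 2, 4)
                   .reshape(B, C, H, W))
```

With `window=8`, every query sees its 8x8 fine window plus 64 pooled tokens spanning a 16x16 neighborhood, which is how this kind of design can enlarge the receptive field while keeping attention windowed.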
Related papers
- Motion-aware Latent Diffusion Models for Video Frame Interpolation [51.78737270917301]
Motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity.
We propose a novel diffusion framework, motion-aware latent diffusion models (MADiff).
Our method achieves state-of-the-art performance, significantly outperforming existing approaches.
arXiv Detail & Related papers (2024-04-21T05:09:56Z)
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z)
- Video Frame Interpolation with Flow Transformer [31.371987879960287]
Video frame interpolation has been actively studied with the development of convolutional neural networks.
We propose Video Frame Interpolation Flow Transformer to incorporate motion dynamics from optical flows into the self-attention mechanism.
Our framework is suitable for interpolating frames with large motion while maintaining reasonably low complexity.
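As a rough illustration of folding optical flow into attention (a sketch under our own assumptions, not the paper's released architecture): source-frame features are backward-warped by a given flow toward the target time, and each target query attends over the warped candidates at its own pixel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Warp feat (B, C, H, W) by a pixel-space flow (B, 2, H, W)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    gx = (xs + flow[:, 0]) / (W - 1) * 2 - 1        # normalize to [-1, 1]
    gy = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)            # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class FlowGuidedAttention(nn.Module):
    """Each target-time query attends over flow-warped source features."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def forward(self, query_feat, src_feats, flows):
        # src_feats / flows: one entry per input frame (e.g. two neighbors).
        q = self.q(query_feat)                                   # (B, C, H, W)
        warped = [backward_warp(f, fl) for f, fl in zip(src_feats, flows)]
        k = torch.stack([self.k(x) for x in warped], dim=-1)     # (B, C, H, W, N)
        v = torch.stack([self.v(x) for x in warped], dim=-1)
        logits = (q.unsqueeze(-1) * k).sum(1, keepdim=True)      # (B, 1, H, W, N)
        attn = (logits / q.shape[1] ** 0.5).softmax(dim=-1)
        return (attn * v).sum(-1)                                # (B, C, H, W)
```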
arXiv Detail & Related papers (2023-07-30T06:44:37Z)
- Efficient Convolution and Transformer-Based Network for Video Frame Interpolation [11.036815066639473]
A novel method integrating a transformer encoder and convolutional features is proposed.
This network reduces the memory burden by close to 50% and runs up to four times faster during inference.
A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with those of the transformer for long-range dependencies.
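A hedged sketch of one such dual-encoder layout (the layer sizes and the concatenation-based fusion are our assumptions, not the paper's): a convolutional branch models local correlations while a transformer encoder runs on downsampled tokens, one plausible source of the reported memory and speed savings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Conv branch for local detail + transformer branch (on downsampled
    tokens, to keep attention cheap) for long-range dependencies."""
    def __init__(self, dim=64, down=4, heads=4, layers=2):
        super().__init__()
        self.down = down
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.fuse = nn.Conv2d(dim * 2, dim, 1)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        local = self.conv(x)
        # Token count shrinks by down**2, so self-attention cost drops by
        # roughly down**4 versus attending at full resolution.
        t = F.avg_pool2d(x, self.down)
        tokens = self.transformer(t.flatten(2).transpose(1, 2))   # (B, L, C)
        t = tokens.transpose(1, 2).view(B, C, H // self.down, W // self.down)
        glob = F.interpolate(t, size=(H, W), mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([local, glob], dim=1))
```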
arXiv Detail & Related papers (2023-07-12T20:14:06Z)
- Progressive Motion Context Refine Network for Efficient Video Frame Interpolation [10.369068266836154]
Flow-based frame interpolation methods have achieved great success by first modeling the optical flow between target and input frames, and then building a synthesis network for target frame generation.
We propose a novel Progressive Motion Context Refine Network (PMCRNet) to predict motion fields and image context jointly for higher efficiency.
Experiments on multiple benchmarks show that the proposed approach not only achieves favorable quantitative results but also significantly reduces model size and running time.
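One way to read the progressive joint prediction, sketched below with illustrative names (warping and synthesis details are simplified away, so this is not the released PMCRNet): a single shared block applies residual updates to the motion field and a context feature over a few steps.

```python
import torch
import torch.nn as nn

class RefineBlock(nn.Module):
    """One refinement step: residual updates to flow and context together."""
    def __init__(self, ctx=32):
        super().__init__()
        # Inputs: two RGB frames (6) + current flow (2) + current context.
        self.net = nn.Sequential(
            nn.Conv2d(6 + 2 + ctx, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2 + ctx, 3, padding=1),
        )

    def forward(self, frames, flow, context):
        out = self.net(torch.cat([frames, flow, context], dim=1))
        return flow + out[:, :2], context + out[:, 2:]

class ProgressiveRefiner(nn.Module):
    def __init__(self, steps=3, ctx=32):
        super().__init__()
        self.ctx, self.steps = ctx, steps
        self.block = RefineBlock(ctx)           # one shared block reused per step
        self.head = nn.Conv2d(ctx, 3, 3, padding=1)

    def forward(self, frame0, frame1):
        B, _, H, W = frame0.shape
        frames = torch.cat([frame0, frame1], dim=1)
        flow = frames.new_zeros(B, 2, H, W)
        context = frames.new_zeros(B, self.ctx, H, W)
        for _ in range(self.steps):
            flow, context = self.block(frames, flow, context)
        return self.head(context)               # synthesized intermediate frame
```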
arXiv Detail & Related papers (2022-11-11T06:29:03Z)
- Spatio-Temporal Multi-Flow Network for Video Frame Interpolation [3.6053802212032995]
Video frame interpolation (VFI) is a very active research topic, with applications spanning computer vision, post-production and video encoding.
We present a novel deep learning based VFI method, ST-MFNet, based on a Spatio-Temporal Multi-Flow architecture.
arXiv Detail & Related papers (2021-11-30T15:18:46Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework that integrates two complementary attributes, global context and local motion, to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
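The adaptive-kernel idea can be sketched as per-pixel dynamic filtering (the memory read/write itself is omitted and the names are ours): a small head predicts a k x k kernel at every location, which is then applied to the local neighborhood gathered by unfold.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveLocalFilter(nn.Module):
    """Apply a predicted, per-pixel k x k kernel to local neighborhoods."""
    def __init__(self, dim, k=5):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Conv2d(dim, k * k, 3, padding=1)

    def forward(self, feat, guide):
        # feat: features to filter; guide: conditioning features (e.g. a
        # motion/memory readout); both (B, C, H, W).
        B, C, H, W = feat.shape
        k = self.k
        kernels = self.kernel_head(guide).softmax(dim=1)      # (B, k*k, H, W)
        patches = F.unfold(feat, k, padding=k // 2)           # (B, C*k*k, H*W)
        patches = patches.view(B, C, k * k, H, W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)    # (B, C, H, W)
```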
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical multimodal transformer (HMT) is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
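A rough sketch of the two-stream encoding with fusion, assuming the two streams are visual and audio features (the summary leaves this implicit) and flattening the frame-shot-video hierarchy for brevity:

```python
import torch
import torch.nn as nn

def encoder(dim, layers=2, heads=4):
    layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class TwoStreamFusion(nn.Module):
    """Encode each stream separately, then fuse with a joint encoder."""
    def __init__(self, vis_dim=512, aud_dim=128, dim=256):
        super().__init__()
        self.vis_in = nn.Linear(vis_dim, dim)
        self.aud_in = nn.Linear(aud_dim, dim)
        self.vis_enc = encoder(dim)
        self.aud_enc = encoder(dim)
        self.fusion = encoder(dim)              # joint encoder over both streams
        self.score = nn.Linear(dim, 1)          # frame-importance for summarization

    def forward(self, vis, aud):
        # vis: (B, T, vis_dim) frame features; aud: (B, T, aud_dim) audio features.
        v = self.vis_enc(self.vis_in(vis))
        a = self.aud_enc(self.aud_in(aud))
        fused = self.fusion(torch.cat([v, a], dim=1))   # concat along sequence
        T = vis.shape[1]
        # Read importance scores off the visual positions of the fused sequence.
        return self.score(fused[:, :T]).squeeze(-1)     # (B, T)
```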
arXiv Detail & Related papers (2021-09-22T07:38:59Z)