Extracting Motion and Appearance via Inter-Frame Attention for Efficient
Video Frame Interpolation
- URL: http://arxiv.org/abs/2303.00440v1
- Date: Wed, 1 Mar 2023 12:00:15 GMT
- Title: Extracting Motion and Appearance via Inter-Frame Attention for Efficient
Video Frame Interpolation
- Authors: Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, Limin
Wang
- Abstract summary: We propose a novel module to explicitly extract motion and appearance information via a unifying operation.
Specifically, we rethink the information processing in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction.
For both fixed- and arbitrary-timestep interpolation, our method achieves state-of-the-art performance on various datasets.
- Score: 46.23787695590861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effectively extracting inter-frame motion and appearance information is
important for video frame interpolation (VFI). Previous works either extract
both types of information in a mixed way or design separate modules for each,
which leads to representation ambiguity and low efficiency.
In this paper, we propose a novel module to explicitly extract motion and
appearance information via a unifying operation. Specifically, we rethink the
information processing in inter-frame attention and reuse its attention map for
both appearance feature enhancement and motion information extraction.
Furthermore, for efficient VFI, our proposed module can be seamlessly
integrated into a hybrid CNN and Transformer architecture. This hybrid pipeline
reduces the computational cost of inter-frame attention while preserving
detailed low-level structure information. Experimental results
demonstrate that, for both fixed- and arbitrary-timestep interpolation, our
method achieves state-of-the-art performance on various datasets. Meanwhile,
our approach incurs a lower computational overhead than models with comparable
performance. The source code and models are available at
https://github.com/MCG-NJU/EMA-VFI.
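To make the core idea concrete, below is a minimal PyTorch sketch of how a single inter-frame attention map can serve double duty. This is an illustration under simplifying assumptions, not the authors' implementation (see the repository above): projection layers, multi-head attention, and the multi-scale hybrid design are omitted, and the function name and shapes are ours.

```python
import torch

def inter_frame_attention(feat0, feat1):
    """Reuse one inter-frame attention map for two purposes.

    feat0, feat1: (B, C, H, W) feature maps of two consecutive frames.
    Returns:
        app:    (B, C, H, W) cross-frame appearance features for frame 0,
                aggregated from frame 1 (to be fused with feat0 downstream).
        motion: (B, 2, H, W) expected displacement of each frame-0 position
                toward its correspondences in frame 1, a coarse motion cue.
    """
    B, C, H, W = feat0.shape
    q = feat0.flatten(2).transpose(1, 2)   # (B, HW, C) queries from frame 0
    k = feat1.flatten(2).transpose(1, 2)   # (B, HW, C) keys from frame 1
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, HW)

    # Reuse 1: appearance enhancement, aggregate frame-1 features per query.
    app = (attn @ k).transpose(1, 2).reshape(B, C, H, W)

    # Reuse 2: motion extraction, aggregate frame-1 coordinates with the same
    # map and subtract each query's own coordinate to get a displacement field.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=feat0.dtype, device=feat0.device),
        torch.arange(W, dtype=feat0.dtype, device=feat0.device),
        indexing="ij",
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(1, H * W, 2)  # (1, HW, 2)
    motion = (attn @ coords) - coords      # (B, HW, 2), broadcast over batch
    motion = motion.transpose(1, 2).reshape(B, 2, H, W)
    return app, motion
```

Since the attention map is quadratic in the number of spatial positions (HW x HW), computing it at full resolution is costly; this is why the hybrid pipeline restricts inter-frame attention to low-resolution features and relies on CNN layers to preserve detailed low-level structure.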
Related papers
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z) - SimulFlow: Simultaneously Extracting Feature and Identifying Target for
Unsupervised Video Object Segmentation [28.19471998380114]
Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention.
Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks.
We propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification.
arXiv Detail & Related papers (2023-11-30T06:44:44Z) - NIO: Lightweight neural operator-based architecture for video frame
interpolation [15.875579519177487]
NIO is a lightweight, efficient neural operator-based architecture for video frame interpolation.
We show that NIO produces visually smooth and accurate results and converges in fewer epochs than state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-19T20:30:47Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose a Transformer that mines interactions among only a few selected foreground objects.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Modular Interactive Video Object Segmentation: Interaction-to-Mask,
Propagation and Difference-Aware Fusion [68.45737688496654]
We present a modular interactive VOS framework which decouples interaction-to-mask and mask propagation.
We show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions.
arXiv Detail & Related papers (2021-03-14T14:39:08Z) - All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced
Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions that interpolate one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramid-style network in the temporal domain to complete the multi-frame task in one shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The Temporal Adaptive Module (TAM) generates video-specific temporal kernels based on its own feature map (a simplified sketch follows this list).
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
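As an aside on the TAM entry above, the following is a simplified sketch of video-adaptive temporal kernels: a small head predicts a 1D kernel from a video's globally pooled features, and every channel is then convolved along the time axis with that kernel. This is a loose illustration, not the paper's code; the actual TAM predicts per-channel kernels and pairs them with a local attention branch, whereas here one shared kernel per video keeps the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalConv(nn.Module):
    """Video-adaptive temporal kernel, in the spirit of TAM (simplified:
    one shared kernel per video instead of one per channel)."""
    def __init__(self, channels, t_kernel=3):
        super().__init__()
        self.t_kernel = t_kernel
        self.head = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, t_kernel),
        )

    def forward(self, x):                                # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        ctx = x.mean(dim=(2, 3, 4))                      # (B, C) per-video context
        kernels = torch.softmax(self.head(ctx), dim=-1)  # (B, t_kernel)
        pad = self.t_kernel // 2
        out = []
        for b in range(B):                               # one adaptive kernel per video
            xb = x[b].permute(0, 2, 3, 1).reshape(1, C * H * W, T)
            w = kernels[b].view(1, 1, -1).expand(C * H * W, 1, -1).contiguous()
            yb = F.conv1d(xb, w, padding=pad, groups=C * H * W)
            out.append(yb.reshape(C, H, W, T).permute(0, 3, 1, 2))
        return torch.stack(out)                          # (B, C, T, H, W)
```

For example, `AdaptiveTemporalConv(64)(torch.randn(2, 64, 8, 14, 14))` returns a tensor of the same shape, temporally filtered with a different predicted kernel for each of the two videos.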
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.