Three-Stage Cascade Framework for Blurry Video Frame Interpolation
- URL: http://arxiv.org/abs/2310.05383v1
- Date: Mon, 9 Oct 2023 03:37:30 GMT
- Title: Three-Stage Cascade Framework for Blurry Video Frame Interpolation
- Authors: Pengcheng Lei, Zaoming Yan, Tingting Wang, Faming Fang and Guixu Zhang
- Abstract summary: Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos.
Existing BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance.
We propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos.
- Score: 23.38547327916875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Blurry video frame interpolation (BVFI), which aims to generate
high-frame-rate clear videos from low-frame-rate blurry videos, is a
challenging but important topic in the computer vision community. Blurry
videos not only provide spatial
and temporal information like clear videos, but also contain additional motion
information hidden in each blurry frame. However, existing BVFI methods usually
fail to fully leverage all valuable information, which ultimately hinders their
performance. In this paper, we propose a simple end-to-end three-stage
framework to fully explore useful information from blurry videos. The frame
interpolation stage designs a temporal deformable network to directly sample
useful information from blurry inputs and synthesize an intermediate frame at
an arbitrary time interval. The temporal feature fusion stage explores the
long-term temporal information for each target frame through a bi-directional
recurrent deformable alignment network. And the deblurring stage applies a
transformer-empowered Taylor approximation network to recursively recover the
high-frequency details. The proposed three-stage framework assigns a clear
task to each module and offers good expandability, and its effectiveness is
demonstrated by various experimental results. We evaluate our model
on four benchmarks, including the Adobe240 dataset, GoPro dataset, YouTube240
dataset and Sony dataset. Quantitative and qualitative results indicate that
our model outperforms existing SOTA methods. Besides, experiments on real-world
blurry videos also indicate the good generalization ability of our model.
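To make the described division of labour concrete, below is a minimal PyTorch-style sketch of how such a three-stage cascade could be wired. This is not the authors' implementation: the class names (`InterpStage`, `FusionStage`, `DeblurStage`) and all layer choices are placeholder assumptions that only illustrate how the interpolation, temporal-fusion and deblurring modules might be chained.

```python
# Structural sketch only (assumed, not from the paper): placeholder conv blocks
# stand in for the temporal deformable, recurrent alignment and Taylor networks.
import torch
import torch.nn as nn


class InterpStage(nn.Module):
    """Stage 1: synthesize an intermediate frame at an arbitrary time t
    from two blurry inputs (placeholder for the temporal deformable network)."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6 + 1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, b0, b1, t):
        # Concatenate the two blurry frames with a constant time map for t.
        t_map = torch.full_like(b0[:, :1], float(t))
        return self.net(torch.cat([b0, b1, t_map], dim=1))


class FusionStage(nn.Module):
    """Stage 2: aggregate long-term temporal context per target frame
    (placeholder for the bi-directional recurrent deformable alignment)."""
    def __init__(self, ch=32):
        super().__init__()
        self.fwd = nn.Conv2d(3 + ch, ch, 3, padding=1)
        self.bwd = nn.Conv2d(3 + ch, ch, 3, padding=1)
        self.out = nn.Conv2d(2 * ch, 3, 3, padding=1)
        self.ch = ch

    def forward(self, frames):
        # frames: list of (B,3,H,W) interpolated frames in temporal order.
        B, _, H, W = frames[0].shape
        h_f = frames[0].new_zeros(B, self.ch, H, W)
        h_b = frames[0].new_zeros(B, self.ch, H, W)
        fwd_states, bwd_states = [], []
        for f in frames:                       # forward recurrence
            h_f = torch.relu(self.fwd(torch.cat([f, h_f], dim=1)))
            fwd_states.append(h_f)
        for f in reversed(frames):             # backward recurrence
            h_b = torch.relu(self.bwd(torch.cat([f, h_b], dim=1)))
            bwd_states.append(h_b)
        bwd_states.reverse()
        return [self.out(torch.cat([hf, hb], dim=1))
                for hf, hb in zip(fwd_states, bwd_states)]


class DeblurStage(nn.Module):
    """Stage 3: residual refinement of high-frequency detail (placeholder
    for the transformer-empowered Taylor approximation network)."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, x):
        return x + self.net(x)


class ThreeStageCascade(nn.Module):
    def __init__(self):
        super().__init__()
        self.interp, self.fuse, self.deblur = InterpStage(), FusionStage(), DeblurStage()

    def forward(self, b0, b1, times=(0.25, 0.5, 0.75)):
        mids = [self.interp(b0, b1, t) for t in times]   # stage 1: interpolate
        fused = self.fuse(mids)                          # stage 2: temporal fusion
        return [self.deblur(f) for f in fused]           # stage 3: deblur


if __name__ == "__main__":
    b0, b1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    outs = ThreeStageCascade()(b0, b1)
    print(len(outs), outs[0].shape)  # 3 torch.Size([1, 3, 64, 64])
```

The cascade structure is what gives each module a clear task: the same skeleton would allow swapping any one stage (e.g. a stronger deblurring network) without touching the others, which is the expandability the abstract refers to.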
Related papers
- CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring [44.30048301161034]
Video deblurring aims to enhance the quality of restored results in motion-blurred videos by gathering information from adjacent video frames.
We propose two modules: 1) Intra-frame feature enhancement operates within the exposure time of a single blurred frame, and 2) Inter-frame temporal feature alignment gathers valuable long-range temporal information to target frames.
We demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods through extensive experiments conducted on both synthetic and real-world deblurring datasets.
arXiv Detail & Related papers (2024-08-27T10:09:17Z) - LoopAnimate: Loopable Salient Object Animation [19.761865029125524]
LoopAnimate is a novel method for generating videos with consistent start and end frames.
It achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
arXiv Detail & Related papers (2024-04-14T07:36:18Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders
Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Aggregating Long-term Sharp Features via Hybrid Transformers for Video
Deblurring [76.54162653678871]
We propose a video deblurring method that leverages both neighboring frames and present sharp frames using hybrid Transformers for feature aggregation.
Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moire patterns, appearing as color distortions, severely degrade image and video qualities when filming a screen with digital cameras.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame".
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Temporal-Spatial Feature Pyramid for Video Saliency Detection [2.578242050187029]
We propose a 3D fully convolutional encoder-decoder architecture for video saliency detection.
Our model is simple yet effective, and can run in real time.
arXiv Detail & Related papers (2021-05-10T09:14:14Z) - FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation [97.99012124785177]
FLAVR is a flexible and efficient architecture that uses 3D space-time convolutions to enable end-to-end learning and inference for video frame interpolation.
We demonstrate that FLAVR can serve as a useful self-supervised pretext task for action recognition, optical flow estimation, and motion magnification.
arXiv Detail & Related papers (2020-12-15T18:59:30Z) - ALANET: Adaptive Latent Attention Network for Joint Video Deblurring and
Interpolation [38.52446103418748]
We introduce a novel architecture, Adaptive Latent Attention Network (ALANET), which synthesizes sharp high frame-rate videos.
We employ a combination of self-attention and cross-attention modules between consecutive frames in the latent space to generate an optimized representation for each frame.
Our method performs favorably against various state-of-the-art approaches, even though we tackle a much more difficult problem.
arXiv Detail & Related papers (2020-08-31T21:11:53Z)