Exploring Discontinuity for Video Frame Interpolation
- URL: http://arxiv.org/abs/2202.07291v5
- Date: Thu, 23 Mar 2023 04:44:42 GMT
- Title: Exploring Discontinuity for Video Frame Interpolation
- Authors: Sangjin Lee, Hyeongmin Lee, Chajin Shin, Hanbin Son, Sangyoun Lee
- Abstract summary: We propose three techniques to make existing deep learning-based VFI architectures robust to discontinuous motions.
First is a novel data augmentation strategy called figure-text mixing (FTM), which lets the models learn discontinuous motions.
Second, we propose a simple but effective module that predicts a discontinuity map (D-map), which densely distinguishes between areas of continuous and discontinuous motion.
Third, we propose loss functions that supervise the discontinuous motion areas and can be applied along with FTM and the D-map.
- Score: 7.061238509514182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video frame interpolation (VFI) is the task of synthesizing an
intermediate frame given two consecutive frames. Most previous studies have
focused on appropriate frame warping operations and refinement modules for the
warped frames, and they have been conducted on natural videos containing only
continuous motions. However, many practical videos contain unnatural objects
with discontinuous motions, such as logos, user interfaces, and subtitles. We
propose three techniques to make existing deep learning-based VFI architectures
robust to these elements. First is a novel data augmentation strategy called
figure-text mixing (FTM), which lets the models learn discontinuous motions
during the training stage without any extra dataset. Second, we propose a
simple but effective module that predicts a discontinuity map (D-map), which
densely distinguishes between areas of continuous and discontinuous motion.
Lastly, we propose loss functions that supervise the discontinuous motion areas
and can be applied along with FTM and the D-map. We additionally collect a
special test benchmark, the Graphical Discontinuous Motion (GDM) dataset,
consisting of mobile game and chat videos. Applied to various state-of-the-art
VFI networks, our method significantly improves interpolation quality not only
on the GDM dataset but also on existing benchmarks containing only continuous
motions, such as Vimeo90K, UCF101, and DAVIS.
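The abstract describes two operations that are easy to picture in code: an FTM-style augmentation that pastes a graphic at a fixed position across a training triplet, and a D-map that blends a continuously interpolated prediction with a copied input frame in discontinuous regions. The sketch below is a rough illustration under assumed (C, H, W) float tensors; the overlay placement, the choice of frame0 as the copy source, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two ideas described in the abstract.
# Assumptions: frames are torch float tensors of shape (C, H, W); the overlay is a
# static graphic pasted at the same position in both inputs and the ground truth.
import torch


def ftm_augment(frame0, frame1, gt, overlay, top_left=(16, 16)):
    """Figure-text-mixing-style augmentation: paste a graphic (e.g. a logo or
    subtitle crop) onto all three frames of a training triplet so that its motion
    is discontinuous with respect to the underlying scene motion."""
    y, x = top_left
    _, h, w = overlay.shape
    for f in (frame0, frame1, gt):
        f[:, y:y + h, x:x + w] = overlay  # overlay stays fixed while the scene moves
    return frame0, frame1, gt


def blend_with_dmap(continuous_pred, frame0, d_map):
    """D-map-style blending: where the predicted discontinuity map is close to 1,
    fall back to copying a reference input frame instead of the warped,
    continuously interpolated result. Using frame0 as the copy source is an
    illustrative assumption."""
    # continuous_pred, frame0: (C, H, W); d_map: (1, H, W) with values in [0, 1]
    return (1.0 - d_map) * continuous_pred + d_map * frame0
```

In this reading, the loss functions mentioned in the abstract would supervise the discontinuous motion areas created by FTM together with the predicted D-map; their exact form is not specified in the abstract.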
Related papers
- Unfolding Videos Dynamics via Taylor Expansion [5.723852805622308]
We present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi)
ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence.
ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings.
arXiv Detail & Related papers (2024-09-04T01:41:09Z) - DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - TTVFI: Learning Trajectory-Aware Transformer for Video Frame
Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI)
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Continuous-Time Video Generation via Learning Motion Dynamics with
Neural ODE [26.13198266911874]
We propose a novel video generation approach that learns separate distributions for motion and appearance.
We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints in arbitrary frame rates, and the second stage synthesizes videos based on the given keypoints sequence and the appearance noise vector.
arXiv Detail & Related papers (2021-12-21T03:30:38Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector; a generic channel-gating sketch in this spirit appears after this list.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
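The Channel-wise Motion Enhancement module mentioned in the list above is summarized only at the level of a channel-wise gate vector. The sketch below shows a generic squeeze-and-excitation-style channel gate that captures the same idea; it is not the paper's CME module, and the layer sizes, reduction ratio, and use of global average pooling are assumptions.

```python
# Generic channel-gating sketch (squeeze-and-excitation style), NOT the CME module
# itself; it only illustrates re-weighting feature channels with a learned gate vector.
import torch
import torch.nn as nn


class ChannelGate(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gate values in (0, 1)
        )

    def forward(self, feat):                      # feat: (B, C, H, W)
        gate = self.fc(feat.mean(dim=(2, 3)))     # global average pool -> (B, C)
        return feat * gate[:, :, None, None]      # emphasize the gated channels
```

In this reading, the gate downweights channels carrying little dynamic information and amplifies the rest.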