Less is More: Consistent Video Depth Estimation with Masked Frames Modeling
- URL: http://arxiv.org/abs/2208.00380v1
- Date: Sun, 31 Jul 2022 07:11:20 GMT
- Title: Less is More: Consistent Video Depth Estimation with Masked Frames Modeling
- Authors: Yiran Wang, Zhiyu Pan, Xingyi Li, Zhiguo Cao, Ke Xian, Jianming Zhang
- Abstract summary: Temporal consistency is the key challenge of video depth estimation.
We propose a frame masking network (FMNet) that predicts the depth of masked frames based on their neighboring frames.
Experimental results demonstrate that, compared with prior art, our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information.
- Score: 41.177591332503255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal consistency is the key challenge of video depth estimation. Previous
works rely on additional optical flow or camera poses, which are time-consuming
to compute. By contrast, we derive consistency with less information. Since
videos inherently contain heavy temporal redundancy, a missing frame can be
recovered from its neighboring ones. Inspired by this, we propose the frame
masking network (FMNet), a spatial-temporal transformer that predicts the
depth of masked frames based on their neighboring frames. By reconstructing
masked temporal features, the FMNet learns intrinsic inter-frame
correlations, which leads to consistency. Experimental results demonstrate
that, compared with prior art, our approach achieves comparable spatial
accuracy and higher temporal consistency without any additional information.
Our work provides a new perspective on consistent video depth estimation.
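For illustration, below is a minimal PyTorch sketch of the frame-masking mechanism the abstract describes. It is not the authors' FMNet: spatial processing is omitted, and the pooled per-frame features, mask token, masking ratio, transformer hyperparameters, and scalar depth head are all illustrative assumptions. Only the core idea follows the abstract: replace a random subset of frame features with a mask token and let a temporal transformer reconstruct them from their neighbors.

```python
# Hedged sketch of masked-frame modeling for video depth (NOT the authors' FMNet).
# All module names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class MaskedFrameDepthSketch(nn.Module):
    def __init__(self, feat_dim=256, num_layers=4, num_heads=8, mask_ratio=0.4):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learnable token that stands in for the features of masked frames.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        # Temporal transformer operating over the frame axis.
        self.temporal_transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        # Placeholder depth head: one value per frame instead of a full depth map.
        self.depth_head = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) pooled per-frame features.
        b, t, c = frame_feats.shape
        if self.training:
            # Randomly mask a fraction of the frames during training.
            mask = torch.rand(b, t, device=frame_feats.device) < self.mask_ratio
            frame_feats = torch.where(
                mask.unsqueeze(-1), self.mask_token.expand(b, t, c), frame_feats
            )
        # Unmasked neighbors provide the context to reconstruct masked frames,
        # which is what encourages temporally consistent predictions.
        fused = self.temporal_transformer(frame_feats)
        return self.depth_head(fused)  # (batch, num_frames, 1)


if __name__ == "__main__":
    model = MaskedFrameDepthSketch()
    feats = torch.randn(2, 8, 256)   # 2 clips, 8 frames each
    print(model(feats).shape)        # torch.Size([2, 8, 1])
```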
Related papers
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GANs to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)
- Temporally Consistent Online Depth Estimation Using Point-Based Fusion [6.5514240555359455]
We aim to estimate temporally consistent depth maps of video streams in an online setting.
This is a difficult problem as future frames are not available and the method must choose between enforcing consistency and correcting errors from previous estimations.
We propose to address these challenges by using a global point cloud that is dynamically updated each frame, along with a learned fusion approach in image space.
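As a rough illustration of this point-based fusion idea (not the paper's actual method), the sketch below re-projects a maintained point cloud into the current view and blends it with the per-frame depth estimate. The pinhole projection, the nearest-pixel splatting, and the fixed alpha blend standing in for the paper's learned fusion are all assumptions.

```python
# Hedged sketch of fusing a per-frame depth estimate with a re-projected global
# point cloud; camera model, splatting, and fusion rule are illustrative.
import torch


def project_points(points, K, img_hw):
    """Splat camera-space points (N, 3) with intrinsics K (3, 3) into a sparse depth map."""
    h, w = img_hw
    z = points[:, 2].clamp(min=1e-6)
    uv = (K @ (points / z.unsqueeze(1)).T).T[:, :2]             # pixel coordinates
    u, v = uv[:, 0].round().long(), uv[:, 1].round().long()
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (points[:, 2] > 0)
    depth = torch.zeros(h, w, dtype=points.dtype, device=points.device)
    depth[v[valid], u[valid]] = z[valid]                        # last write wins (no z-buffer)
    return depth


def fuse(current_depth, cached_depth, alpha=0.7):
    """Blend the current estimate with the re-projected cache where the cache is valid.

    A learned fusion network would replace this fixed alpha blend."""
    has_cache = cached_depth > 0
    fused = current_depth.clone()
    fused[has_cache] = alpha * current_depth[has_cache] + (1 - alpha) * cached_depth[has_cache]
    return fused
```

After fusion, the fused depth would typically be back-projected to update the global point cloud for the next frame.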
arXiv Detail & Related papers (2023-04-15T00:04:18Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely long, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Temporally Consistent Online Depth Estimation in Dynamic Scenes [17.186528244457055]
Temporally consistent depth estimation is crucial for real-time applications such as augmented reality.
We present a technique to produce temporally consistent depth estimates in dynamic scenes in an online setting.
Our network augments current per-frame stereo networks with novel motion and fusion networks.
arXiv Detail & Related papers (2021-11-17T19:00:51Z)
- Intrinsic Temporal Regularization for High-resolution Human Video Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
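A hedged sketch of a confidence-weighted temporal term in this spirit (not the paper's formulation): the previous frame's output is warped into the current frame with optical flow, and the per-pixel error is weighted by a confidence map so unreliable motion estimates contribute less. The warping function and loss form are assumptions.

```python
# Illustrative confidence-weighted temporal consistency loss.
import torch
import torch.nn.functional as F


def warp(prev, flow):
    """Backward-warp a (B, C, H, W) tensor with a (B, 2, H, W) flow field."""
    b, _, h, w = prev.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev.device, dtype=prev.dtype),
        torch.arange(w, device=prev.device, dtype=prev.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow      # absolute sample positions
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1]
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(prev, torch.stack((grid_x, grid_y), dim=-1), align_corners=True)


def temporal_consistency_loss(curr, prev, flow, confidence):
    """L1 distance to the warped previous output, weighted by a confidence map in [0, 1]."""
    return (confidence * (curr - warp(prev, flow)).abs()).mean()
```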
arXiv Detail & Related papers (2020-12-11T05:29:45Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
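As a generic illustration of distilling a compact segmentation model from a large one (the paper designs its own distillation methods, which are not reproduced here), a standard per-pixel soft-label distillation term could look as follows; the temperature and weighting are illustrative.

```python
# Illustrative per-pixel knowledge distillation loss for semantic segmentation.
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """student_logits, teacher_logits: (B, C, H, W); labels: (B, H, W) class indices."""
    t = temperature
    # KL divergence between softened per-pixel class distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=1),
        F.softmax(teacher_logits / t, dim=1),
        reduction="batchmean",
    ) * (t * t)
    # Standard supervised cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```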
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.