Temporally Consistent Online Depth Estimation in Dynamic Scenes
- URL: http://arxiv.org/abs/2111.09337v1
- Date: Wed, 17 Nov 2021 19:00:51 GMT
- Title: Temporally Consistent Online Depth Estimation in Dynamic Scenes
- Authors: Zhaoshuo Li, Wei Ye, Dilin Wang, Francis X. Creighton, Russell H.
Taylor, Ganesh Venkatesh, Mathias Unberath
- Abstract summary: Temporally consistent depth estimation is crucial for real-time applications such as augmented reality.
We present a technique to produce temporally consistent depth estimates in dynamic scenes in an online setting.
Our network augments current per-frame stereo networks with novel motion and fusion networks.
- Score: 17.186528244457055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporally consistent depth estimation is crucial for real-time applications
such as augmented reality. While stereo depth estimation has received
substantial attention that led to improvements on a frame-by-frame basis, there
is relatively little work focused on maintaining temporal consistency across
frames. Indeed, based on our analysis, current stereo depth estimation
techniques still suffer from poor temporal consistency. Stabilizing depth
temporally in dynamic scenes is challenging due to concurrent object and camera
motion. In an online setting, this process is further aggravated because only
past frames are available. In this paper, we present a technique to produce
temporally consistent depth estimates in dynamic scenes in an online setting.
Our network augments current per-frame stereo networks with novel motion and
fusion networks. The motion network accounts for both object and camera motion
by predicting a per-pixel SE3 transformation. The fusion network improves
consistency in prediction by aggregating the current and previous predictions
with regressed weights. We conduct extensive experiments across varied datasets
(synthetic, outdoor, indoor and medical). In both zero-shot generalization and
domain fine-tuning, we demonstrate that our proposed approach outperforms
competing methods in terms of temporal stability and per-frame accuracy, both
quantitatively and qualitatively. Our code will be available online.
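To make the abstract's description of the motion and fusion networks more concrete, the sketch below illustrates the two operations it names: warping a previous depth map with a per-pixel SE3 field and blending it with the current prediction using per-pixel weights. This is only an illustrative sketch, not the authors' implementation; the function names (warp_depth_se3, fuse_depth), the pinhole intrinsics K, and the assumption that the fusion weights w are already regressed by some network are all placeholders introduced here.

```python
import numpy as np

def warp_depth_se3(depth_prev, R, t, K):
    """Illustrative sketch (not the paper's code): forward-warp the previous
    depth map into the current frame with a per-pixel SE3 field.
    depth_prev: (H, W) depth at time t-1
    R: (H, W, 3, 3) per-pixel rotations, t: (H, W, 3) per-pixel translations
    K: (3, 3) pinhole intrinsics (assumed known)
    Returns an (H, W) warped depth map (zeros where nothing projects)."""
    H, W = depth_prev.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)   # (H, W, 3)
    # Back-project every pixel of frame t-1 to a 3D point.
    pts = (np.linalg.inv(K) @ pix[..., None])[..., 0] * depth_prev[..., None]
    # Apply the per-pixel rigid motion (object and camera motion combined).
    pts = (R @ pts[..., None])[..., 0] + t
    # Re-project into the current frame.
    proj = (K @ pts[..., None])[..., 0]
    z = proj[..., 2]
    u2 = np.round(proj[..., 0] / np.maximum(z, 1e-6)).astype(int)
    v2 = np.round(proj[..., 1] / np.maximum(z, 1e-6)).astype(int)
    valid = (z > 1e-6) & (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    warped = np.zeros_like(depth_prev)
    warped[v2[valid], u2[valid]] = z[valid]   # nearest-pixel splat, ignoring occlusion order
    return warped

def fuse_depth(depth_curr, depth_warped, w):
    """Per-pixel convex combination of current and warped previous depth;
    w in [0, 1] stands in for the regressed fusion weights."""
    return w * depth_warped + (1.0 - w) * depth_curr
```

A real system would additionally handle occlusions and invalid warps; the point here is only that per-pixel SE3 warping brings the previous estimate into the current frame, and the learned weights decide, pixel by pixel, how much of it to keep.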
Related papers
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z) - DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions.
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z) - Temporally Consistent Online Depth Estimation Using Point-Based Fusion [6.5514240555359455]
We aim to estimate temporally consistent depth maps of video streams in an online setting.
This is a difficult problem as future frames are not available and the method must choose between enforcing consistency and correcting errors from previous estimations.
We propose to address these challenges by using a global point cloud that is dynamically updated each frame, along with a learned fusion approach in image space.
arXiv Detail & Related papers (2023-04-15T00:04:18Z) - Multi-view reconstruction of bullet time effect based on improved NSFF
model [2.5698815501864924]
Bullet time is a type of visual effect commonly used in film, television and games.
This paper reconstructs common bullet-time effect scenes from film and television from a new perspective.
arXiv Detail & Related papers (2023-04-01T14:58:00Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth
Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain enhanced depth feature representations.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic priors.
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - Minimum Latency Deep Online Video Stabilization [77.68990069996939]
We present a novel camera path optimization framework for the task of online video stabilization.
In this work, we adopt recent off-the-shelf high-quality deep motion models for motion estimation to recover the camera trajectory.
Our approach significantly outperforms state-of-the-art online methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-05T07:37:32Z) - Less is More: Consistent Video Depth Estimation with Masked Frames
Modeling [41.177591332503255]
Temporal consistency is the key challenge of video depth estimation.
We propose a frame masking network (FMNet) predicting the depth of masked frames based on their neighboring frames.
Compared with prior art, experimental results demonstrate that our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information.
arXiv Detail & Related papers (2022-07-31T07:11:20Z) - Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy in a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z) - Consistency Guided Scene Flow Estimation [159.24395181068218]
CGSF is a self-supervised framework for the joint reconstruction of 3D scene structure and motion from stereo video.
We show that the proposed model can reliably predict disparity and scene flow in challenging imagery.
It achieves better generalization than the state-of-the-art, and adapts quickly and robustly to unseen domains.
arXiv Detail & Related papers (2020-06-19T17:28:07Z) - Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant (a minimal sketch of this pairing step follows the list below).
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)