How Far Can I Go?: A Self-Supervised Approach for Deterministic Video Depth Forecasting
- URL: http://arxiv.org/abs/2207.00506v1
- Date: Fri, 1 Jul 2022 15:51:17 GMT
- Title: How Far Can I Go?: A Self-Supervised Approach for Deterministic Video Depth Forecasting
- Authors: Sauradip Nag, Nisarg Shah, Anran Qi, Raghavendra Ramachandra
- Abstract summary: We present a novel self-supervised method to anticipate the depth estimate for a future, unobserved real-world urban scene.
This work is the first to explore self-supervised learning for estimating the monocular depth of future, unobserved frames of a video.
- Score: 23.134156184783357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we present a novel self-supervised method to anticipate the
depth estimate for a future, unobserved real-world urban scene. This work is
the first to explore self-supervised learning for estimating the monocular depth
of future, unobserved frames of a video. Existing works rely on a large number
of annotated samples to generate a probabilistic prediction of depth for
unseen frames. However, this is impractical because it requires a large amount
of annotated depth samples of video. In addition, the probabilistic nature of
the problem, where one past can have multiple future outcomes, often leads to
incorrect depth estimates. Unlike previous methods, we model the depth
estimation of the unobserved frame as a view-synthesis problem, which treats
the depth estimate of the unseen video frame as an auxiliary task while
synthesizing back the views using a learned pose. This approach is not only
cost-effective (we do not use any ground-truth depth for training, hence
practical) but also deterministic (a sequence of past frames maps to an
immediate future). To address this task we first develop a novel depth
forecasting network, DeFNet, which estimates the depth of the unobserved future
by forecasting latent features. Second, we develop a channel-attention-based
pose estimation network that estimates the pose of the unobserved frame. Using
this learned pose, the estimated depth map is reconstructed back into the image
domain, thus forming a self-supervised solution. Our proposed approach shows
significant improvements in the Abs Rel metric compared to state-of-the-art
alternatives in both short- and mid-term forecasting settings, benchmarked on
KITTI and Cityscapes. Code is available at
https://github.com/sauradip/depthForecasting
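The abstract names the ingredients (forecast depth, estimated pose, view synthesis, photometric reconstruction) but not how they are wired together. Below is a minimal PyTorch sketch of such a self-supervised training signal, plus the Abs Rel metric the paper reports. Every name here (defnet, posenet, reproject, training_step) is a hypothetical placeholder, not the authors' API; the real implementation in the linked repository may differ, for example by adding SSIM and edge-aware smoothness terms as is common in self-supervised depth work.

```python
# Sketch of a self-supervised depth-forecasting step, assuming:
#   past_frames: (B, T, 3, H, W) video clip, future_frame: (B, 3, H, W)
#   K, K_inv: (B, 3, 3) camera intrinsics and their inverse
#   defnet, posenet: placeholder networks standing in for DeFNet and the
#   channel-attention pose network described in the abstract.
import torch
import torch.nn.functional as F


def abs_rel(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Absolute relative error, the evaluation metric named in the abstract."""
    valid = gt > 0
    return ((pred[valid] - gt[valid]).abs() / gt[valid]).mean()


def reproject(depth, T, K, K_inv):
    """Sampling grid that maps future-frame pixels to source-frame coordinates
    via the forecast depth and relative pose (standard view-synthesis geometry
    from self-supervised monocular depth estimation)."""
    B, _, H, W = depth.shape
    dev = depth.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev), torch.arange(W, device=dev), indexing="ij"
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1)
    cam = (K_inv @ pix) * depth.view(B, 1, -1)          # back-project to 3D
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], 1)
    proj = K @ (T @ cam)[:, :3]                         # move to source view
    z = proj[:, 2].clamp(min=1e-6)
    gx = 2 * (proj[:, 0] / z) / (W - 1) - 1             # normalize to [-1, 1]
    gy = 2 * (proj[:, 1] / z) / (H - 1) - 1             # for grid_sample
    return torch.stack([gx, gy], dim=-1).view(B, H, W, 2)


def training_step(defnet, posenet, past_frames, future_frame, K, K_inv):
    # 1. DeFNet's role per the abstract: forecast latent features from the
    #    past frames and decode a depth map for the unobserved future frame.
    depth = defnet(past_frames)                         # (B, 1, H, W)

    # 2. Estimate the relative pose of the unobserved frame (the paper uses
    #    a channel-attention pose network; its exact inputs are a guess).
    T = posenet(past_frames)                            # (B, 4, 4) SE(3)

    # 3. View synthesis: warp the last observed frame into the future view
    #    using the forecast depth and the estimated pose.
    grid = reproject(depth, T, K, K_inv)
    recon = F.grid_sample(past_frames[:, -1], grid, align_corners=False)

    # 4. The photometric error is the training loss: no ground-truth depth
    #    is used, only the video frames themselves (hence self-supervised
    #    and deterministic, as the abstract argues).
    return (future_frame - recon).abs().mean()
```

The design point the abstract makes is visible in step 4: because supervision comes from reconstructing an actual future frame rather than from annotated depth, one past clip maps to one concrete future, sidestepping the multi-modal ambiguity of probabilistic forecasters.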
Related papers
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- FutureDepth: Learning to Predict the Future Improves Video Depth Estimation [46.421154770321266]
FutureDepth is a video depth estimation approach that implicitly leverages multi-frame and motion cues to improve depth estimation.
We show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy.
arXiv Detail & Related papers (2024-03-19T17:55:22Z)
- Range-Agnostic Multi-View Depth Estimation With Keyframe Selection [33.99466211478322]
Methods for 3D reconstruction from posed frames require prior knowledge about the scene metric range.
RAMDepth is an efficient and purely 2D framework that reverses the depth estimation and matching steps order.
arXiv Detail & Related papers (2024-01-25T18:59:42Z)
- STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption to train networks; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- On the Sins of Image Synthesis Loss for Self-supervised Depth Estimation [60.780823530087446]
We show that improvements in image synthesis do not necessitate improvement in depth estimation.
We attribute this diverging phenomenon to aleatoric uncertainties, which originate from data.
This observed divergence has not been previously reported or studied in depth.
arXiv Detail & Related papers (2021-09-13T17:57:24Z)
- Occlusion-Aware Depth Estimation with Adaptive Normal Constraints [85.44842683936471]
We present a new learning-based method for multi-frame depth estimation from a color video.
Our method outperforms the state-of-the-art in terms of depth estimation accuracy.
arXiv Detail & Related papers (2020-04-02T07:10:45Z)
- Don't Forget The Past: Recurrent Depth Estimation from Monocular Video [92.84498980104424]
We put three different types of depth estimation into a common framework.
Our method produces a time series of depth maps.
It can be applied to monocular videos only or be combined with different types of sparse depth patterns.
arXiv Detail & Related papers (2020-01-08T16:50:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.