FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
- URL: http://arxiv.org/abs/2403.12953v1
- Date: Tue, 19 Mar 2024 17:55:22 GMT
- Title: FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
- Authors: Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, Fatih Porikli
- Abstract summary: FutureDepth is a video depth estimation approach that implicitly leverages multi-frame and motion cues to improve depth estimation.
We show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy.
- Score: 46.421154770321266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by making it learn to predict the future during training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to iteratively predict multi-frame features one time step ahead. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries to work with the depth decoder, as well as a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has similar latencies when compared to monocular models.
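To make the training setup described in the abstract concrete, here is a minimal PyTorch sketch of the two auxiliary objectives: F-Net predicting multi-frame features one step ahead, and R-Net reconstructing a masked multi-frame feature volume. The module names follow the abstract, but every shape, layer choice, masking scheme (random rather than adaptive), and loss weighting below is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FNet(nn.Module):
    """Future prediction: map features of T consecutive frames to predicted
    features one step ahead. The Conv3d mixer is a placeholder architecture."""
    def __init__(self, dim=256, num_frames=4):
        super().__init__()
        self.mix = nn.Conv3d(dim, dim, kernel_size=(num_frames, 3, 3), padding=(0, 1, 1))

    def forward(self, feats):               # feats: (B, C, T, H, W), T == num_frames
        return self.mix(feats).squeeze(2)   # predicted t+1 features: (B, C, H, W)

class RNet(nn.Module):
    """Reconstruction via masked auto-encoding of the multi-frame feature volume."""
    def __init__(self, dim=256):
        super().__init__()
        self.dec = nn.Sequential(
            nn.Conv3d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(dim, dim, 3, padding=1))

    def forward(self, feats, mask):         # mask: (B, 1, T, H, W), 1 = visible
        return self.dec(feats * mask)       # reconstruct full volume from visible part

def auxiliary_losses(f_net, r_net, feats_past, feats_next, mask_ratio=0.5):
    """Training-time objectives sketched from the abstract; the random mask and
    equal loss weighting are assumptions (the paper's masking is adaptive)."""
    l_future = F.l1_loss(f_net(feats_past), feats_next)      # predict one step ahead
    mask = (torch.rand_like(feats_past[:, :1]) > mask_ratio).float()
    recon = r_net(feats_past, mask)
    l_recon = F.l1_loss(recon * (1 - mask), feats_past * (1 - mask))
    return l_future + l_recon
```

At inference, per the abstract, the features from both networks would produce queries for the depth decoder and a final refinement network; that plumbing is omitted from this sketch.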
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models, which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation [53.90194273249202]
We propose MAMo, a novel memory and attention framework for monocular video depth estimation.
In MAMo, we augment the model with memory that aids depth prediction as the model streams through the video (a toy sketch of this idea follows the entry).
We show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy.
arXiv Detail & Related papers (2023-07-26T17:55:32Z)
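As referenced above, a toy PyTorch sketch of a memory-and-attention depth head in the spirit of MAMo: the current frame's feature tokens cross-attend to a running memory of past tokens before depth decoding. All names, shapes, and the fixed-length memory policy are hypothetical illustrations, not MAMo's actual design.

```python
import torch
import torch.nn as nn

class MemoryDepthHead(nn.Module):
    """Toy memory-and-attention head: cross-attend current-frame tokens to a
    running memory of past tokens, then decode per-token depth."""
    def __init__(self, dim=128, mem_size=8):
        super().__init__()
        self.mem_size = mem_size
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_depth = nn.Linear(dim, 1)
        self.memory = []                        # list of (B, N, C) past-token tensors

    def forward(self, tokens):                  # tokens: (B, N, C), one video frame
        if self.memory:
            mem = torch.cat(self.memory, dim=1)
            tokens, _ = self.attn(tokens, mem, mem)   # query = now, key/value = past
        self.memory = (self.memory + [tokens.detach()])[-self.mem_size:]
        return self.to_depth(tokens)            # per-token depth: (B, N, 1)
```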
- How Far Can I Go?: A Self-Supervised Approach for Deterministic Video Depth Forecasting [23.134156184783357]
We present a novel self-supervised method to anticipate the depth estimate for a future, unobserved real-world urban scene.
This work is the first to explore self-supervised learning for estimation of monocular depth of future unobserved frames of a video.
arXiv Detail & Related papers (2022-07-01T15:51:17Z)
- Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth [24.897377434844266]
We propose a novel structure and training strategy for monocular depth estimation.
We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder.
Our network achieves state-of-the-art performance on the challenging NYU Depth V2 dataset.
arXiv Detail & Related papers (2022-01-19T06:37:21Z)
- 3DVNet: Multi-View Depth Prediction and Volumetric Refinement [68.68537312256144]
3DVNet is a novel multi-view stereo (MVS) depth-prediction method.
Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions (a toy sketch of such iterative refinement follows the entry).
We show that our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics.
arXiv Detail & Related papers (2021-12-01T00:52:42Z)
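As referenced in the 3DVNet entry above, a minimal sketch of iterative depth refinement: a small network predicts residual updates to the current depth map from image features, applied for a fixed number of iterations. The architecture and iteration count are assumptions for illustration, not 3DVNet's scene-modeling network.

```python
import torch
import torch.nn as nn

class IterativeDepthRefiner(nn.Module):
    """Toy iterative refinement: repeatedly predict a residual correction to
    the current depth estimate from (features, depth)."""
    def __init__(self, feat_dim=64, iters=3):
        super().__init__()
        self.iters = iters
        self.update = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, feats, depth):        # feats: (B, C, H, W); depth: (B, 1, H, W)
        for _ in range(self.iters):
            depth = depth + self.update(torch.cat([feats, depth], dim=1))
        return depth
```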
- Retrieval Augmentation to Improve Robustness and Interpretability of Deep Neural Networks [3.0410237490041805]
In this work, we actively exploit the training data to improve the robustness and interpretability of deep neural networks.
Specifically, the proposed approach uses the target of the nearest input example to initialize the memory state of an LSTM model or to guide attention mechanisms (a toy sketch follows the entry).
Results show the effectiveness of the proposed models for the two tasks on the widely used Flickr8k and IMDB datasets.
arXiv Detail & Related papers (2021-02-25T17:38:31Z)
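As referenced above, a toy sketch of the retrieval-augmentation idea: retrieve the training example nearest to the query in embedding space and use its target to initialize an LSTM's hidden state. The embeddings, distance metric, and all module shapes are hypothetical.

```python
import torch
import torch.nn as nn

def nearest_target(query_emb, train_embs, train_targets):
    """Return the target of the training example closest to the query."""
    idx = torch.cdist(query_emb.unsqueeze(0), train_embs).argmin()
    return train_targets[idx]

class RetrievalInitLSTM(nn.Module):
    """Toy model: embed the retrieved target and use it as the LSTM's initial
    hidden state instead of zeros."""
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.target_emb = nn.Embedding(num_classes, hid_dim)
        self.head = nn.Linear(hid_dim, num_classes)

    def forward(self, x, retrieved_target):    # x: (B, T, in_dim); target: (B,) long
        h0 = self.target_emb(retrieved_target).unsqueeze(0)   # (1, B, hid_dim)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(x, (h0, c0))
        return self.head(out[:, -1])            # classify from the last step
```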
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Deep feature fusion for self-supervised monocular depth prediction [7.779007880126907]
We propose a deep feature fusion method for learning self-supervised depth from scratch.
Our fusion network selects features from both upper and lower levels at every level in the encoder network.
We also propose a refinement module that learns higher-scale residual depth from a combination of higher-level deep features and lower-level residual depth (a toy sketch of the level-fusion idea follows the entry).
arXiv Detail & Related papers (2020-05-16T09:42:36Z)
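As referenced above, a toy sketch of cross-level feature fusion in the spirit of the entry: at each encoder level, neighbouring lower- and higher-level feature maps are resampled to the current resolution and blended with learned gates. The channel counts and softmax gating are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelFusion(nn.Module):
    """Toy fusion block: align the lower (coarser) and higher (finer) level
    features to the current level and blend the three with learned weights."""
    def __init__(self, c_low, c_cur, c_high):
        super().__init__()
        self.proj_low = nn.Conv2d(c_low, c_cur, 1)
        self.proj_high = nn.Conv2d(c_high, c_cur, 1)
        self.gate = nn.Conv2d(3 * c_cur, 3, 1)

    def forward(self, f_low, f_cur, f_high):
        size = f_cur.shape[-2:]
        low = F.interpolate(self.proj_low(f_low), size=size, mode="bilinear", align_corners=False)
        high = F.interpolate(self.proj_high(f_high), size=size, mode="bilinear", align_corners=False)
        w = torch.softmax(self.gate(torch.cat([low, f_cur, high], dim=1)), dim=1)  # (B, 3, H, W)
        return w[:, 0:1] * low + w[:, 1:2] * f_cur + w[:, 2:3] * high
```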
- On the performance of deep learning models for time series classification in streaming [0.0]
This work assesses the performance of different types of deep architectures for data streaming classification.
We evaluate models such as multi-layer perceptrons, recurrent, convolutional and temporal convolutional neural networks over several time-series datasets.
arXiv Detail & Related papers (2020-03-05T11:41:29Z)
- Dynamic Inference: A New Approach Toward Efficient Video Action Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos (a toy early-exit sketch follows the entry).
arXiv Detail & Related papers (2020-02-09T11:09:56Z)
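As referenced in the last entry, a toy sketch of one common instantiation of dynamic inference, early exiting: attach a lightweight classifier after each backbone stage and stop as soon as the prediction is confident, so easily distinguishable videos use less computation. The stage structure, confidence rule, and threshold are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy early-exit network: run stages in order and return the first
    sufficiently confident intermediate prediction."""
    def __init__(self, stages, heads, threshold=0.9):
        super().__init__()
        self.stages = nn.ModuleList(stages)     # backbone chunks
        self.heads = nn.ModuleList(heads)       # one lightweight classifier per stage
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):                       # x: (1, C, H, W), a single clip
        probs = None
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            probs = head(x.mean(dim=(-2, -1))).softmax(dim=-1)
            if probs.max() >= self.threshold:   # easy input: stop computing
                break
        return probs
```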
This list is automatically generated from the titles and abstracts of the papers on this site.