MAMo: Leveraging Memory and Attention for Monocular Video Depth
Estimation
- URL: http://arxiv.org/abs/2307.14336v2
- Date: Tue, 12 Sep 2023 21:28:51 GMT
- Title: MAMo: Leveraging Memory and Attention for Monocular Video Depth
Estimation
- Authors: Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek
Garrepalli, Fatih Porikli
- Abstract summary: We propose MAMo, a novel memory and attention framework for monocular video depth estimation.
In MAMo, we augment the model with a memory which aids the depth prediction as the model streams through the video.
We show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy.
- Score: 53.90194273249202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose MAMo, a novel memory and attention framework for monocular video
depth estimation. MAMo can augment and improve any single-image depth
estimation network into a video depth estimation model, enabling it to take
advantage of temporal information to predict more accurate depth. In MAMo,
we augment the model with a memory which aids the depth prediction as the model
streams through the video. Specifically, the memory stores learned visual and
displacement tokens of the previous time instances. This allows the depth
network to cross-reference relevant features from the past when predicting
depth on the current frame. We introduce a novel scheme to continuously update
the memory, optimizing it to keep tokens that correspond with both the past and
the present visual information. We adopt an attention-based approach to process
the memory features: we first learn the spatio-temporal relations among the
resultant visual and displacement memory tokens using a self-attention module.
Further, the output features of self-attention are aggregated with the current
visual features through cross-attention. The cross-attended features are
finally given to a decoder to predict depth on the current frame. Through
extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and
DDAD, we show that MAMo consistently improves monocular depth estimation
networks and sets new state-of-the-art (SOTA) accuracy. Notably, our MAMo video
depth estimation provides higher accuracy with lower latency compared to
SOTA cost-volume-based video depth models.
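To make the described pipeline concrete, below is a minimal, purely illustrative PyTorch sketch of the memory-attention step: self-attention over the stored visual and displacement memory tokens, followed by cross-attention that fuses them with current-frame features before decoding. The module name, dimensions, and residual/normalization choices are assumptions for illustration and are not the authors' implementation; the memory-update scheme is not shown.

```python
import torch
import torch.nn as nn

class MemoryAttentionSketch(nn.Module):
    """Illustrative sketch of a MAMo-style memory-attention block (not the authors' code).

    Memory tokens (visual + displacement tokens from past frames) are first related
    through self-attention; the result is fused with current-frame visual features
    via cross-attention, and the fused features would then feed a depth decoder.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mem = nn.LayerNorm(dim)
        self.norm_cur = nn.LayerNorm(dim)

    def forward(self, memory_tokens: torch.Tensor, current_tokens: torch.Tensor) -> torch.Tensor:
        # memory_tokens: (B, M, C) visual + displacement tokens from previous time instances
        # current_tokens: (B, N, C) visual tokens of the current frame
        mem = self.norm_mem(memory_tokens)
        # Self-attention: learn spatio-temporal relations among the memory tokens.
        mem_sa, _ = self.self_attn(mem, mem, mem)
        mem = mem + mem_sa
        # Cross-attention: current-frame features query the attended memory.
        cur = self.norm_cur(current_tokens)
        fused, _ = self.cross_attn(cur, mem, mem)
        # The fused features would then be passed to a depth decoder.
        return current_tokens + fused

if __name__ == "__main__":
    B, M, N, C = 2, 128, 64, 256
    block = MemoryAttentionSketch(dim=C)
    out = block(torch.randn(B, M, C), torch.randn(B, N, C))
    print(out.shape)  # torch.Size([2, 64, 256])
```

In practice the decoder that turns these fused tokens into a depth map, and the rule for which memory tokens to keep or evict, are defined by the paper itself; the block above only illustrates the self-attention-then-cross-attention ordering described in the abstract.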
Related papers
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by letting redundant vision tokens skip layers rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- FutureDepth: Learning to Predict the Future Improves Video Depth Estimation [46.421154770321266]
FutureDepth is a video depth estimation approach that implicitly leverages multi-frame and motion cues to improve depth estimation.
We show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy.
arXiv Detail & Related papers (2024-03-19T17:55:22Z)
- Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots [0.0]
We formulate a deep learning model that fuses sparse depth measurements from triangulated features to improve the depth predictions.
The network is trained in a supervised fashion on the forward-looking underwater dataset, FLSea.
The method achieves real-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single CPU core.
arXiv Detail & Related papers (2023-10-25T16:32:31Z)
- Lightweight Monocular Depth Estimation with an Edge Guided Network [34.03711454383413]
We present a novel lightweight Edge Guided Depth Estimation Network (EGD-Net).
In particular, we start out with a lightweight encoder-decoder architecture and embed an edge guidance branch.
In order to aggregate the context information and edge attention features, we design a transformer-based feature aggregation module.
arXiv Detail & Related papers (2022-09-29T14:45:47Z)
- CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth [83.77839773394106]
We present a lightweight, tightly-coupled deep depth network and visual-inertial odometry system.
We provide the network with previously marginalized sparse features from VIO to increase the accuracy of initial depth prediction.
We show that it can run in real-time with single-thread execution while utilizing GPU acceleration only for the network and code Jacobian.
arXiv Detail & Related papers (2020-12-18T09:42:54Z)
- DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion [67.64047158294062]
We propose an online multi-view depth prediction approach on posed video streams.
The scene geometry information computed in the previous time steps is propagated to the current time step.
We outperform the existing state-of-the-art multi-view stereo methods on most of the evaluated metrics.
arXiv Detail & Related papers (2020-12-03T18:54:03Z)
- MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation [22.495019810166397]
We propose a new powerful network with a recurrent module that achieves the capability of a deep network.
The network maintains an extremely lightweight size for real-time, high-performance unsupervised monocular depth prediction from video sequences.
Our new model can run at a speed of about 110 frames per second (fps) on a single GPU, 37 fps on a single CPU, and 2 fps on a Raspberry Pi 3.
arXiv Detail & Related papers (2020-06-27T12:13:22Z)
- Don't Forget The Past: Recurrent Depth Estimation from Monocular Video [92.84498980104424]
We put three different types of depth estimation into a common framework.
Our method produces a time series of depth maps.
It can be applied to monocular videos only or be combined with different types of sparse depth patterns.
arXiv Detail & Related papers (2020-01-08T16:50:51Z)