METER: a mobile vision transformer architecture for monocular depth
estimation
- URL: http://arxiv.org/abs/2403.08368v1
- Date: Wed, 13 Mar 2024 09:30:08 GMT
- Title: METER: a mobile vision transformer architecture for monocular depth
estimation
- Authors: L. Papa, P. Russo, and I. Amerini
- Abstract summary: We propose METER, a novel lightweight vision transformer architecture capable of achieving state-of-the-art estimations.
We provide a solution consisting of three alternative configurations of METER, a novel loss function to balance pixel estimation and reconstruction of image details, and a new data augmentation strategy to improve the overall final predictions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Depth estimation is a fundamental capability for autonomous systems that
need to assess their own state and perceive the surrounding environment. Deep
learning algorithms for depth estimation have gained significant interest in
recent years, owing to the potential of this methodology to overcome the
limitations of active depth sensing systems. Moreover, due to the low cost and
small size of monocular cameras, researchers have focused their attention on
monocular depth estimation (MDE), which consists of estimating a dense depth
map from a single RGB video frame. State-of-the-art MDE models typically rely
on vision transformer (ViT) architectures that are deep and complex, making
them unsuitable for fast inference on devices with hardware constraints. In
this paper, we address the problem of exploiting ViTs for MDE on embedded
devices, which are usually characterized by limited memory and low-power
CPUs/GPUs. We propose METER, a novel lightweight vision transformer
architecture that achieves state-of-the-art estimations and low-latency
inference on the considered embedded hardware: the NVIDIA Jetson TX1 and the
NVIDIA Jetson Nano. Our solution consists of three alternative configurations
of METER, a novel loss function that balances pixel-wise depth estimation
against the reconstruction of image details, and a new data augmentation
strategy that improves the overall final predictions. The proposed method
outperforms previous lightweight works on two benchmark datasets: the indoor
NYU Depth v2 and the outdoor KITTI.
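The abstract describes a loss that balances pixel-level accuracy against the reconstruction of image details, but does not spell out its formulation. The sketch below is a minimal, hedged illustration of that idea, assuming an L1 pixel term plus a finite-difference gradient term; both terms and the weight lambda_grad are assumptions, not METER's published loss.

```python
import torch
import torch.nn.functional as F

def balanced_depth_loss(pred, target, lambda_grad=0.5):
    """Hypothetical balanced loss: pixel-wise L1 plus an image-gradient
    term that penalizes blurred edges. The terms and the weight
    lambda_grad are illustrative assumptions, not METER's exact loss.

    pred, target: (B, 1, H, W) depth maps.
    """
    # Pixel-wise depth error.
    pixel_term = F.l1_loss(pred, target)

    # Detail (edge) error: compare finite differences along x and y.
    dpred_dx = pred[..., :, 1:] - pred[..., :, :-1]
    dtrue_dx = target[..., :, 1:] - target[..., :, :-1]
    dpred_dy = pred[..., 1:, :] - pred[..., :-1, :]
    dtrue_dy = target[..., 1:, :] - target[..., :-1, :]
    grad_term = F.l1_loss(dpred_dx, dtrue_dx) + F.l1_loss(dpred_dy, dtrue_dy)

    return pixel_term + lambda_grad * grad_term
```

Gradient terms of this kind are a common way to keep predicted depth maps sharp at object boundaries while the pixel term handles overall accuracy.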
Related papers
- HybridDepth: Robust Metric Depth Fusion by Leveraging Depth from Focus and Single-Image Priors [10.88048563201236]
We propose HYBRIDDEPTH, a robust depth estimation pipeline that addresses key challenges in the field.
We test our pipeline as an end-to-end system, with a newly developed mobile client to capture focal stacks, which are then sent to a GPU-powered server for depth estimation.
Comprehensive quantitative and qualitative analyses demonstrate that HYBRIDDEPTH outperforms state-of-the-art (SOTA) models on common datasets.
arXiv Detail & Related papers (2024-07-26T00:51:52Z)
- Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network [80.19054069988559]
We find that self-supervised monocular depth estimation exhibits direction sensitivity and environmental dependency.
We propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth representation in two aspects (a hedged sketch of a cumulative convolution follows this entry).
Experiments show that our method achieves significant improvements on three widely used benchmarks.
arXiv Detail & Related papers (2023-08-10T14:32:18Z)
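The entry above names a direction-aware cumulative convolution but does not define it. One plausible reading, sketched below purely as an assumption, accumulates features along the vertical image axis (the direction along which depth usually varies most) before a learned 1x1 mixing convolution; this is an illustration, not the paper's actual module.

```python
import torch
import torch.nn as nn

class CumulativeConv(nn.Module):
    """Illustrative direction-aware cumulative convolution (assumption):
    accumulate features down the height axis, normalize by the number of
    accumulated rows, then mix channels with a 1x1 convolution."""

    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W)
        cum = torch.cumsum(x, dim=2)           # running sum down the image
        counts = torch.arange(1, x.size(2) + 1, device=x.device,
                              dtype=x.dtype).view(1, 1, -1, 1)
        return self.mix(cum / counts)          # normalized mean, then mixed

# Usage: y = CumulativeConv(64)(torch.randn(2, 64, 32, 32))
```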
- Lightweight Monocular Depth Estimation via Token-Sharing Transformer [27.69898661818893]
Token-Sharing Transformer (TST) is an architecture that uses the Transformer for monocular depth estimation, optimized especially for embedded devices.
On the NYU Depth v2 dataset, TST delivers depth maps at up to 63.4 FPS on the NVIDIA Jetson Nano and 142.6 FPS on the NVIDIA Jetson TX2, with lower errors than existing methods (a latency-measurement sketch follows this entry).
arXiv Detail & Related papers (2023-06-09T05:51:40Z)
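The FPS figures quoted here and in METER's abstract come from timing forward passes on device. A minimal harness for such measurements might look as follows; the input shape, iteration counts, and the use of PyTorch are assumptions, and careful Jetson benchmarks additionally fix clock frequencies and power modes.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 224, 224), warmup=20, iters=100):
    """Rough throughput measurement; input_shape is a placeholder."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)

    for _ in range(warmup):                    # warm-up: stabilize clocks/caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()               # flush queued GPU kernels

    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```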
- Deep Learning for Real Time Satellite Pose Estimation on Low Power Edge TPU [58.720142291102135]
In this paper, we propose pose estimation software that exploits neural network architectures.
We show how low-power machine learning accelerators could enable the use of artificial intelligence in space.
arXiv Detail & Related papers (2022-04-07T08:53:18Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism (a minimal sketch of such a block follows this entry).
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
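DepthFormer's summary says attention models global context over image features. The sketch below shows the generic pattern behind that claim: flatten a CNN feature map into tokens and apply multi-head self-attention so every position can attend to every other. It illustrates the idea, not DepthFormer's specific design.

```python
import torch
import torch.nn as nn

class GlobalContextAttention(nn.Module):
    """Generic global-context block (illustrative, not DepthFormer's
    architecture): spatial positions become tokens for self-attention."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```

Because attention cost grows quadratically in H*W, blocks like this are usually applied only to low-resolution encoder stages, which is also why lightweight MDE models restrict where they place them.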
- Real-Time Monocular Human Depth Estimation and Segmentation on Embedded Systems [13.490605853268837]
Estimating scene depth to achieve collision avoidance against moving pedestrians is a crucial and fundamental problem in robotics.
This paper proposes a novel, low-complexity network architecture for fast and accurate human depth estimation and segmentation in indoor environments.
arXiv Detail & Related papers (2021-08-24T03:26:08Z)
- Probabilistic and Geometric Depth: Detecting Objects in Perspective [78.00922683083776]
3D object detection is an important capability needed in various practical applications such as driver assistance systems.
Monocular 3D detection, as an economical solution compared to conventional settings relying on binocular vision or LiDAR, has drawn increasing attention recently but still yields unsatisfactory results.
This paper first presents a systematic study on this problem and observes that the current monocular 3D detection problem can be simplified as an instance depth estimation problem.
arXiv Detail & Related papers (2021-07-29T16:30:33Z)
- Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion [56.85837052421469]
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars.
In this paper, we study the problem of predicting dense depth from a single RGB image with optional sparse measurements from low-cost active depth sensors.
We introduce Sparse Auxiliary Networks (SANs), a new module enabling monodepth networks to perform both depth prediction and completion (an interface sketch follows this entry).
arXiv Detail & Related papers (2021-03-30T21:22:26Z)
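SANs let one network handle both depth prediction (RGB only) and completion (RGB plus sparse depth). The sketch below shows only that interface pattern, a forward pass with an optional sparse-depth branch; the encoder, decoder, and additive fusion are placeholder assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class UnifiedDepthNet(nn.Module):
    """Interface sketch (assumption): one network serves prediction
    (RGB only) and completion (RGB + optional sparse depth)."""

    def __init__(self, feat=32):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, feat, 3, padding=1)     # placeholder encoder
        self.sparse_enc = nn.Conv2d(1, feat, 3, padding=1)  # sparse-depth branch
        self.head = nn.Conv2d(feat, 1, 3, padding=1)        # placeholder decoder

    def forward(self, rgb, sparse_depth=None):
        feats = self.rgb_enc(rgb)
        if sparse_depth is not None:           # completion mode: fuse sparse cues
            feats = feats + self.sparse_enc(sparse_depth)
        return self.head(feats)                # dense depth either way
```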
- On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation [1.9007546108571112]
Inferring the depth of images is a fundamental inverse problem within the field of Computer Vision.
We propose a new lightweight and fast supervised CNN architecture combined with novel feature extraction models.
We also introduce an efficient surface normals module, together with a simple geometric 2.5D loss function, to address single-image depth estimation (SIDE) problems (a normals-from-depth sketch follows this entry).
arXiv Detail & Related papers (2020-10-13T18:37:38Z)
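A standard way to obtain surface normals from a predicted depth map, sketched below, back-projects pixels with the camera intrinsics and takes the cross product of the two tangent vectors. The intrinsics fx, fy, cx, cy are assumed inputs, and this construction is a common baseline, not necessarily the paper's exact module.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth, fx, fy, cx, cy):
    """Surface normals from depth (standard construction, assumed
    intrinsics fx, fy, cx, cy). depth: (B, 1, H, W) in meters."""
    b, _, h, w = depth.shape
    v, u = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij")
    # Back-project pixels to 3D camera coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = torch.cat([x, y, depth], dim=1)      # (B, 3, H, W)

    # Tangent vectors via finite differences; normal = cross product.
    du = pts[..., :, 1:] - pts[..., :, :-1]    # (B, 3, H, W-1)
    dv = pts[..., 1:, :] - pts[..., :-1, :]    # (B, 3, H-1, W)
    n = torch.cross(du[..., :-1, :], dv[..., :, :-1], dim=1)
    return F.normalize(n, dim=1)               # unit normals, (B, 3, H-1, W-1)
```

A geometric 2.5D loss can then penalize the angular difference between normals computed from predicted and ground-truth depth.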
- MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation [22.495019810166397]
We propose a powerful network with a recurrent module that achieves the capacity of a deep network while maintaining an extremely lightweight size, enabling real-time, high-performance unsupervised monocular depth prediction from video sequences.
Our new model can run at a speed of about 110 frames per second (fps) on a single GPU, 37 fps on a single CPU, and 2 fps on a Raspberry Pi 3.
arXiv Detail & Related papers (2020-06-27T12:13:22Z)
- DepthNet Nano: A Highly Compact Self-Normalizing Neural Network for Monocular Depth Estimation [76.90627702089357]
DepthNet Nano is a compact deep neural network for monocular depth estimation designed using a human-machine collaborative design strategy.
The proposed DepthNet Nano possesses a highly efficient network architecture while still achieving performance comparable to state-of-the-art networks.
arXiv Detail & Related papers (2020-04-17T00:41:35Z)