Lightweight Monocular Depth Estimation via Token-Sharing Transformer
- URL: http://arxiv.org/abs/2306.05682v1
- Date: Fri, 9 Jun 2023 05:51:40 GMT
- Title: Lightweight Monocular Depth Estimation via Token-Sharing Transformer
- Authors: Dong-Jae Lee, Jae Young Lee, Hyounguk Shon, Eojindl Yi, Yeong-Hun
Park, Sung-Sik Cho, Junmo Kim
- Abstract summary: Token-Sharing Transformer (TST) is an architecture using the Transformer for monocular depth estimation, optimized especially for embedded devices.
On the NYU Depth v2 dataset, TST can deliver depth maps at up to 63.4 FPS on the NVIDIA Jetson Nano and 142.6 FPS on the NVIDIA Jetson TX2, with lower errors than existing methods.
- Score: 27.69898661818893
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Depth estimation is an important task in various robotics systems and
applications. In mobile robotics systems, monocular depth estimation is
desirable because a single RGB camera can be deployed at low cost and in a
compact size. Owing to this significant and growing need, many lightweight
monocular depth estimation networks have been proposed for mobile robotics
systems. While most lightweight monocular depth estimation methods have been
developed using convolutional neural networks, the Transformer has recently
been adopted for monocular depth estimation as well. However, the Transformer's
massive parameter count and large computational cost hinder its deployment on
embedded devices. In this paper, we present the Token-Sharing Transformer (TST),
an architecture that uses the Transformer for monocular depth estimation and is
optimized especially for embedded devices. The proposed TST utilizes global
token sharing, which enables the model to obtain accurate depth predictions
with high throughput on embedded devices. Experimental results show that TST
outperforms existing lightweight monocular depth estimation methods. On the NYU
Depth v2 dataset, TST delivers depth maps at up to 63.4 FPS on the NVIDIA
Jetson Nano and 142.6 FPS on the NVIDIA Jetson TX2, with lower errors than
existing methods. Furthermore, TST achieves real-time depth estimation of
high-resolution images on the Jetson TX2 with competitive results.
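The abstract does not spell out the architecture, but the core idea of "global token sharing" suggests a small set of global tokens that summarize the scene and are reused when redistributing context to patch features, keeping attention cost linear in the number of patches. The following is a minimal, hypothetical PyTorch sketch of such a block; all class names, shapes, and the exact sharing scheme are assumptions for illustration and may differ from the paper's actual TST design.

```python
# Hypothetical sketch of a "token-sharing" cross-attention block for lightweight
# monocular depth estimation. Names, shapes, and the sharing scheme are
# illustrative assumptions; the paper's actual TST design may differ.
import torch
import torch.nn as nn


class SharedTokenAttention(nn.Module):
    """Cross-attention where a small set of learned global tokens gathers
    scene-level context and redistributes it to per-patch features."""

    def __init__(self, dim: int, num_global_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Global tokens are learned once and shared across the network.
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        nn.init.trunc_normal_(self.global_tokens, std=0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, C) flattened backbone features.
        b = patch_feats.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        # 1) Global tokens attend to all patches (cost O(N * num_global_tokens)).
        g, _ = self.gather(g, patch_feats, patch_feats)
        # 2) Patches query only the few global tokens instead of each other,
        #    avoiding O(N^2) self-attention, which is what makes this kind of
        #    design attractive on embedded devices such as Jetson boards.
        ctx, _ = self.scatter(patch_feats, g, g)
        return self.norm(patch_feats + ctx)


class TinyDepthHead(nn.Module):
    """Toy depth head: shared-token attention followed by a linear projection."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.attn = SharedTokenAttention(dim)
        self.to_depth = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor, h: int, w: int) -> torch.Tensor:
        x = self.attn(feats)                    # (B, N, C)
        depth = self.to_depth(x).squeeze(-1)    # (B, N)
        return depth.view(-1, 1, h, w)          # (B, 1, H, W) depth map


if __name__ == "__main__":
    b, h, w, c = 2, 30, 40, 64
    feats = torch.randn(b, h * w, c)            # stand-in for backbone features
    print(TinyDepthHead(c)(feats, h, w).shape)  # torch.Size([2, 1, 30, 40])
```

The key design point illustrated here is that the patch features never attend to each other directly; all long-range interaction is routed through the shared global tokens, so compute grows linearly with image resolution rather than quadratically.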
Related papers
- METER: a mobile vision transformer architecture for monocular depth estimation [0.0]
We propose METER, a novel lightweight vision transformer architecture capable of achieving state-of-the-art estimations.
We provide a solution consisting of three alternative configurations of METER, a novel loss function to balance pixel estimation and reconstruction of image details, and a new data augmentation strategy to improve the overall final predictions.
arXiv Detail & Related papers (2024-03-13T09:30:08Z)
- VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z)
- Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation [1.6775954077761863]
We present a fully convolutional depth estimation network using contextual feature fusion.
Compared to UNet++ and HRNet, we use high-resolution and low-resolution features to preserve information about small targets and fast-moving objects.
Our method reduces the parameters without sacrificing accuracy.
arXiv Detail & Related papers (2023-09-17T13:40:15Z)
- Real-time Monocular Depth Estimation on Embedded Systems [32.40848141360501]
Two efficient architectures, RT-MonoDepth and RT-MonoDepth-S, are proposed.
RT-MonoDepth and RT-MonoDepth-S achieve frame rates of 18.4 and 30.5 FPS on the NVIDIA Jetson Nano and 253.0 and 364.1 FPS on the Jetson AGX Orin, respectively.
arXiv Detail & Related papers (2023-08-21T08:59:59Z)
- UDepth: Fast Monocular Depth Estimation for Visually-guided Underwater Robots [4.157415305926584]
We present a fast monocular depth estimation method for enabling 3D perception capabilities of low-cost underwater robots.
We formulate a novel end-to-end deep visual learning pipeline named UDepth, which incorporates domain knowledge of image formation characteristics of natural underwater scenes.
arXiv Detail & Related papers (2022-09-26T01:08:36Z)
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on an indoor NYUV2 dataset and an outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state of the art by 3.3% and 3.3% in terms of Root Mean Squared Error (RMSE) on the two benchmarks, respectively.
arXiv Detail & Related papers (2022-07-10T20:49:11Z)
- Deep Learning for Real Time Satellite Pose Estimation on Low Power Edge TPU [58.720142291102135]
In this paper, we propose pose estimation software that exploits neural network architectures.
We show how low power machine learning accelerators could enable Artificial Intelligence exploitation in space.
arXiv Detail & Related papers (2022-04-07T08:53:18Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection and contour estimation. The multi-task mechanism promotes the model to learn the task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
- Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion [56.85837052421469]
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars.
In this paper, we study the problem of predicting dense depth from a single RGB image with optional sparse measurements from low-cost active depth sensors.
We introduce Sparse Auxiliary Networks (SANs), a new module enabling monodepth networks to perform both the tasks of depth prediction and completion.
arXiv Detail & Related papers (2021-03-30T21:22:26Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.