LMDepth: Lightweight Mamba-based Monocular Depth Estimation for Real-World Deployment
- URL: http://arxiv.org/abs/2505.00980v1
- Date: Fri, 02 May 2025 04:00:03 GMT
- Title: LMDepth: Lightweight Mamba-based Monocular Depth Estimation for Real-World Deployment
- Authors: Jiahuan Long, Xin Zhou
- Abstract summary: LMDepth is a lightweight monocular depth estimation network designed to reconstruct high-precision depth information. We show that LMDepth achieves higher performance with fewer parameters and lower computational complexity. We further deploy LMDepth on an embedded platform with INT8 quantization, validating its practicality for real-world edge applications.
- Score: 3.8883236454187347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular depth estimation provides an additional depth dimension to RGB images, making it widely applicable in fields such as virtual reality, autonomous driving, and robotic navigation. However, existing depth estimation algorithms often struggle to balance performance and computational efficiency, which poses challenges for deployment on resource-constrained devices. To address this, we propose LMDepth, a lightweight Mamba-based monocular depth estimation network designed to reconstruct high-precision depth information while maintaining low computational overhead. Specifically, we propose a modified pyramid spatial pooling module that serves as a multi-scale feature aggregator and context extractor, ensuring global spatial information for accurate depth estimation. Moreover, we integrate multiple depth Mamba blocks into the decoder. Designed with linear-time computation, these Mamba blocks enable LMDepth to efficiently decode depth information from global features, providing a lightweight alternative to Transformer-based architectures that depend on complex attention mechanisms. Extensive experiments on the NYUDv2 and KITTI datasets demonstrate the effectiveness of LMDepth: compared to previous lightweight depth estimation methods, it achieves higher accuracy with fewer parameters and lower computational complexity (measured in GFLOPs). We further deploy LMDepth on an embedded platform with INT8 quantization, validating its practicality for real-world edge applications.
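The abstract describes the pyramid spatial pooling module only at a high level. As a rough illustration of the multi-scale aggregation idea, here is a minimal PyTorch sketch; the class name, bin sizes, and channel choices are assumptions for illustration, not LMDepth's actual (modified) module, and the Mamba decoder blocks are omitted since their internals are not given here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSpatialPooling(nn.Module):
    """Toy pyramid pooling block: pool the feature map at several scales,
    project each pooled map, upsample back, and fuse everything with a
    1x1 convolution. Hypothetical sketch, not LMDepth's modified module."""

    def __init__(self, in_ch: int, out_ch: int, bin_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),          # regional/global context
                nn.Conv2d(in_ch, in_ch // 4, 1),  # reduce channels per branch
            )
            for b in bin_sizes
        )
        fused_ch = in_ch + len(bin_sizes) * (in_ch // 4)
        self.project = nn.Conv2d(fused_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [x]
        for branch in self.branches:
            y = branch(x)                         # (B, C/4, b, b)
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                       align_corners=False))
        return self.project(torch.cat(feats, dim=1))

# e.g. a 256-channel encoder feature map at 1/16 resolution
out = PyramidSpatialPooling(256, 256)(torch.randn(1, 256, 24, 32))
```

The INT8 deployment mentioned at the end of the abstract would typically be done with a standard post-training quantization toolchain; the listing does not say which one LMDepth uses.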
Related papers
- Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective [54.91271106816616]
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. We propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD from three fundamental perspectives. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework.
arXiv Detail & Related papers (2025-05-07T19:37:20Z)
- QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize MDE models with hardware acceleration for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
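The QuartDepth entry above quantizes both weights and activations to 4-bit precision. As a generic illustration of symmetric uniform post-training quantization (not the paper's actual scheme, which additionally relies on a custom hardware accelerator), a minimal sketch:

```python
import numpy as np

def quantize_sym(x: np.ndarray, n_bits: int = 4):
    """Symmetric uniform post-training quantization: map floats to signed
    n-bit integer codes plus one float scale per tensor."""
    qmax = 2 ** (n_bits - 1) - 1            # 7 for 4-bit signed
    scale = np.abs(x).max() / qmax + 1e-12  # guard against all-zero tensors
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)    # stand-in weight matrix
q, s = quantize_sym(w, n_bits=4)
recon_err = np.abs(w - dequantize(q, s)).mean()   # quantization error
```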
- Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation [108.04354143020886]
We introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. We use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution.
arXiv Detail & Related papers (2024-12-18T16:32:12Z)
- Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution. We demonstrate the effectiveness of our LRSA approach by building LRFormer, a vision transformer with an encoder-decoder structure.
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
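The LRSA entry above computes self-attention in a fixed low-resolution space regardless of input size. A hedged PyTorch sketch of that idea follows; the class name, pooling choice, and upsampling path are assumptions, not LRFormer's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResSelfAttention(nn.Module):
    """Self-attention over a fixed k x k downsampled copy of the input,
    upsampled back and added residually. Attention cost depends on k*k
    tokens, not on H*W. Hypothetical sketch."""

    def __init__(self, channels: int, pooled_size: int = 16, heads: int = 4):
        super().__init__()
        self.k = pooled_size
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        low = F.adaptive_avg_pool2d(x, self.k)            # (B, C, k, k)
        tokens = low.flatten(2).transpose(1, 2)           # (B, k*k, C)
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.transpose(1, 2).reshape(b, c, self.k, self.k)
        out = F.interpolate(out, size=(h, w), mode="bilinear",
                            align_corners=False)
        return x + out                                    # residual

y = LowResSelfAttention(64)(torch.randn(1, 64, 128, 128))
```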
- Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation [1.6775954077761863]
We present a fully convolutional depth estimation network using contextual feature fusion.
Compared to UNet++ and HRNet, we use high-resolution and low-resolution features to preserve information on small targets and fast-moving objects.
Our method reduces the parameters without sacrificing accuracy.
arXiv Detail & Related papers (2023-09-17T13:40:15Z)
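The entry above fuses high- and low-resolution features to preserve detail on small targets. A generic fusion block of that kind might look like the following sketch (illustrative only, not the paper's aggregation module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighLowFusion(nn.Module):
    """Upsample a coarse (low-resolution) feature map, concatenate it with
    a fine (high-resolution) one, and mix with a 3x3 convolution.
    Generic contextual-fusion sketch, not the paper's exact block."""

    def __init__(self, high_ch: int, low_ch: int, out_ch: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(high_ch + low_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, high_res, low_res):
        low_up = F.interpolate(low_res, size=high_res.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.mix(torch.cat([high_res, low_up], dim=1))

fused = HighLowFusion(64, 128, 64)(torch.randn(1, 64, 96, 96),
                                   torch.randn(1, 128, 24, 24))
```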
- Depth Completion with Multiple Balanced Bases and Confidence for Dense Monocular SLAM [33.66705447919248]
We propose a novel method that integrates a lightweight depth completion network into a sparse SLAM system. Specifically, we present an optimized multi-basis depth completion network, called BBC-Net. BBC-Net can predict multiple balanced bases and a confidence map from a monocular image with sparse points generated by off-the-shelf keypoint-based SLAM systems.
arXiv Detail & Related papers (2023-09-08T06:15:27Z)
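The BBC-Net entry above predicts multiple depth bases plus a confidence map and anchors them to sparse SLAM points. One plausible way to combine predicted bases with sparse measurements is a least-squares fit of per-basis weights, sketched below; the function and unweighted fit are assumptions (the paper's balanced-basis and confidence formulation is not reproduced here):

```python
import numpy as np

def fit_basis_weights(bases: np.ndarray, sparse_depth: np.ndarray,
                      mask: np.ndarray) -> np.ndarray:
    """bases: (K, H, W) predicted depth bases; sparse_depth: (H, W) metric
    depths valid where mask is True. Solves min_w ||B w - d||^2 over the
    sparse pixels, then combines the bases densely with the fitted weights."""
    B = bases[:, mask].T                     # (N_sparse, K) design matrix
    d = sparse_depth[mask]                   # (N_sparse,)
    w, *_ = np.linalg.lstsq(B, d, rcond=None)
    return np.tensordot(w, bases, axes=1)    # (H, W) dense depth

# toy check: recover a known combination from 5 sparse anchors
bases = np.random.rand(3, 4, 4)
gt = 2.0 * bases[0] + 0.5 * bases[1]         # hypothetical true depth
mask = np.zeros((4, 4), dtype=bool)
mask.flat[[0, 3, 6, 9, 12]] = True
dense = fit_basis_weights(bases, gt, mask)   # ~= gt everywhere
```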
- EndoDepthL: Lightweight Endoscopic Monocular Depth Estimation with CNN-Transformer [0.0]
We propose a novel lightweight solution named EndoDepthL that integrates CNNs and Transformers to predict multi-scale depth maps.
Our approach optimizes the network architecture and incorporates multi-scale dilated convolution and a multi-channel attention mechanism.
To better evaluate the performance of monocular depth estimation in endoscopic imaging, we propose a novel complexity evaluation metric.
arXiv Detail & Related papers (2023-08-04T21:38:29Z)
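Multi-scale dilated convolution, mentioned in the EndoDepthL entry above, is commonly realized as parallel 3x3 convolutions with different dilation rates. A generic sketch (not EndoDepthL's exact block, and omitting its multi-channel attention):

```python
import torch
import torch.nn as nn

class MultiScaleDilated(nn.Module):
    """Parallel 3x3 convolutions with increasing dilation rates, concatenated
    and projected back to the input width. Larger dilations widen the
    receptive field at constant parameter count. Generic sketch."""

    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

y = MultiScaleDilated(32)(torch.randn(1, 32, 64, 64))
```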
- Monocular Visual-Inertial Depth Estimation [66.71452943981558]
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry.
Our approach performs global scale and shift alignment against sparse metric depth, followed by learning-based dense alignment.
We evaluate on the TartanAir and VOID datasets, observing up to 30% reduction in RMSE with dense scale alignment.
arXiv Detail & Related papers (2023-03-21T18:47:34Z)
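Global scale-and-shift alignment against sparse metric depth, as in the entry above, has a closed-form least-squares solution. A minimal sketch assuming a simple affine model (the paper's subsequent learning-based dense alignment is not shown):

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, sparse: np.ndarray,
                      mask: np.ndarray) -> np.ndarray:
    """Fit s, t minimizing ||s * pred + t - sparse||^2 over pixels where
    mask is True, then apply the affine map globally. Standard alignment
    of a relative depth map to sparse metric measurements."""
    p, d = pred[mask], sparse[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)   # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * pred + t

pred = np.random.rand(480, 640)                  # relative (unitless) depth
sparse = np.zeros_like(pred)
mask = np.random.rand(480, 640) < 0.001          # ~0.1% sparse anchors
sparse[mask] = 3.0 * pred[mask] + 0.5            # hypothetical metric values
aligned = align_scale_shift(pred, sparse, mask)  # now in metric units
```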
- Learning an Efficient Multimodal Depth Completion Model [11.740546882538142]
RGB image-guided sparse depth completion has attracted extensive attention recently, but still faces some problems.
The proposed method can outperform some state-of-the-art methods with a lightweight architecture.
The method also won first place in the MIPI2022 RGB+TOF depth completion challenge.
arXiv Detail & Related papers (2022-08-23T07:03:14Z)
- Struct-MDC: Mesh-Refined Unsupervised Depth Completion Leveraging Structural Regularities from Visual SLAM [1.8899300124593648]
Feature-based visual simultaneous localization and mapping (SLAM) methods only estimate the depth of extracted features.
Thus, depth completion tasks, which estimate dense depth from sparse depth, have gained significant importance in robotic applications such as exploration.
We propose a mesh depth refinement (MDR) module to address this problem.
Struct-MDC outperforms other state-of-the-art algorithms on both public and our custom datasets.
arXiv Detail & Related papers (2022-04-29T04:29:17Z)
- Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO).
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection, and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)