RTS-Mono: A Real-Time Self-Supervised Monocular Depth Estimation Method for Real-World Deployment
- URL: http://arxiv.org/abs/2511.14107v1
- Date: Tue, 18 Nov 2025 03:47:04 GMT
- Title: RTS-Mono: A Real-Time Self-Supervised Monocular Depth Estimation Method for Real-World Deployment
- Authors: Zeyu Cheng, Tongfei Liu, Tao Lei, Xiang Hua, Yi Zhang, Chengkai Tang,
- Abstract summary: RTS-Mono is a lightweight and efficient encoder-decoder architecture. It achieves state-of-the-art (SoTA) performance at both high and low resolutions, and it runs real-time inference on an Nvidia Jetson Orin at 49 FPS.
- Score: 10.19871006168469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth information is crucial for autonomous driving and intelligent robot navigation. The simplicity and flexibility of self-supervised monocular depth estimation make it well suited to these fields. However, most existing monocular depth estimation models demand substantial computing resources, and although some methods reduce model size and improve computational efficiency, their performance deteriorates, seriously hindering the real-world deployment of self-supervised monocular depth estimation models. To address this problem, we propose RTS-Mono, a real-time self-supervised monocular depth estimation method, and deploy it in the real world. RTS-Mono is a lightweight and efficient encoder-decoder architecture: the encoder is based on Lite-Encoder, and the decoder uses a multi-scale sparse fusion framework to minimize redundancy, preserve performance, and improve inference speed. In experiments on the KITTI dataset, RTS-Mono achieves state-of-the-art (SoTA) performance at both high and low resolutions with an extremely low parameter count (3 M). Compared with lightweight methods, RTS-Mono improves Abs Rel and Sq Rel by 5.6% and 9.8% at low resolution, and improves Sq Rel and RMSE by 6.1% and 1.9% at high resolution. In real-world deployment experiments, RTS-Mono is highly accurate and performs real-time inference on an Nvidia Jetson Orin at 49 FPS. Source code is available at https://github.com/ZYCheng777/RTS-Mono.
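For reference, Abs Rel, Sq Rel, and RMSE cited above are the standard KITTI depth-evaluation metrics. A minimal sketch of their definitions (not the authors' code; the sample depth values below are illustrative, not from the paper):

```python
import math

def depth_metrics(pred, gt):
    """Standard monocular depth-evaluation metrics used on KITTI.

    pred, gt: equal-length lists of predicted / ground-truth depths
    in metres, with gt > 0. Returns (abs_rel, sq_rel, rmse).
    """
    n = len(gt)
    # Abs Rel: mean absolute error relative to ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # Sq Rel: mean squared error relative to ground-truth depth.
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    # RMSE: root mean squared error in metres.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    return abs_rel, sq_rel, rmse

# Illustrative depths (metres), not results from the paper.
print(depth_metrics([10.5, 20.0, 31.0], [10.0, 21.0, 30.0]))
```

Lower is better for all three; in practice these are computed per image over valid ground-truth pixels and averaged over the test set.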
Related papers
- Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design [72.55935017828891]
We present Le-DETR (Low-cost and Efficient DEtection TRansformer). It achieves a new SOTA in real-time detection using only the ImageNet-1K and COCO 2017 training datasets, and surpasses YOLOv12-L/X by +0.6/-0.1 mAP while achieving comparable speed and a +20% speedup.
arXiv Detail & Related papers (2026-02-24T15:29:55Z) - ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving [62.9051914830949]
We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving. A lightweight acquisition pipeline ensures scalable collection, while sparse but statistically sufficient ground truth supports robust training. Benchmarking with state-of-the-art monocular depth models reveals severe cross-dataset generalization failures.
arXiv Detail & Related papers (2025-08-19T16:13:49Z) - LMDepth: Lightweight Mamba-based Monocular Depth Estimation for Real-World Deployment [3.8883236454187347]
LMDepth is a lightweight monocular depth estimation network designed to reconstruct high-precision depth information. We show that LMDepth achieves higher performance with fewer parameters and lower computational complexity. We further deploy LMDepth on an embedded platform with INT8 quantization, validating its practicality for real-world edge applications.
arXiv Detail & Related papers (2025-05-02T04:00:03Z) - MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection [72.46396769642787]
We develop a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient infrared small target detection.
MiM-ISTD is 8× faster than the SOTA method and reduces GPU memory usage by 62.2% when testing on 2048×2048 images.
arXiv Detail & Related papers (2024-03-04T15:57:29Z) - Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach computes self-attention in a fixed low-resolution space regardless of the input image's resolution. We demonstrate the effectiveness of LRSA by building LRFormer, a vision transformer with an encoder-decoder structure.
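The core LRSA idea, attention computed in a fixed low-resolution token grid independent of the input size, can be sketched as follows. This is an illustrative NumPy sketch, not the LRFormer implementation; the `grid` size, average pooling, identity Q/K/V projections, and nearest-neighbour upsampling are all simplifying assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_res_self_attention(feat, grid=8):
    """Self-attention in a fixed `grid` x `grid` low-resolution space,
    regardless of the input feature map's spatial resolution.

    feat: (H, W, C) array with H and W divisible by `grid`.
    """
    H, W, C = feat.shape
    bh, bw = H // grid, W // grid
    # Average-pool down to a fixed (grid, grid, C) map: the attention
    # cost now depends only on `grid`, not on H or W.
    low = feat.reshape(grid, bh, grid, bw, C).mean(axis=(1, 3))
    tokens = low.reshape(grid * grid, C)
    # Single-head scaled dot-product attention over grid*grid tokens
    # (identity Q/K/V projections for brevity).
    attn = softmax(tokens @ tokens.T / np.sqrt(C))
    out = (attn @ tokens).reshape(grid, grid, C)
    # Nearest-neighbour upsample back to (H, W, C), add residually.
    up = np.repeat(np.repeat(out, bh, axis=0), bw, axis=1)
    return feat + up
```

Because the token count is fixed at `grid * grid`, the quadratic attention term stays constant as input resolution grows; only the pooling and upsampling scale with H and W.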
arXiv Detail & Related papers (2023-10-08T06:10:09Z) - Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation [1.6775954077761863]
We present a fully convolutional depth estimation network using contextual feature fusion.
Compared to UNet++ and HRNet, we use high-resolution and low-resolution features to preserve information on small targets and fast-moving objects.
Our method reduces the parameters without sacrificing accuracy.
arXiv Detail & Related papers (2023-09-17T13:40:15Z) - Real-time Monocular Depth Estimation on Embedded Systems [32.40848141360501]
Two efficient RT-MonoDepth and RT-MonoDepth-S architectures are proposed.
RT-MonoDepth and RT-MonoDepth-S achieve frame rates of 18.4/30.5 FPS on NVIDIA Jetson Nano and 253.0/364.1 FPS on Jetson AGX Orin.
arXiv Detail & Related papers (2023-08-21T08:59:59Z) - Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation [9.967643080731683]
We investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture Lite-Mono.
The full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
arXiv Detail & Related papers (2022-11-23T18:43:41Z) - Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR. Fusing these two modalities can significantly boost the performance of 3D perception models. We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z) - Deep Learning for Real Time Satellite Pose Estimation on Low Power Edge TPU [58.720142291102135]
In this paper we propose pose estimation software that exploits neural network architectures. We show how low-power machine learning accelerators could enable the exploitation of Artificial Intelligence in space.
arXiv Detail & Related papers (2022-04-07T08:53:18Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.