Real-Time Monocular Human Depth Estimation and Segmentation on Embedded
Systems
- URL: http://arxiv.org/abs/2108.10506v1
- Date: Tue, 24 Aug 2021 03:26:08 GMT
- Title: Real-Time Monocular Human Depth Estimation and Segmentation on Embedded
Systems
- Authors: Shan An, Fangru Zhou, Mei Yang, Haogang Zhu, Changhong Fu, and
Konstantinos A. Tsintotas
- Abstract summary: Estimating a scene's depth to achieve collision avoidance against moving pedestrians is a crucial and fundamental problem in the robotic field.
This paper proposes a novel, low complexity network architecture for fast and accurate human depth estimation and segmentation in indoor environments.
- Score: 13.490605853268837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimating a scene's depth to achieve collision avoidance against moving
pedestrians is a crucial and fundamental problem in the robotic field. This
paper proposes a novel, low complexity network architecture for fast and
accurate human depth estimation and segmentation in indoor environments, targeting
applications on resource-constrained platforms (including battery-powered
aerial, micro-aerial, and ground vehicles) with a monocular camera as the
primary perception module. Following the encoder-decoder structure, the
proposed framework consists of two branches, one for depth prediction and
another for semantic segmentation. Moreover, network structure optimization is
employed to improve its forward inference speed. Exhaustive experiments on
three self-generated datasets prove our pipeline's capability to execute in
real-time, achieving higher frame rates than contemporary state-of-the-art
frameworks (114.6 frames per second on an NVIDIA Jetson Nano GPU with TensorRT)
while maintaining comparable accuracy.
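The two-branch design described above (a shared encoder feeding separate depth and segmentation decoders) can be sketched at the shape level as follows. This is a minimal NumPy illustration of the data flow only, not the authors' network: the pooling and random-projection steps stand in for learned convolutional layers, and all names and sizes are assumptions.

```python
import numpy as np

def pool2(x):
    """2x2 average pooling: (H, W, C) -> (H/2, W/2, C). Stands in for a strided conv."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x, k):
    """Nearest-neighbour upsampling by factor k, as a stand-in for a learned decoder."""
    return x.repeat(k, axis=0).repeat(k, axis=1)

def forward(img, n_classes=2, seed=0):
    """Shared encoder, then two decoder branches: depth map and class logits."""
    rng = np.random.default_rng(seed)
    feat = img
    for _ in range(2):                      # shared encoder: 4x spatial downsampling
        feat = pool2(feat)
    # depth branch: project features to one channel, restore input resolution
    w_depth = rng.standard_normal((feat.shape[-1], 1))
    depth = upsample(feat @ w_depth, 4)
    # segmentation branch: 1x1-conv-style projection to per-pixel class logits
    w_seg = rng.standard_normal((feat.shape[-1], n_classes))
    seg_logits = upsample(feat @ w_seg, 4)
    return depth, seg_logits

img = np.random.default_rng(1).random((64, 64, 3))
depth, seg = forward(img)
print(depth.shape, seg.shape)   # (64, 64, 1) (64, 64, 2)
```

Sharing the encoder is what keeps the parameter count and per-frame cost low enough for an embedded GPU: both tasks reuse one feature extraction pass, and only the lightweight decoder heads run twice.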
Related papers
- A Hybrid Autoencoder for Robust Heightmap Generation from Fused Lidar and Depth Data for Humanoid Robot Locomotion [2.9223917785251285]
This paper presents a learning-based framework that uses an intermediate, robot-centric heightmap representation.
A hybrid Encoder-Decoder Structure (EDS) is introduced, utilizing a Convolutional Neural Network (CNN) for spatial feature extraction.
Results demonstrate that multimodal fusion improves reconstruction accuracy by 7.2% over depth-only and 9.9% over LiDAR-only configurations.
arXiv Detail & Related papers (2026-02-05T16:38:42Z) - Video Depth Propagation [54.523028170425256]
Existing methods rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies.
We propose VeloDepth, which effectively leverages an online video pipeline and performs deep feature propagation.
Our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency.
arXiv Detail & Related papers (2025-12-11T15:08:37Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities.
Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark.
We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - Real-Time Detection and Tracking of Foreign Object Intrusions in Power Systems via Feature-Based Edge Intelligence [4.60587070358843]
This paper presents a novel framework for real-time foreign object intrusion (FOI) detection and tracking in power transmission systems.
The framework integrates: (1) a YOLOv7 segmentation model for fast and robust object localization, (2) a ConvNeXt-based feature extractor trained with triplet loss to generate discriminative embeddings, and (3) a feature-assisted IoU tracker.
To enable scalable field deployment, the pipeline is optimized for low-cost edge hardware using mixed-precision inference.
arXiv Detail & Related papers (2025-09-16T17:17:03Z) - DashCam Video: A complementary low-cost data stream for on-demand forest-infrastructure system monitoring [1.6064410860203764]
This study introduces a novel, low-cost, and reproducible framework for real-time, object-level structural assessment and geolocation of roadside vegetation and infrastructure.
We developed an end-to-end pipeline that combines monocular depth estimation, depth error correction, and geometric triangulation to generate accurate spatial and structural data from vehicle-mounted dashcams.
Our approach complements conventional RS methods, such as LiDAR and imagery, by offering a fast, real-time, and cost-effective solution for object-level monitoring of vegetation risks and infrastructure exposure.
arXiv Detail & Related papers (2025-08-15T16:55:12Z) - SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams [70.9610707466343]
Bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality.
Existing methods lack specialized stereo algorithms and benchmarks tailored to spike data.
We propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams.
arXiv Detail & Related papers (2025-05-26T04:14:34Z) - Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images [0.9883261192383611]
In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in unstructured environments.
We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly.
arXiv Detail & Related papers (2025-03-23T08:25:07Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our architecture, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - METER: a mobile vision transformer architecture for monocular depth
estimation [0.0]
We propose METER, a novel lightweight vision transformer architecture capable of achieving state-of-the-art estimations.
We provide a solution consisting of three alternative configurations of METER, a novel loss function to balance pixel estimation and reconstruction of image details, and a new data augmentation strategy to improve the overall final predictions.
arXiv Detail & Related papers (2024-03-13T09:30:08Z) - Real-time Monocular Depth Estimation on Embedded Systems [32.40848141360501]
Two efficient RT-MonoDepth and RT-MonoDepth-S architectures are proposed.
RT-MonoDepth and RT-MonoDepth-S achieve frame rates of 18.4 and 30.5 FPS on the NVIDIA Jetson Nano, and 253.0 and 364.1 FPS on the Jetson AGX Orin, respectively.
arXiv Detail & Related papers (2023-08-21T08:59:59Z) - Rethinking Lightweight Salient Object Detection via Network Depth-Width
Tradeoff [26.566339984225756]
Existing salient object detection methods often adopt deeper and wider networks for better performance.
We propose a novel trilateral decoder framework by decoupling the U-shape structure into three complementary branches.
We show that our method achieves better efficiency-accuracy balance across five benchmarks.
arXiv Detail & Related papers (2023-01-17T03:43:25Z) - Distortion-Aware Network Pruning and Feature Reuse for Real-time Video
Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frames.
arXiv Detail & Related papers (2022-06-20T07:20:02Z) - Deep Learning for Real Time Satellite Pose Estimation on Low Power Edge
TPU [58.720142291102135]
In this paper, we propose pose estimation software that exploits neural network architectures.
We show how low power machine learning accelerators could enable Artificial Intelligence exploitation in space.
arXiv Detail & Related papers (2022-04-07T08:53:18Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z) - Unsupervised Monocular Depth Learning with Integrated Intrinsics and
Spatio-Temporal Constraints [61.46323213702369]
This work presents an unsupervised learning framework that is able to predict at-scale depth maps and egomotion.
Our results demonstrate strong performance when compared to the current state-of-the-art on multiple sequences of the KITTI driving dataset.
arXiv Detail & Related papers (2020-11-02T22:26:58Z) - Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate superior performance, with better accuracy and speed than existing approaches.
arXiv Detail & Related papers (2020-07-07T22:37:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.