OmniUnet: A Multimodal Network for Unstructured Terrain Segmentation on Planetary Rovers Using RGB, Depth, and Thermal Imagery
- URL: http://arxiv.org/abs/2508.00580v1
- Date: Fri, 01 Aug 2025 12:23:29 GMT
- Title: OmniUnet: A Multimodal Network for Unstructured Terrain Segmentation on Planetary Rovers Using RGB, Depth, and Thermal Imagery
- Authors: Raul Castilla-Arquillo, Carlos Perez-del-Pulgar, Levin Gerdes, Alfonso Garcia-Cerezo, Miguel A. Olivares-Mendez,
- Abstract summary: This work presents OmniUnet, a transformer-based neural network architecture for semantic segmentation using RGB, depth, and thermal imagery. A custom multimodal sensor housing was developed using 3D printing and mounted on the Martian Rover Testbed for Autonomy. A subset of the collected dataset was manually labeled to support supervised training of the network. Inference tests yielded an average prediction time of 673 ms on a resource-constrained computer.
- Score: 0.5837061763460748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robot navigation in unstructured environments requires multimodal perception systems that can support safe navigation. Multimodality enables the integration of complementary information collected by different sensors. However, this information must be processed by machine learning algorithms specifically designed to leverage heterogeneous data. Furthermore, it is necessary to identify which sensor modalities are most informative for navigation in the target environment. In Martian exploration, thermal imagery has proven valuable for assessing terrain safety due to differences in thermal behaviour between soil types. This work presents OmniUnet, a transformer-based neural network architecture for semantic segmentation using RGB, depth, and thermal (RGB-D-T) imagery. A custom multimodal sensor housing was developed using 3D printing and mounted on the Martian Rover Testbed for Autonomy (MaRTA) to collect a multimodal dataset in the Bardenas semi-desert in northern Spain. This location serves as a representative environment of the Martian surface, featuring terrain types such as sand, bedrock, and compact soil. A subset of this dataset was manually labeled to support supervised training of the network. The model was evaluated both quantitatively and qualitatively, achieving a pixel accuracy of 80.37% and demonstrating strong performance in segmenting complex unstructured terrain. Inference tests yielded an average prediction time of 673 ms on a resource-constrained computer (Jetson Orin Nano), confirming its suitability for on-robot deployment. The software implementation of the network and the labeled dataset have been made publicly available to support future research in multimodal terrain perception for planetary robotics.
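The abstract describes a multimodal RGB-D-T segmentation pipeline evaluated with pixel accuracy and timed on embedded hardware. Below is a minimal, hedged sketch of the general idea: early fusion of RGB, depth, and thermal channels into a small encoder-decoder segmentation model, together with the pixel-accuracy metric. This is not the OmniUnet architecture (which is transformer-based); the class count, layer choices, and input resolution are illustrative assumptions.

```python
# Illustrative sketch only: early-fusion RGB-D-T semantic segmentation and
# pixel accuracy. Architecture, class count, and resolution are assumptions,
# not the published OmniUnet model.
import torch
import torch.nn as nn


class RGBDTSegNet(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        # RGB (3) + depth (1) + thermal (1) concatenated into a 5-channel input.
        self.encoder = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),
        )

    def forward(self, rgb, depth, thermal):
        x = torch.cat([rgb, depth, thermal], dim=1)  # early fusion along channels
        return self.decoder(self.encoder(x))          # per-pixel class logits


def pixel_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of pixels whose predicted class matches the ground-truth label."""
    preds = logits.argmax(dim=1)
    return (preds == labels).float().mean().item()


if __name__ == "__main__":
    model = RGBDTSegNet(num_classes=5).eval()
    rgb = torch.rand(1, 3, 240, 320)
    depth = torch.rand(1, 1, 240, 320)
    thermal = torch.rand(1, 1, 240, 320)
    with torch.no_grad():
        logits = model(rgb, depth, thermal)
    labels = torch.randint(0, 5, (1, 240, 320))
    print(logits.shape, pixel_accuracy(logits, labels))
```

The reported 80.37% pixel accuracy and 673 ms average inference time on a Jetson Orin Nano refer to the actual OmniUnet model and dataset; the sketch above only illustrates the input fusion and the metric definition.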
Related papers
- MineInsight: A Multi-sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments [0.5339846068056558]
We introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset for landmine detection. The dataset features 35 different targets distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms.
arXiv Detail & Related papers (2025-06-05T10:08:24Z) - TUM2TWIN: Introducing the Large-Scale Multimodal Urban Digital Twin Benchmark Dataset [90.97440987655084]
Urban Digital Twins (UDTs) have become essential for managing cities and integrating complex, heterogeneous data from diverse sources. To address these challenges, we introduce the first comprehensive multimodal Urban Digital Twin benchmark dataset: TUM2TWIN. This dataset includes georeferenced, semantically aligned 3D models and networks along with various terrestrial, mobile, aerial, and satellite observations, boasting 32 data subsets over roughly 100,000 m² and currently 767 GB of data.
arXiv Detail & Related papers (2025-05-12T09:48:32Z) - CompSLAM: Complementary Hierarchical Multi-Modal Localization and Mapping for Robot Autonomy in Underground Environments [38.264929235624905]
CompSLAM is a multi-modal localization and mapping framework for robots. It was deployed on all aerial, legged, and wheeled robots of Team Cerberus during their competition-winning final run. This paper also introduces a dataset acquired by a manually teleoperated quadrupedal robot, covering a significant portion of the DARPA Subterranean Challenge finals course.
arXiv Detail & Related papers (2025-05-10T00:59:31Z) - Are We Ready for Real-Time LiDAR Semantic Segmentation in Autonomous Driving? [42.348499880894686]
Scene semantic segmentation can be achieved by directly integrating 3D spatial data with specialized deep neural networks.
We investigate various 3D semantic segmentation methodologies and analyze their performance and capabilities for resource-constrained inference on embedded NVIDIA Jetson platforms.
arXiv Detail & Related papers (2024-10-10T20:47:33Z) - Artifacts Mapping: Multi-Modal Semantic Mapping for Object Detection and 3D Localization [13.473742114288616]
We propose a framework that can autonomously detect and localize objects in a known environment.
The framework consists of three key elements: understanding the environment through RGB data, estimating depth through multi-modal sensor fusion, and managing artifacts.
Experiments show that the proposed framework can accurately detect 98% of the objects in the real sample environment, without post-processing.
arXiv Detail & Related papers (2023-07-03T15:51:39Z) - UnLoc: A Universal Localization Method for Autonomous Vehicles using LiDAR, Radar and/or Camera Input [51.150605800173366]
UnLoc is a novel unified neural modeling approach for localization with multi-sensor input in all weather conditions.
Our method is extensively evaluated on Oxford Radar RobotCar, ApolloSouthBay and Perth-WA datasets.
arXiv Detail & Related papers (2023-07-03T04:10:55Z) - Mixed-domain Training Improves Multi-Mission Terrain Segmentation [0.9566312408744931]
Current Martian terrain segmentation models require retraining for deployment across different domains.
This research proposes a semi-supervised learning approach that leverages unsupervised contrastive pretraining of a backbone for multi-mission semantic segmentation of Martian surfaces.
arXiv Detail & Related papers (2022-09-27T20:25:24Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - Deep Learning for Real Time Satellite Pose Estimation on Low Power Edge TPU [58.720142291102135]
In this paper we propose a pose estimation software exploiting neural network architectures.
We show how low power machine learning accelerators could enable Artificial Intelligence exploitation in space.
arXiv Detail & Related papers (2022-04-07T08:53:18Z) - Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems [92.26462290867963]
Kimera-Multi is the first multi-robot system that is robust and capable of identifying and rejecting incorrect inter- and intra-robot loop closures.
We demonstrate Kimera-Multi in photo-realistic simulations, SLAM benchmarking datasets, and challenging outdoor datasets collected using ground robots.
arXiv Detail & Related papers (2021-06-28T03:56:40Z) - Kimera-Multi: a System for Distributed Multi-Robot Metric-Semantic Simultaneous Localization and Mapping [57.173793973480656]
We present the first fully distributed multi-robot system for dense metric-semantic SLAM.
Our system, dubbed Kimera-Multi, is implemented by a team of robots equipped with visual-inertial sensors.
Kimera-Multi builds a 3D mesh model of the environment in real-time, where each face of the mesh is annotated with a semantic label.
arXiv Detail & Related papers (2020-11-08T21:38:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.