Few-shot Semantic Learning for Robust Multi-Biome 3D Semantic Mapping in Off-Road Environments
- URL: http://arxiv.org/abs/2411.06632v1
- Date: Sun, 10 Nov 2024 23:52:24 GMT
- Title: Few-shot Semantic Learning for Robust Multi-Biome 3D Semantic Mapping in Off-Road Environments
- Authors: Deegan Atha, Xianmei Lei, Shehryar Khattak, Anna Sabel, Elle Miller, Aurelio Noca, Grace Lim, Jeffrey Edlund, Curtis Padgett, Patrick Spieler
- Abstract summary: Off-road environments pose significant perception challenges for high-speed autonomous navigation.
We propose an approach that leverages a pre-trained Vision Transformer (ViT) with fine-tuning on a small (<500 images), sparse and coarsely labeled (<30% pixels) multi-biome dataset to predict 2D semantic segmentation classes.
These classes are fused over time via a novel range-based metric and aggregated into a 3D semantic voxel map.
- Score: 4.106846770364469
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Off-road environments pose significant perception challenges for high-speed autonomous navigation due to unstructured terrain, degraded sensing conditions, and domain-shifts among biomes. Learning semantic information across these conditions and biomes can be challenging when a large amount of ground truth data is required. In this work, we propose an approach that leverages a pre-trained Vision Transformer (ViT) with fine-tuning on a small (<500 images), sparse and coarsely labeled (<30% pixels) multi-biome dataset to predict 2D semantic segmentation classes. These classes are fused over time via a novel range-based metric and aggregated into a 3D semantic voxel map. We demonstrate zero-shot out-of-biome 2D semantic segmentation on the Yamaha (52.9 mIoU) and Rellis (55.5 mIoU) datasets along with few-shot coarse sparse labeling with existing data for improved segmentation performance on Yamaha (66.6 mIoU) and Rellis (67.2 mIoU). We further illustrate the feasibility of using a voxel map with a range-based semantic fusion approach to handle common off-road hazards like pop-up hazards, overhangs, and water features.
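To make the fusion step concrete, below is a minimal Python sketch of accumulating back-projected 2D class probabilities into a 3D semantic voxel map with a range-dependent weight. The inverse-range weighting, voxel size, and class count are illustrative assumptions; the paper's actual range-based metric and mapping backend are not specified in this listing.

```python
# Hypothetical sketch: accumulate per-pixel class probabilities into a
# semantic voxel grid with a range-dependent weight. The inverse-range
# weighting, 0.2 m voxel size, and 8 classes below are assumptions, not
# the paper's published method.
import numpy as np

NUM_CLASSES = 8        # assumed number of semantic classes
VOXEL_SIZE = 0.2       # meters, assumed resolution

class SemanticVoxelMap:
    def __init__(self):
        # voxel index (i, j, k) -> accumulated per-class evidence
        self.scores = {}

    def fuse(self, points_world, class_probs, ranges):
        """points_world: (N, 3) 3D points, class_probs: (N, NUM_CLASSES)
        softmax outputs back-projected to the points, ranges: (N,)
        sensor-to-point distances in meters."""
        weights = 1.0 / np.maximum(ranges, 1.0)   # assumed: trust nearby returns more
        keys = np.floor(points_world / VOXEL_SIZE).astype(int)
        for key, probs, w in zip(map(tuple, keys), class_probs, weights):
            acc = self.scores.setdefault(key, np.zeros(NUM_CLASSES))
            acc += w * probs                       # weighted evidence accumulation over time

    def semantic_class(self, key):
        # most likely class for a voxel after fusion over time
        return int(np.argmax(self.scores[key]))
```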
Related papers
- OmniUnet: A Multimodal Network for Unstructured Terrain Segmentation on Planetary Rovers Using RGB, Depth, and Thermal Imagery [0.5837061763460748]
This work presents OmniUnet, a transformer-based neural network architecture for semantic segmentation using RGB, depth, and thermal imagery. A custom multimodal sensor housing was developed using 3D printing and mounted on the Martian Rover Testbed for Autonomy to collect imagery, a subset of which was manually labeled to support supervised training of the network. Inference tests yielded an average prediction time of 673 ms on a resource-constrained computer.
arXiv Detail & Related papers (2025-08-01T12:23:29Z)
- Learning to Drive Anywhere with Model-Based Reannotation [49.80796496905606]
We develop a framework for generalizable visual navigation policies for robots, leveraging passively collected data, including crowd-sourced teleoperation data and unlabeled YouTube videos, which is relabeled via model-based reannotation. This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints.
arXiv Detail & Related papers (2025-05-08T18:43:39Z)
- GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes [5.289647064481469]
We present GraspClutter6D, a large-scale real-world grasping dataset featuring 1,000 cluttered scenes with dense arrangements.
We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments.
We validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments.
arXiv Detail & Related papers (2025-04-09T13:15:46Z)
- NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving [7.007334645975593]
We introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving.
We propose a novel paradigm that seamlessly combines instruction comprehension abilities of multi-modal LLMs with precise localization abilities of specialist detection models.
arXiv Detail & Related papers (2025-03-28T13:55:16Z)
- GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding [39.967352995143855]
GroundingSuite aims to bridge the gap between vision and language modalities.
It comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images.
arXiv Detail & Related papers (2025-03-13T17:43:10Z)
- COARSE: Collaborative Pseudo-Labeling with Coarse Real Labels for Off-Road Semantic Segmentation [49.267650162344765]
COARSE is a semi-supervised domain adaptation framework for off-road semantic segmentation.
We bridge domain gaps with complementary pixel-level and patch-level decoders, enhanced by a collaborative pseudo-labeling strategy.
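As a rough illustration of pseudo-labeling with coarse real labels (not COARSE's specific collaborative decoder scheme), a minimal PyTorch sketch; the ignore index and confidence threshold are assumptions:

```python
import torch

IGNORE_INDEX = 255          # assumed "unlabeled" value in the coarse masks
CONF_THRESHOLD = 0.9        # assumed pseudo-label confidence cutoff

def merge_coarse_and_pseudo(coarse_labels, logits):
    """coarse_labels: (B, H, W) sparse ground truth with IGNORE_INDEX holes,
    logits: (B, C, H, W) predictions from the current model or a peer decoder."""
    probs = torch.softmax(logits, dim=1)
    conf, pseudo = probs.max(dim=1)                       # (B, H, W) confidence and argmax class
    targets = coarse_labels.clone()
    fill = (coarse_labels == IGNORE_INDEX) & (conf > CONF_THRESHOLD)
    targets[fill] = pseudo[fill]                          # fill holes with confident predictions only
    return targets    # train with CrossEntropyLoss(ignore_index=IGNORE_INDEX)
```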
arXiv Detail & Related papers (2025-03-05T22:25:54Z)
- Neural Semantic Map-Learning for Autonomous Vehicles [85.8425492858912]
We present a mapping system that fuses local submaps gathered from a fleet of vehicles at a central instance to produce a coherent map of the road environment.
Our method jointly aligns and merges the noisy and incomplete local submaps using a scene-specific Neural Signed Distance Field.
We leverage memory-efficient sparse feature-grids to scale to large areas and introduce a confidence score to model uncertainty in scene reconstruction.
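A minimal sketch of confidence-weighted fusion of signed-distance observations of the same point from several submaps; this is only a hand-crafted stand-in, since the paper jointly aligns and merges submaps with a learned scene-specific neural SDF:

```python
import numpy as np

def merge_sdf_samples(sdf_values, confidences):
    """Confidence-weighted average of signed-distance observations of one
    query point from several overlapping submaps (illustrative stand-in)."""
    w = np.asarray(confidences, dtype=float)
    return float(np.sum(w * np.asarray(sdf_values, dtype=float)) / np.sum(w))
```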
arXiv Detail & Related papers (2024-10-10T10:10:03Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Fewshot learning on global multimodal embeddings for earth observation tasks [5.057850174013128]
We pretrain a CLIP/ViT based model using three different modalities of satellite imagery covering over 10% of Earth's total landmass.
We use the embeddings produced for each modality with a classical machine learning method to attempt different downstream tasks for earth observation.
We visually show that this embedding space, obtained with no labels, is sensitive to the different earth features represented by the labelled datasets we selected.
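A minimal sketch of the downstream protocol described above, assuming precomputed frozen embeddings and labels stored in hypothetical embeddings.npy / labels.npy files:

```python
# Illustrative only: the paper's embeddings and downstream tasks are not
# reproduced here. Assumes an (N, D) array of frozen CLIP/ViT embeddings
# and per-sample labels for some earth-observation task.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

embeddings = np.load("embeddings.npy")   # hypothetical precomputed features
labels = np.load("labels.npy")           # hypothetical downstream labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # classical ML head
print("downstream accuracy:", clf.score(X_test, y_test))
```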
arXiv Detail & Related papers (2023-09-29T20:15:52Z)
- Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D Segmentation (Navya3DSeg), with a diverse label space corresponding to a large-scale, production-grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z)
- Pyramid Fusion Transformer for Semantic Segmentation [44.57867861592341]
We propose the Pyramid Fusion Transformer (PFT), a transformer-based, per-mask approach to semantic segmentation with multi-scale features.
We achieve competitive performance on three widely used semantic segmentation datasets.
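PFT's architecture is not reproduced here, but a per-mask (mask-classification) head is typically converted to per-pixel semantics along these lines; the snippet follows the general paradigm rather than PFT specifically:

```python
import torch

def masks_to_semantic(class_logits, mask_logits):
    """class_logits: (Q, C+1) per-query class scores incl. a 'no object' slot,
    mask_logits: (Q, H, W) per-query binary mask logits."""
    class_probs = class_logits.softmax(dim=-1)[:, :-1]   # drop the 'no object' class
    mask_probs = mask_logits.sigmoid()
    # weight each query's mask by its class probabilities and sum over queries
    return torch.einsum("qc,qhw->chw", class_probs, mask_probs)  # (C, H, W) semantic map
```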
arXiv Detail & Related papers (2022-01-11T16:09:25Z)
- A Real-Time Online Learning Framework for Joint 3D Reconstruction and Semantic Segmentation of Indoor Scenes [87.74952229507096]
This paper presents a real-time online vision framework to jointly recover an indoor scene's 3D structure and semantic label.
Given noisy depth maps, a camera trajectory, and 2D semantic labels at train time, the proposed neural network learns to fuse the depth over frames with suitable semantic labels in the scene space.
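A hand-crafted stand-in for the idea of fusing noisy per-frame depth and semantic labels in scene space (the paper learns this fusion; the running-average voxel below is only illustrative):

```python
import numpy as np

class FusedVoxel:
    """Running average of signed-distance observations plus a semantic
    histogram -- a hand-crafted stand-in for the learned fusion."""
    def __init__(self, num_classes):
        self.sdf, self.weight = 0.0, 0.0
        self.label_counts = np.zeros(num_classes)

    def integrate(self, sdf_obs, label, w=1.0):
        # weighted running average of the depth-derived signed distance
        self.sdf = (self.sdf * self.weight + sdf_obs * w) / (self.weight + w)
        self.weight += w
        self.label_counts[label] += w      # accumulate semantic evidence per frame

    def label(self):
        return int(np.argmax(self.label_counts))
```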
arXiv Detail & Related papers (2021-08-11T14:29:01Z)
- EagerMOT: 3D Multi-Object Tracking via Sensor Fusion [68.8204255655161]
Multi-object tracking (MOT) enables mobile robots to perform well-informed motion planning and navigation by localizing surrounding objects in 3D space and time.
Existing methods rely on depth sensors (e.g., LiDAR) to detect and track targets in 3D space, but only up to a limited sensing range due to the sparsity of the signal.
We propose EagerMOT, a simple tracking formulation that integrates all available object observations from both sensor modalities to obtain a well-informed interpretation of the scene dynamics.
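EagerMOT's two-stage association is more involved, but the core step of greedily matching fused detections to existing tracks can be sketched as follows (the 2.0 m gating threshold is an assumption):

```python
import numpy as np

def greedy_associate(track_centers, det_centers, max_dist=2.0):
    """track_centers: (T, 3), det_centers: (D, 3) object centers in meters.
    Returns (track, detection) index pairs; max_dist is an assumed gate."""
    if len(track_centers) == 0 or len(det_centers) == 0:
        return []
    dists = np.linalg.norm(track_centers[:, None, :] - det_centers[None, :, :], axis=-1)
    pairs, used_t, used_d = [], set(), set()
    for t, d in sorted(np.ndindex(dists.shape), key=lambda idx: dists[idx]):
        if dists[t, d] > max_dist:
            break                              # remaining pairs are even farther apart
        if t not in used_t and d not in used_d:
            pairs.append((t, d))
            used_t.add(t)
            used_d.add(d)
    return pairs
```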
arXiv Detail & Related papers (2021-04-29T22:30:29Z)
- GANav: Group-wise Attention Network for Classifying Navigable Regions in Unstructured Outdoor Environments [54.21959527308051]
We present a new learning-based method for identifying safe and navigable regions in off-road terrains and unstructured environments from RGB images.
Our approach consists of classifying groups of terrain classes based on their navigability levels using coarse-grained semantic segmentation.
We show through extensive evaluations on the RUGD and RELLIS-3D datasets that our learning algorithm improves the accuracy of visual perception in off-road terrains for navigation.
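A minimal sketch of group-wise label remapping from fine-grained terrain classes to navigability levels; the class names and grouping below are made up for illustration, and GANav defines its own groups for RUGD and RELLIS-3D:

```python
import numpy as np

# Hypothetical grouping of fine-grained terrain classes into navigability levels.
GROUPS = {
    0: ["asphalt", "gravel", "dirt"],        # smooth / navigable
    1: ["grass", "mulch"],                   # rough but traversable
    2: ["bush", "log", "rock"],              # obstacles
    3: ["water", "sky", "void"],             # forbidden / non-ground
}
CLASS_NAMES = ["asphalt", "gravel", "dirt", "grass", "mulch",
               "bush", "log", "rock", "water", "sky", "void"]
FINE_TO_GROUP = np.array([g for name in CLASS_NAMES
                          for g, members in GROUPS.items() if name in members])

def to_navigability(fine_label_map):
    """Remap a (H, W) fine-grained label map to coarse navigability levels."""
    return FINE_TO_GROUP[fine_label_map]
```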
arXiv Detail & Related papers (2021-03-07T02:16:24Z)
- RELLIS-3D Dataset: Data, Benchmarks and Analysis [16.803548871633957]
RELLIS-3D is a multimodal dataset collected in an off-road environment.
The data was collected on the Rellis Campus of Texas A&M University.
arXiv Detail & Related papers (2020-11-17T18:28:01Z)
- LiDAR guided Small obstacle Segmentation [14.880698940693609]
Small obstacles on the road are a critical hazard for autonomous driving. We present a method to reliably detect such obstacles through a multi-modal framework combining sparse LiDAR and monocular vision.
We show significant performance gains when the context is fed as an additional input to monocular semantic segmentation frameworks.
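A minimal sketch of feeding sparse LiDAR context as an additional input channel to a monocular segmentation network; the toy two-layer network and channel layout are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RGBDSegNet(nn.Module):
    """Toy stand-in for a segmentation network that takes RGB plus a
    projected sparse LiDAR channel as a 4-channel input."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, rgb, lidar_depth):
        """rgb: (B, 3, H, W); lidar_depth: (B, 1, H, W) sparse projected ranges
        (zeros where no LiDAR return falls in the pixel)."""
        x = torch.cat([rgb, lidar_depth], dim=1)  # 4-channel multimodal input
        return self.net(x)                        # per-pixel obstacle logits
```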
arXiv Detail & Related papers (2020-03-12T18:34:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.