VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving
- URL: http://arxiv.org/abs/2507.20397v1
- Date: Sun, 27 Jul 2025 19:39:29 GMT
- Title: VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving
- Authors: Levente Tempfli, Esteban Rivera, Markus Lienkamp,
- Abstract summary: We introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On the nuScenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to refine detection quality directly in the point cloud domain. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On the nuScenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection, demonstrating strong performance in scalable 3D scene understanding. Code will be available upon acceptance.
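The abstract describes a two-stage idea: LiDAR supplies geometry (clusters, boxes), a VLM supplies open-vocabulary semantics. A minimal sketch of that fusion step is shown below; the helper names (`fit_box`, `query_vlm`) and the axis-aligned box fit are hypothetical illustrations, not the paper's implementation.

```python
# Hedged sketch of a VESPA-style autolabeling step: geometric clusters from
# LiDAR are fused with open-vocabulary labels from a vision-language model.
# `query_vlm` is a hypothetical callback (e.g. classify the camera crop around
# the projected cluster); the real pipeline fits oriented, refined boxes.
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    center: tuple   # (x, y, z) box center in the LiDAR frame
    size: tuple     # (l, w, h) box extents
    category: str   # open-vocabulary class name from the VLM
    score: float    # labeling confidence


def fit_box(points):
    """Axis-aligned 3D box from a point cluster (a simplification)."""
    xs, ys, zs = zip(*points)
    center = (sum(xs) / len(xs), sum(ys) / len(ys), sum(zs) / len(zs))
    size = (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))
    return center, size


def autolabel(clusters, query_vlm):
    """Fuse geometric clusters with VLM semantics into 3D pseudolabels."""
    labels = []
    for pts in clusters:
        center, size = fit_box(pts)          # geometry from LiDAR
        category, score = query_vlm(pts)     # semantics from the VLM
        labels.append(PseudoLabel(center, size, category, score))
    return labels
```

The point of the sketch is the division of labor: no ground-truth annotation enters the loop, so label quality rests entirely on cluster quality and VLM confidence.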
Related papers
- Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection [16.09503890891102]
We propose an unsupervised 3D detection approach that operates exclusively on LiDAR point clouds.
We exploit the inherent spatio-temporal knowledge of LiDAR point clouds for clustering, tracking, as well as box and label refinement.
Our approach outperforms state-of-the-art unsupervised 3D object detectors on the Waymo Open Dataset.
arXiv Detail & Related papers (2024-08-07T14:14:53Z) - VRSO: Visual-Centric Reconstruction for Static Object Annotation [21.70421057949981]
This paper introduces VRSO, a visual-centric approach for static object annotation.
VRSO is distinguished in low cost, high efficiency, and high quality.
It recovers static objects in 3D space with only camera images as input.
arXiv Detail & Related papers (2024-03-22T08:16:59Z) - Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments [67.83787474506073]
We tackle the limitations of current LiDAR-based 3D object detection systems.
We introduce a universal Find n' Propagate approach for 3D OV tasks.
We achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes.
arXiv Detail & Related papers (2024-03-20T12:51:30Z) - MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection [59.1417156002086]
MixSup is a more practical paradigm simultaneously utilizing massive cheap coarse labels and a limited number of accurate labels for Mixed-grained Supervision.
MixSup achieves up to 97.31% of fully supervised performance, using cheap cluster annotations and only 10% box annotations.
arXiv Detail & Related papers (2024-01-29T17:05:19Z) - Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving [39.70689418558153]
We present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels.
Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants.
arXiv Detail & Related papers (2023-09-25T19:33:52Z) - BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder learning feature representation.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z) - UpCycling: Semi-supervised 3D Object Detection without Sharing Raw-level Unlabeled Scenes [7.32610370107512]
UpCycling is a novel SSL framework for 3D object detection with zero additional raw-level point cloud.
We introduce hybrid pseudo labels: feature-level Ground Truth sampling (F-GT) and Rotation (F-RoT).
UpCycling significantly outperforms the state-of-the-art SSL methods that utilize raw-point scenes.
arXiv Detail & Related papers (2022-11-22T02:04:09Z) - LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds [62.49198183539889]
We propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds.
Our method co-designs an efficient labeling process with semi/weakly supervised learning.
Our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.
arXiv Detail & Related papers (2022-10-14T19:13:36Z) - GLENet: Boosting 3D Object Detectors with Generative Label Uncertainty Estimation [70.75100533512021]
In this paper, we formulate the label uncertainty problem as the diversity of potentially plausible bounding boxes of objects.
We propose GLENet, a generative framework adapted from conditional variational autoencoders, to model the one-to-many relationship between a typical 3D object and its potential ground-truth bounding boxes with latent variables.
The label uncertainty generated by GLENet is a plug-and-play module and can be conveniently integrated into existing deep 3D detectors.
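GLENet's core idea, one object mapping to many plausible ground-truth boxes via a latent variable, can be caricatured without the CVAE itself: sample perturbed boxes from a latent and use their per-dimension variance as the plug-in uncertainty score. The `decode` function below is a stand-in for a learned decoder, not GLENet.

```python
# Toy sketch of sampling-based label uncertainty in the spirit of GLENet:
# draw several plausible boxes per object from a latent variable and take
# their per-dimension variance as an uncertainty estimate.
import random


def sample_boxes(base_box, decode, n=100, seed=0):
    """Sample n plausible boxes by decoding Gaussian latents (decode is a
    stand-in for a learned conditional decoder)."""
    rng = random.Random(seed)
    return [decode(base_box, rng.gauss(0.0, 1.0)) for _ in range(n)]


def label_uncertainty(boxes):
    """Per-dimension variance over the sampled boxes: a simple plug-in
    uncertainty score for a downstream detector."""
    uncertainties = []
    for dim_values in zip(*boxes):
        mean = sum(dim_values) / len(dim_values)
        uncertainties.append(sum((v - mean) ** 2 for v in dim_values) / len(dim_values))
    return uncertainties
```

A deterministic object would yield near-zero variance; ambiguous objects (sparse, occluded) would spread their samples and score high, which is exactly the signal the plug-and-play module feeds to the detector.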
arXiv Detail & Related papers (2022-07-06T06:26:17Z) - Exploring Diversity-based Active Learning for 3D Object Detection in Autonomous Driving [45.405303803618]
We investigate diversity-based active learning (AL) as a potential solution to alleviate the annotation burden.
We propose a novel acquisition function that enforces spatial and temporal diversity in the selected samples.
We demonstrate the effectiveness of the proposed method on the nuScenes dataset and show that it outperforms existing AL strategies significantly.
arXiv Detail & Related papers (2022-05-16T14:21:30Z) - Unsupervised Object Detection with LiDAR Clues [70.73881791310495]
We present the first practical method for unsupervised object detection with the aid of LiDAR clues.
In our approach, candidate object segments based on 3D point clouds are first generated.
Then, an iterative segment labeling process is conducted to assign segment labels and to train a segment labeling network.
The labeling process is carefully designed to mitigate the issue of long-tailed and open-ended distributions.
arXiv Detail & Related papers (2020-11-25T18:59:54Z)
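The iterative labeling loop described in the last entry follows a familiar self-training shape: keep only confident segment labels, retrain the labeler on them, and repeat so labels propagate to harder segments. A minimal sketch, with `classify` and `train` as hypothetical callbacks rather than the authors' network:

```python
# Hedged sketch of an iterative segment-labeling loop (self-training
# caricature, not the paper's implementation): confident pseudo-labels are
# kept, the labeler is retrained on them, and the loop repeats.

def iterative_labeling(segments, classify, train, rounds=3, threshold=0.8):
    """Return {segment_index: class} after `rounds` of label-and-retrain."""
    labels = {}
    for _ in range(rounds):
        for i, seg in enumerate(segments):
            cls, conf = classify(seg)
            if conf >= threshold:
                labels[i] = cls          # keep only confident pseudo-labels
        classify = train(labels, segments)  # retrain on current pseudo-labels
    return labels
```

The confidence threshold is what the paper's careful design replaces: a naive cutoff lets head classes dominate, which is precisely the long-tailed failure mode the labeling process is built to mitigate.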
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.