LOSC: LiDAR Open-voc Segmentation Consolidator
- URL: http://arxiv.org/abs/2507.07605v1
- Date: Thu, 10 Jul 2025 10:10:13 GMT
- Title: LOSC: LiDAR Open-voc Segmentation Consolidator
- Authors: Nermin Samet, Gilles Puy, Renaud Marlet
- Abstract summary: We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings.
- Score: 15.046470253884694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.
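For concreteness, here is a minimal sketch (NumPy/SciPy, not the authors' code) of the two ingredients the abstract names: back-projecting per-pixel VLM predictions onto lidar points through a pinhole camera model, and a generic k-nearest-neighbour majority vote standing in for label consolidation. Function names, array shapes, the single-camera setup, and the vote-based consolidation are illustrative assumptions; LOSC's actual consolidation additionally enforces spatio-temporal consistency across scans and robustness to image-level augmentations before the 3D network is trained.

```python
# Illustrative sketch only: classic 2D-to-3D label back-projection plus a
# generic spatial label-smoothing step. Not the LOSC implementation.
import numpy as np
from scipy.spatial import cKDTree

def backproject_labels(points_lidar, seg_map, K, T_cam_from_lidar, ignore_label=-1):
    """Assign each lidar point the class predicted at its projected pixel.

    points_lidar: (N, 3) points in the lidar frame.
    seg_map: (H, W) integer class ids from an image VLM (e.g. an open-voc segmenter).
    K: (3, 3) camera intrinsics; T_cam_from_lidar: (4, 4) lidar-to-camera transform.
    Points behind the camera or outside the image keep ignore_label.
    """
    H, W = seg_map.shape
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]        # lidar -> camera frame

    labels = np.full(len(points_lidar), ignore_label, dtype=np.int64)
    in_front = pts_cam[:, 2] > 0.1                          # keep points in front of the camera

    uvw = (K @ pts_cam[in_front].T).T                       # pinhole projection
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)

    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = seg_map[uv[inside, 1], uv[inside, 0]]     # noisy, sparse point labels
    return labels

def consolidate_by_knn_vote(points, labels, k=9, ignore_label=-1):
    """Generic consolidation stand-in: majority vote over the k nearest
    neighbours, ignoring unlabeled points. Conveys only the spatial-smoothing idea."""
    _, nbrs = cKDTree(points).query(points, k=k)            # (N, k) neighbour indices
    out = labels.copy()
    for i, nb in enumerate(nbrs):
        votes = labels[nb]
        votes = votes[votes != ignore_label]
        if votes.size:
            vals, counts = np.unique(votes, return_counts=True)
            out[i] = vals[np.argmax(counts)]
    return out
```

The refined per-point labels would then serve as training targets for a standard 3D segmentation network, as described in the abstract.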
Related papers
- PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum [20.206273757144547]
PGOV3D is a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex.
arXiv Detail & Related papers (2025-06-30T08:13:07Z) - Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding [87.68271178167373]
We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. S4Token is a tokenization pipeline that produces semantically-informed tokens regardless of scene scale.
arXiv Detail & Related papers (2025-05-24T18:26:30Z) - Label-Efficient LiDAR Panoptic Segmentation [22.440065488051047]
We address Limited-Label LiDAR Panoptic Segmentation (L3PS). We develop a label-efficient 2D network to generate panoptic pseudo-labels from annotated images. We then introduce a novel 3D refinement module that capitalizes on the geometric properties of point clouds.
arXiv Detail & Related papers (2025-03-04T07:58:15Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation [20.7179907935644]
3D-AVS is a method for Auto-Vocabulary Segmentation of 3D point clouds, for which the vocabulary is unknown and auto-generated for each input at runtime. 3D-AVS first recognizes semantic entities from image or point cloud data and then segments all points with the automatically generated vocabulary. Our method incorporates both image-based and point-based recognition, enhancing robustness under challenging lighting conditions.
arXiv Detail & Related papers (2024-06-13T13:59:47Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - Lowis3D: Language-Driven Open-World Instance-Level 3D Scene
Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z) - Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning [59.64695628433855]
We propose a novel cross-modality weakly supervised method for 3D segmentation, incorporating complementary information from unlabeled images.
Basically, we design a dual-branch network equipped with an active labeling strategy, to maximize the power of tiny parts of labels.
Our method even outperforms the state-of-the-art fully supervised competitors with less than 1% actively selected annotations.
arXiv Detail & Related papers (2022-09-16T07:59:04Z) - Box2Seg: Learning Semantics of 3D Point Clouds with Box-Level Supervision [65.19589997822155]
We introduce a neural architecture, termed Box2Seg, to learn point-level semantics of 3D point clouds with bounding box-level supervision.
We show that the proposed network can be trained with cheap, or even off-the-shelf bounding box-level annotations and subcloud-level tags.
arXiv Detail & Related papers (2022-01-09T09:07:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.