Towards Label-free Scene Understanding by Vision Foundation Models
- URL: http://arxiv.org/abs/2306.03899v2
- Date: Mon, 30 Oct 2023 15:27:57 GMT
- Title: Towards Label-free Scene Understanding by Vision Foundation Models
- Authors: Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu,
Yuexin Ma, Tongliang Liu, Wenping Wang
- Abstract summary: We investigate the potential of vision foundation models in enabling networks to comprehend 2D and 3D worlds without labelled data.
We propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously.
Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improving by 4.7% and 7.9%, respectively.
- Score: 87.13117617056004
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision foundation models such as Contrastive Language-Image
Pre-training (CLIP) and the Segment Anything Model (SAM) have demonstrated impressive zero-shot
performance on image classification and segmentation tasks. However, the
incorporation of CLIP and SAM for label-free scene understanding has yet to be
explored. In this paper, we investigate the potential of vision foundation
models in enabling networks to comprehend 2D and 3D worlds without labelled
data. The primary challenge lies in effectively supervising networks under
extremely noisy pseudo labels, which are generated by CLIP and further
exacerbated during the propagation from the 2D to the 3D domain. To tackle
these challenges, we propose a novel Cross-modality Noisy Supervision (CNS)
method that leverages the strengths of CLIP and SAM to supervise 2D and 3D
networks simultaneously. In particular, we introduce a prediction consistency
regularization to co-train 2D and 3D networks, and then further impose the
networks' latent space consistency using SAM's robust feature
representation. Experiments conducted on diverse indoor and outdoor datasets
demonstrate the superior performance of our method in understanding 2D and 3D
open environments. Our 2D and 3D networks achieve label-free semantic
segmentation with 28.4% and 33.5% mIoU on ScanNet, improving by 4.7% and 7.9%,
respectively. On the nuImages and nuScenes datasets, the performance is 22.1%
and 26.8%, with improvements of 3.5% and 6.0%, respectively. Code is available
at https://github.com/runnanchen/Label-Free-Scene-Understanding.
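For intuition only, below is a minimal sketch of the two consistency terms the abstract describes: a prediction consistency loss between paired 2D pixels and 3D points, and a latent-space consistency loss pooled over SAM regions. It is written in PyTorch with hypothetical tensor names (logits_2d, logits_3d, feat_2d, feat_3d, mask_ids); the actual CNS implementation in the linked repository may differ.

```python
import torch
import torch.nn.functional as F

def prediction_consistency(logits_2d, logits_3d):
    """Symmetric KL divergence between paired 2D and 3D predictions.

    logits_2d, logits_3d: (N, C) class logits for N pixel-point pairs
    obtained from camera-LiDAR calibration, C semantic classes.
    """
    p2d = F.log_softmax(logits_2d, dim=1)
    p3d = F.log_softmax(logits_3d, dim=1)
    kl_2d_3d = F.kl_div(p2d, p3d.exp(), reduction="batchmean")
    kl_3d_2d = F.kl_div(p3d, p2d.exp(), reduction="batchmean")
    return 0.5 * (kl_2d_3d + kl_3d_2d)

def latent_consistency(feat_2d, feat_3d, mask_ids):
    """Pull together 2D and 3D features pooled inside each SAM region.

    feat_2d, feat_3d: (N, D) features for the same N pixel-point pairs.
    mask_ids: (N,) integer id of the SAM mask each pair falls into.
    """
    loss, count = feat_2d.new_zeros(()), 0
    for m in mask_ids.unique():
        sel = mask_ids == m
        f2 = F.normalize(feat_2d[sel].mean(0), dim=0)
        f3 = F.normalize(feat_3d[sel].mean(0), dim=0)
        loss = loss + (1.0 - torch.dot(f2, f3))  # cosine distance per region
        count += 1
    return loss / max(count, 1)

# Toy usage with random tensors standing in for network outputs.
N, C, D = 1024, 20, 64
l2d, l3d = torch.randn(N, C), torch.randn(N, C)
f2d, f3d = torch.randn(N, D), torch.randn(N, D)
masks = torch.randint(0, 32, (N,))
total = prediction_consistency(l2d, l3d) + latent_consistency(f2d, f3d, masks)
```

In this sketch, the latent term compares region-averaged features grouped by SAM masks rather than individual correspondences, which is one way to dampen the effect of single noisy pixels.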
Related papers
- Bayesian Self-Training for Semi-Supervised 3D Segmentation [59.544558398992386]
3D segmentation is a core problem in computer vision, but densely labeling 3D point clouds for fully-supervised training remains too labor-intensive and expensive.
Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set.
arXiv Detail & Related papers (2024-09-12T14:54:31Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
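The entry above fuses the 2D predictions of several vision foundation models into 3D pseudo-labels by voting. As a rough illustration only (not the paper's actual fusion code), a majority vote over per-model point label proposals can be written as follows; the array names are hypothetical.

```python
import numpy as np

def vote_fuse(point_labels):
    """Majority-vote fusion of per-model point label proposals.

    point_labels: (M, N) array; row m holds the class id the m-th vision
    model assigns to each of the N 3D points (e.g. read off the model's
    2D mask after projecting the points into the image).
    Returns (N,) fused pseudo-labels; ties go to the smallest class id.
    """
    num_models, num_points = point_labels.shape
    num_classes = point_labels.max() + 1
    votes = np.zeros((num_points, num_classes), dtype=np.int64)
    for m in range(num_models):
        votes[np.arange(num_points), point_labels[m]] += 1
    return votes.argmax(axis=1)

# Toy example: three models, five points.
proposals = np.array([[0, 1, 2, 2, 3],
                      [0, 1, 1, 2, 3],
                      [1, 1, 2, 0, 3]])
print(vote_fuse(proposals))  # -> [0 1 2 2 3]
```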
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z)
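The CLIP2Scene entry above does not spell out its transfer mechanism, so the following is only a generic illustration of how CLIP text embeddings can provide zero-shot semantic labels for features that have been aligned to CLIP's embedding space; point_feats, text_embeds, and the temperature value are hypothetical stand-ins, not CLIP2Scene's actual pipeline.

```python
import torch
import torch.nn.functional as F

def zero_shot_labels(point_feats, text_embeds, temperature=0.01):
    """Assign each point a class by similarity to CLIP text embeddings.

    point_feats: (N, D) point features assumed to live in (or be
    projected into) CLIP's embedding space.
    text_embeds: (C, D) CLIP text embeddings of prompts such as
    "a photo of a {class name}", one per class.
    Returns (N,) predicted class ids and (N, C) soft scores.
    """
    p = F.normalize(point_feats, dim=1)
    t = F.normalize(text_embeds, dim=1)
    logits = p @ t.t() / temperature       # scaled cosine similarity
    probs = logits.softmax(dim=1)
    return probs.argmax(dim=1), probs

# Toy usage with random stand-ins for real point/text features.
N, C, D = 2048, 20, 512
points = torch.randn(N, D)
texts = torch.randn(C, D)    # in practice: a CLIP text encoder on prompts
labels, scores = zero_shot_labels(points, texts)
```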
- Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning [59.64695628433855]
We propose a novel cross-modality weakly supervised method for 3D segmentation, incorporating complementary information from unlabeled images.
Basically, we design a dual-branch network equipped with an active labeling strategy to maximize the power of a tiny fraction of labels.
Our method even outperforms the state-of-the-art fully supervised competitors with less than 1% actively selected annotations.
arXiv Detail & Related papers (2022-09-16T07:59:04Z)
- Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize 3D voxelization and 3D convolution networks.
We propose a new framework for outdoor LiDAR segmentation, in which cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern.
arXiv Detail & Related papers (2021-09-12T06:25:11Z)
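The entry above partitions LiDAR points on a cylindrical grid rather than a Cartesian one. A minimal sketch of such a partition is given below; the grid resolution and range limits are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def cylindrical_voxel_indices(points, grid=(480, 360, 32),
                              rho_range=(0.0, 50.0), z_range=(-4.0, 2.0)):
    """Map Cartesian LiDAR points to cylindrical voxel indices.

    points: (N, 3) array of (x, y, z) coordinates in metres.
    grid:   number of bins along (radius, angle, height); illustrative only.
    Returns an (N, 3) integer array of (rho_bin, phi_bin, z_bin) indices.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)                # distance from sensor axis
    phi = np.arctan2(y, x)                        # azimuth in [-pi, pi)
    rho_bin = (rho - rho_range[0]) / (rho_range[1] - rho_range[0]) * grid[0]
    phi_bin = (phi + np.pi) / (2 * np.pi) * grid[1]
    z_bin = (z - z_range[0]) / (z_range[1] - z_range[0]) * grid[2]
    idx = np.stack([rho_bin, phi_bin, z_bin], axis=1).astype(np.int64)
    return np.clip(idx, 0, np.array(grid) - 1)    # clamp out-of-range points

# Toy usage: five random points in a 50 m x 50 m area.
pts = np.random.uniform([-50, -50, -4], [50, 50, 2], size=(5, 3))
print(cylindrical_voxel_indices(pts))
```

Binning by radius and azimuth keeps the number of points per voxel more balanced than uniform Cartesian voxels, since LiDAR returns thin out with distance.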
- S3CNet: A Sparse Semantic Scene Completion Network for LiDAR Point Clouds [0.16799377888527683]
We present S3CNet, a sparse convolution based neural network that predicts the semantically completed scene from a single, unified LiDAR point cloud.
We show that our proposed method outperforms all counterparts on the 3D task, achieving state-of-the-art results on the SemanticKITTI benchmark.
arXiv Detail & Related papers (2020-12-16T20:14:41Z)
- Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes [36.07733308424772]
The deficiency of 3D segmentation labels is one of the main obstacles to effective point cloud segmentation.
We propose a novel deep graph convolutional network-based framework for large-scale semantic scene segmentation in point clouds with only 2D supervision.
arXiv Detail & Related papers (2020-04-26T23:02:23Z)