Few-Shot Panoptic Segmentation With Foundation Models
- URL: http://arxiv.org/abs/2309.10726v3
- Date: Fri, 1 Mar 2024 13:48:34 GMT
- Title: Few-Shot Panoptic Segmentation With Foundation Models
- Authors: Markus Käppeler, Kürsat Petek, Niclas Vödisch, Wolfram Burgard, Abhinav Valada
- Abstract summary: We propose to leverage task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO).
In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation.
We show that our approach, despite being trained on only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method.
- Score: 23.231014713335664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art methods for panoptic segmentation require an
immense amount of annotated training data that is both arduous and expensive to
obtain, posing a significant challenge for their widespread adoption. Concurrently,
recent breakthroughs in visual representation learning have sparked a paradigm
shift leading to the advent of large foundation models that can be trained with
completely unlabeled images. In this work, we propose to leverage such
task-agnostic image features to enable few-shot panoptic segmentation by
presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In
detail, our method combines a DINOv2 backbone with lightweight network heads
for semantic segmentation and boundary estimation. We show that our approach,
despite being trained on only ten annotated images, predicts high-quality
pseudo-labels that can be used with any existing panoptic segmentation method.
Notably, we demonstrate that SPINO achieves competitive results compared to
fully supervised baselines while using less than 0.3% of the ground truth
labels, paving the way for learning complex visual recognition tasks leveraging
foundation models. To illustrate its general applicability, we further deploy
SPINO on real-world robotic vision systems for both outdoor and indoor
environments. To foster future research, we make the code and trained models
publicly available at http://spino.cs.uni-freiburg.de.
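To make the described architecture concrete, the following is a minimal sketch, not the authors' released code, of a frozen DINOv2 backbone paired with two lightweight heads for semantic segmentation and boundary estimation. It assumes the official DINOv2 torch.hub entry point; the head designs, embedding dimension, and upsampling choices are illustrative assumptions.

```python
# Minimal sketch of a SPINO-style architecture (illustrative, not the released code):
# a frozen DINOv2 ViT-S/14 backbone with two lightweight heads for semantic
# segmentation and boundary estimation. Assumes input H and W are multiples of 14.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FewShotPanopticHeads(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 384, patch_size: int = 14):
        super().__init__()
        self.patch_size = patch_size
        # Task-agnostic foundation-model features; kept frozen during training.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Lightweight heads trained on only a handful of annotated images.
        self.semantic_head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)
        self.boundary_head = nn.Conv2d(embed_dim, 1, kernel_size=1)

    def forward(self, images: torch.Tensor):
        b, _, h, w = images.shape
        gh, gw = h // self.patch_size, w // self.patch_size
        with torch.no_grad():
            # (B, N_patches, C) patch tokens from the frozen backbone.
            tokens = self.backbone.forward_features(images)["x_norm_patchtokens"]
        feats = tokens.permute(0, 2, 1).reshape(b, -1, gh, gw)
        semantics = F.interpolate(self.semantic_head(feats), size=(h, w),
                                  mode="bilinear", align_corners=False)
        boundaries = torch.sigmoid(F.interpolate(self.boundary_head(feats), size=(h, w),
                                                 mode="bilinear", align_corners=False))
        return semantics, boundaries
```

In the paper, the outputs of such heads are fused into panoptic pseudo-labels that can then be used to train any existing panoptic segmentation model; that fusion step is omitted from this sketch.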
Related papers
- Freestyle Sketch-in-the-Loop Image Segmentation [116.1810651297801]
We introduce a "sketch-in-the-loop" image segmentation framework, enabling the segmentation of visual concepts partially, completely, or in groupings.
This framework capitalises on the synergy between sketch-based image retrieval models and large-scale pre-trained models.
Our purpose-made augmentation strategy enhances the versatility of our sketch-guided mask generation, allowing segmentation at multiple levels.
arXiv Detail & Related papers (2025-01-27T13:07:51Z) - Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously.
We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z) - A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation [22.440065488051047]
A key challenge for the widespread application of learning-based models in robotic perception is to significantly reduce the required amount of annotated training data.
We exploit the groundwork laid by visual foundation models to train two lightweight network heads for semantic segmentation and object boundary detection.
We demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations.
arXiv Detail & Related papers (2024-05-29T12:23:29Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for 3D scene understanding when labeled scenes are scarce.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z) - Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z) - Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning [59.64695628433855]
We propose a novel cross-modality weakly supervised method for 3D segmentation, incorporating complementary information from unlabeled images.
We design a dual-branch network equipped with an active labeling strategy to make the most of a tiny fraction of labels.
Our method even outperforms the state-of-the-art fully supervised competitors with less than 1% actively selected annotations.
arXiv Detail & Related papers (2022-09-16T07:59:04Z) - A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
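Referenced from the RangeFormer entry above: the sketch below shows a generic spherical ("range view") projection of a LiDAR point cloud, illustrating the "many-to-one" mapping, since several 3D points can fall into the same pixel and only the nearest return is kept. This is not RangeFormer's implementation; the image resolution and vertical field of view are assumed values typical of a 64-beam sensor.

```python
# Minimal sketch (not RangeFormer itself): spherical "range view" projection of a
# LiDAR point cloud onto a 2D grid. Several 3D points can map to the same pixel
# (many-to-one); the nearest return per pixel is kept.
import numpy as np


def range_view_projection(points: np.ndarray, h: int = 64, w: int = 2048,
                          fov_up_deg: float = 3.0, fov_down_deg: float = -25.0) -> np.ndarray:
    """points: (N, 3) array of x, y, z coordinates. Returns an (h, w) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)

    yaw = np.arctan2(y, x)                                            # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-8), -1.0, 1.0))

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)
    fov = fov_up - fov_down

    # Map angles to pixel coordinates.
    u = 0.5 * (1.0 - yaw / np.pi) * w                                 # column index
    v = (1.0 - (pitch - fov_down) / fov) * h                          # row index
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    # Write farthest points first so that nearer points overwrite them.
    order = np.argsort(depth)[::-1]
    range_image = np.full((h, w), -1.0, dtype=np.float32)
    range_image[v[order], u[order]] = depth[order]
    return range_image
```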
This list is automatically generated from the titles and abstracts of the papers on this site.