ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
- URL: http://arxiv.org/abs/2412.00153v2
- Date: Wed, 04 Dec 2024 12:40:30 GMT
- Title: ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
- Authors: Kunyang Han, Yibo Hu, Mengxue Qu, Hailin Shi, Yao Zhao, Yunchao Wei
- Abstract summary: We propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation.
Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously.
- Score: 75.750699619993
- Abstract: Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, limiting free-form category self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm, integrating the prediction from the previous step with a textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.
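Since the authors' code is not yet released, the following is only a minimal PyTorch sketch of the patch-as-ROI idea; every module name, dimension, and design choice here is a hypothetical illustration, not ROSE's actual architecture. Each patch embedding from the vision encoder is turned into a mask query, so the model produces one candidate mask per patch and dense coverage follows naturally:

```python
# Hypothetical sketch: treat every image patch as an ROI candidate by
# deriving one mask query per patch token. Not the official ROSE code.
import torch
import torch.nn as nn

class PatchWiseMaskHead(nn.Module):
    """One mask query per image patch: every patch is an ROI candidate."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.patch_to_query = nn.Linear(embed_dim, embed_dim)
        self.mask_mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patch_tokens, pixel_features):
        # patch_tokens:   (B, N, C) patch embeddings from the vision encoder
        # pixel_features: (B, C, H, W) higher-resolution features for decoding
        queries = self.patch_to_query(patch_tokens)  # (B, N, C)
        mask_embed = self.mask_mlp(queries)          # (B, N, C)
        # Dot product of each patch query with every pixel: one mask per
        # patch, so dense prediction emerges from per-patch queries.
        return torch.einsum("bnc,bchw->bnhw", mask_embed, pixel_features)

B, N, C, H, W = 2, 24 * 24, 256, 96, 96
head = PatchWiseMaskHead(embed_dim=C)
masks = head(torch.randn(B, N, C), torch.randn(B, C, H, W))
print(masks.shape)  # torch.Size([2, 576, 96, 96])
```

In a full system, near-duplicate masks from patches covering the same object would presumably be merged (e.g., by mask-level non-maximum suppression), and the category for each surviving region would come from the LMM's free-form text generation rather than a fixed label set, matching the open-category goal described in the abstract.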
Related papers
- Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation [42.020470627552136]
Open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation.
We propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation.
FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly injecting semantic information early in the visual encoding process.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
- FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation [47.0028071183214]
FrozenSeg is designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) with semantic knowledge extracted from a ViL model (e.g., CLIP); a rough sketch of this shared mask-classification pattern appears after the list below.
FrozenSeg advances state-of-the-art results across various segmentation benchmarks while being trained exclusively on COCO panoptic data and evaluated in a zero-shot manner.
arXiv Detail & Related papers (2024-09-05T13:36:50Z)
- MaskUno: Switch-Split Block For Enhancing Instance Segmentation [0.0]
We propose replacing mask prediction with a Switch-Split block that processes refined ROIs, classifies them, and assigns them to specialized mask predictors.
An increase in the mean Average Precision (mAP) of 2.03% was observed for the high-performing DetectoRS when trained on 80 classes.
arXiv Detail & Related papers (2024-07-31T10:12:14Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories among a small candidate set, but scale poorly when the category space is large.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Open-vocabulary Panoptic Segmentation with Embedding Modulation [71.15502078615587]
Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world.
Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results.
We propose OPSNet, an omnipotent and data-efficient framework for open-vocabulary panoptic segmentation.
arXiv Detail & Related papers (2023-03-20T17:58:48Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP on downstream tasks undesirably degrades out-of-distribution (OOD) performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- EGMM: an Evidential Version of the Gaussian Mixture Model for Clustering [22.586481334904793]
We propose a new model-based clustering algorithm, called EGMM (evidential GMM), in the theoretical framework of belief functions.
The parameters in EGMM are estimated by a specially designed Expectation-Maximization (EM) algorithm.
The proposed EGMM is as simple as the classical GMM, but can generate a more informative evidential partition for the considered dataset.
arXiv Detail & Related papers (2020-10-03T11:59:07Z)
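Several entries above (FrozenSeg, PosSAM, OPSNet) share a common open-vocabulary pattern: a class-agnostic segmenter proposes masks, and CLIP-aligned features pooled inside each mask are matched against text embeddings of candidate category names. As a rough illustration of that shared pattern (not any single paper's method), here is a minimal PyTorch sketch; all tensors are random stand-ins for real CLIP and SAM outputs:

```python
# Generic sketch of CLIP-style open-vocabulary mask classification:
# class-agnostic masks are labeled by matching mask-pooled image features
# against text embeddings of candidate categories. Random stand-in tensors;
# a real pipeline would use CLIP encoders and a mask generator such as SAM.
import torch
import torch.nn.functional as F

def classify_masks(pixel_feats, masks, text_embeds, temperature=0.01):
    # pixel_feats: (C, H, W) dense, CLIP-aligned image features
    # masks:       (M, H, W) binary masks from a class-agnostic segmenter
    # text_embeds: (K, C) L2-normalized embeddings of K category prompts
    m = masks.flatten(1).float()                       # (M, H*W)
    feats = pixel_feats.flatten(1)                     # (C, H*W)
    # Average-pool the features inside each mask.
    pooled = m @ feats.T / m.sum(dim=1, keepdim=True).clamp(min=1)  # (M, C)
    pooled = F.normalize(pooled, dim=-1)
    logits = pooled @ text_embeds.T / temperature      # (M, K)
    return logits.softmax(dim=-1)                      # per-mask class probs

C, H, W, M, K = 512, 32, 32, 5, 3
probs = classify_masks(
    torch.randn(C, H, W),
    torch.rand(M, H, W) > 0.5,
    F.normalize(torch.randn(K, C), dim=-1),
)
print(probs.shape)  # torch.Size([5, 3])
```

ROSE's key departure from this pattern is that the candidate category names themselves are generated by the LMM rather than supplied as a predefined prompt list.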
This list is automatically generated from the titles and abstracts of the papers on this site.