Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation
- URL: http://arxiv.org/abs/2409.16278v1
- Date: Tue, 24 Sep 2024 17:50:28 GMT
- Title: Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation
- Authors: Yong Xien Chng, Xuchong Qiu, Yizeng Han, Kai Ding, Wan Ding, Gao Huang
- Abstract summary: Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks.
Mask classification is the main performance bottleneck for open-vocabulary panoptic segmentation.
We propose Semantic Refocused Tuning (SMART), a novel framework that greatly enhances open-vocabulary panoptic segmentation.
- Score: 42.020470627552136
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks based on a set of texts. Despite existing efforts, it remains challenging to develop a high-performing method that generalizes effectively across new domains and requires minimal training resources. Our in-depth analysis of current methods reveals a crucial insight: mask classification is the main performance bottleneck for open-vocab. panoptic segmentation. Based on this, we propose Semantic Refocused Tuning (SMART), a novel framework that greatly enhances open-vocab. panoptic segmentation by improving mask classification through two key innovations. First, SMART adopts a multimodal Semantic-guided Mask Attention mechanism that injects task-awareness into the regional information extraction process. This enables the model to capture task-specific and contextually relevant information for more effective mask classification. Second, it incorporates Query Projection Tuning, which strategically fine-tunes the query projection layers within the Vision Language Model (VLM) used for mask classification. This adjustment allows the model to adapt the image focus of mask tokens to new distributions with minimal training resources, while preserving the VLM's pre-trained knowledge. Extensive ablation studies confirm the superiority of our approach. Notably, SMART sets new state-of-the-art results, demonstrating improvements of up to +1.3 PQ and +5.4 mIoU across representative benchmarks, while reducing training costs by nearly 10x compared to the previous best method. Our code and data will be released.
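The abstract gives no implementation details, but Query Projection Tuning lends itself to a short sketch: freeze a pre-trained VLM and mark only the attention query projections as trainable. The sketch below is an assumption-based illustration, not the paper's released code; the SimpleAttention module is a stand-in for a VLM layer, and the q_proj naming is illustrative (though, for example, Hugging Face's CLIP attention layers do name their query projections q_proj, so the same filter would apply there).
```python
# Minimal sketch of query-projection-only fine-tuning (an illustration of the
# abstract's Query Projection Tuning under stated assumptions, not the paper's code).
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Tiny attention block with separate q/k/v projections, standing in for a VLM layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

def apply_query_projection_tuning(model: nn.Module) -> list:
    """Freeze every parameter except those belonging to query projections."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = "q_proj" in name  # only query projections stay trainable
        if param.requires_grad:
            trainable.append(name)
    return trainable

vlm = nn.Sequential(*[SimpleAttention(dim=64) for _ in range(4)])
print(apply_query_projection_tuning(vlm))  # e.g. ['0.q_proj.weight', '0.q_proj.bias', ...]
```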
Related papers
- Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency.
Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z)
- FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation [47.0028071183214]
FrozenSeg is designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP).
FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data, and tested in a zero-shot manner.
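The summary leaves the integration step implicit. As a rough sketch of the general recipe (not FrozenSeg's actual architecture), one can classify mask proposals from a frozen localization model by masked pooling over frozen text-aligned pixel features; all names and shapes below are illustrative assumptions.
```python
# Hedged sketch: combine a localization model's masks (spatial knowledge) with a
# ViL model's text-aligned features (semantic knowledge), both kept frozen.
import torch
import torch.nn.functional as F

def classify_masks(pixel_feats, masks, text_embeds):
    """pixel_feats: (D, H, W) CLIP-like features; masks: (N, H, W) binary proposals
    (e.g., from SAM); text_embeds: (C, D) embedded class names. Returns (N, C) logits."""
    flat = pixel_feats.flatten(1)                                   # (D, H*W)
    m = masks.flatten(1).float()                                    # (N, H*W)
    pooled = m @ flat.T / m.sum(dim=1, keepdim=True).clamp(min=1)   # (N, D) masked average pool
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    return pooled @ text.T                                          # cosine-similarity logits

feats = torch.randn(512, 32, 32)
masks = torch.rand(8, 32, 32) > 0.5
texts = torch.randn(20, 512)
print(classify_masks(feats, masks, texts).shape)  # torch.Size([8, 20])
```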
arXiv Detail & Related papers (2024-09-05T13:36:50Z)
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
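As a rough illustration of the data-independent idea, one can filter random noise and keep the strongest responses as masked patches. ColorMAE's actual filters and patterns differ; the box filter and 75% ratio below are assumptions.
```python
# Hedged sketch of a noise-filtered binary mask pattern (illustrative, not
# ColorMAE's exact recipe): filter random noise, mask the highest responses.
import torch
import torch.nn.functional as F

def noise_filtered_mask(grid: int = 14, mask_ratio: float = 0.75, k: int = 5) -> torch.Tensor:
    noise = torch.rand(1, 1, grid, grid)
    kernel = torch.ones(1, 1, k, k) / (k * k)          # simple low-pass (box) filter
    smooth = F.conv2d(noise, kernel, padding=k // 2)   # filtered ("colored") noise
    flat = smooth.flatten()
    n_mask = int(mask_ratio * flat.numel())
    idx = flat.topk(n_mask).indices                    # mask the strongest responses
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    return mask.view(grid, grid)                       # True = patch is masked

print(noise_filtered_mask().float().mean())  # ~0.75 masking ratio
```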
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation [54.98688607911399]
We propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into Vision-Language Models (VLMs).
Existing VLM adaptation methods improve performance on base (training) queries, but fail to preserve the open-set capabilities of VLMs on novel queries.
Our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes.
arXiv Detail & Related papers (2024-05-30T15:16:06Z)
- Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization [40.78236375917571]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- Applying Unsupervised Semantic Segmentation to High-Resolution UAV Imagery for Enhanced Road Scene Parsing [12.558144256470827]
A novel unsupervised road parsing framework is presented.
The proposed method achieves a mean Intersection over Union (mIoU) of 89.96% on the development dataset without any manual annotation.
arXiv Detail & Related papers (2024-02-05T13:16:12Z)
- Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations [86.47908754383198]
Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories.
Our method generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs.
Our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset.
arXiv Detail & Related papers (2023-03-29T17:58:39Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Improving self-supervised representation learning via sequential adversarial masking [12.176299580413097]
Masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision.
We propose a new framework that generates masks in a sequential fashion with different constraints on the adversary.
arXiv Detail & Related papers (2022-12-16T04:25:43Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Exploiting Shape Cues for Weakly Supervised Semantic Segmentation [15.791415215216029]
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training.
We propose to exploit shape information to supplement the texture-biased property of convolutional neural networks (CNNs).
We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities.
arXiv Detail & Related papers (2022-08-08T17:25:31Z)
- Intelligent Masking: Deep Q-Learning for Context Encoding in Medical Image Analysis [48.02011627390706]
We develop a novel self-supervised approach that occludes targeted regions to improve the pre-training procedure.
We show that training the agent against the prediction model can significantly improve the semantic features extracted for downstream classification tasks.
arXiv Detail & Related papers (2022-03-25T19:05:06Z)
- PFENet++: Boosting Few-shot Semantic Segmentation with the Noise-filtered Context-aware Prior Mask [62.37727055343632]
We revisit the prior mask guidance proposed in "Prior Guided Feature Enrichment Network for Few-Shot Segmentation" (PFENet).
We propose the Context-aware Prior Mask (CAPM) that leverages additional nearby semantic cues for better locating the objects in query images.
We take one step further by incorporating a lightweight Noise Suppression Module (NSM) to screen out the unnecessary responses.
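The prior-mask idea can be illustrated compactly: score each query-image location by its best cosine match against the support features, then min-max normalize. This is a simplified sketch of the general prior computation from PFENet, not the noise-filtered, context-aware variant this paper proposes.
```python
# Simplified prior-mask sketch in the spirit of PFENet: for each query location,
# take the maximum cosine similarity to any support location as a coarse prior.
import torch
import torch.nn.functional as F

def prior_mask(query_feats, support_feats):
    """query_feats: (D, Hq*Wq); support_feats: (D, Hs*Ws). Returns (Hq*Wq,) prior in [0, 1]."""
    q = F.normalize(query_feats, dim=0)
    s = F.normalize(support_feats, dim=0)
    sim = q.T @ s                      # (Hq*Wq, Hs*Ws) cosine similarities
    prior = sim.max(dim=1).values      # best match over support locations
    return (prior - prior.min()) / (prior.max() - prior.min() + 1e-6)  # min-max normalize

q = torch.randn(256, 60 * 60)
s = torch.randn(256, 60 * 60)
print(prior_mask(q, s).shape)  # torch.Size([3600])
```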
arXiv Detail & Related papers (2021-09-28T15:07:43Z)
- Per-Pixel Classification is Not All You Need for Semantic Segmentation [184.2905747595058]
Mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks.
We propose MaskFormer, a simple mask classification model which predicts a set of binary masks.
Our method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
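The mask-classification formulation can be stated in a few lines: each of N queries predicts a class distribution and a binary mask, and per-pixel semantic scores follow by marginalizing over queries. The sketch below uses random tensors for illustration; the full model also handles a "no object" class, omitted here.
```python
# Sketch of MaskFormer-style mask classification: N queries each predict a class
# distribution and a binary mask; per-pixel semantics come from marginalization.
import torch

N, C, H, W = 100, 21, 64, 64                 # queries, classes, spatial size
class_logits = torch.randn(N, C)             # per-query class predictions
mask_logits = torch.randn(N, H, W)           # per-query binary mask predictions

class_probs = class_logits.softmax(dim=-1)   # (N, C)
mask_probs = mask_logits.sigmoid()           # (N, H, W)

# Marginalize over queries to get per-pixel class scores: (C, H, W)
semseg = torch.einsum("nc,nhw->chw", class_probs, mask_probs)
prediction = semseg.argmax(dim=0)            # (H, W) semantic segmentation map
```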
arXiv Detail & Related papers (2021-07-13T17:59:50Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
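A minimal sketch of the segment-based idea follows, assuming one masked view per segment so that the views jointly cover the whole sequence; the segment count and mask id are illustrative, not the paper's settings.
```python
# Hedged sketch of segment-based masking: split a token sequence into
# non-overlapping segments and produce one masked view per segment.
def segment_masks(tokens, n_segments=4):
    seg_len = len(tokens) // n_segments
    for s in range(n_segments):
        masked = list(tokens)
        for i in range(s * seg_len, (s + 1) * seg_len):
            masked[i] = -1  # -1 stands in for the [MASK] token id
        yield masked

for variant in segment_masks(list(range(12)), n_segments=3):
    print(variant)
```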
arXiv Detail & Related papers (2020-10-12T21:28:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.