Generalization Boosted Adapter for Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2409.08468v1
- Date: Fri, 13 Sep 2024 01:49:12 GMT
- Title: Generalization Boosted Adapter for Open-Vocabulary Segmentation
- Authors: Wenhao Xu, Changwei Wang, Xuxiang Feng, Rongtao Xu, Longzhao Huang, Zherui Zhang, Li Guo, Shibiao Xu
- Abstract summary: Generalization Boosted Adapter (GBA) is a novel adapter strategy that enhances the generalization and robustness of vision-language models.
As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods.
- Score: 15.91026999425076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation for dense prediction tasks like segmentation. However, directly applying VLMs to such tasks remains challenging due to their lack of pixel-level granularity and the limited data available for fine-tuning, leading to overfitting and poor generalization. To address these limitations, we propose Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA) that decouples features into amplitude and phase components, operating solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA) that employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency "noise" information and avoiding erroneous associations. Through the synergistic effect of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting issues and enhances the semantic relevance of feature representations. As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks.
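The abstract describes two mechanisms: an SDA that perturbs only the amplitude spectrum of features while keeping the phase intact, and a CCA that uses cross-attention between text-category embeddings and visual tokens. A minimal PyTorch sketch of those two ideas follows; the module interfaces, tensor shapes, FFT axis, and amplitude-mixing scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StyleDiversificationAdapter(nn.Module):
    """Illustrative SDA: perturb only the amplitude spectrum of token features,
    leaving the phase (assumed to carry semantics) untouched."""

    def __init__(self, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # assumed mixing strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) features from a shallow CLIP layer
        if not self.training:
            return x
        spec = torch.fft.fft(x, dim=1)          # spectrum over the token (spatial) axis
        amp, phase = spec.abs(), spec.angle()
        # Mix amplitudes across the batch to diversify "style" statistics
        perm = torch.randperm(x.size(0), device=x.device)
        lam = torch.rand(x.size(0), 1, 1, device=x.device) * self.alpha
        amp_aug = (1 - lam) * amp + lam * amp[perm]
        spec_aug = torch.polar(amp_aug, phase)  # recombine with the original phase
        return torch.fft.ifft(spec_aug, dim=1).real

class CorrelationConstraintAdapter(nn.Module):
    """Illustrative CCA: cross-attention that lets text-category embeddings
    attend to visual tokens, tightening text-region associations."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (batch, num_classes, dim), visual: (batch, tokens, dim)
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return self.norm(text + attended)

# Usage sketch with dummy tensors
sda, cca = StyleDiversificationAdapter(), CorrelationConstraintAdapter()
vis = torch.randn(2, 196, 512)   # e.g. 14x14 patch tokens
txt = torch.randn(2, 20, 512)    # e.g. 20 candidate class embeddings
refined_text = cca(txt, sda(vis))
print(refined_text.shape)        # torch.Size([2, 20, 512])
```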
Related papers
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation [33.67313662538398]
We propose a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone.
MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder.
We demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks.
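The sliding-window slicing mentioned above can be illustrated with a short sketch; the window size, stride, padding policy, and tensor layout are assumptions for illustration, not MROVSeg's actual code.

```python
import torch
import torch.nn.functional as F

def slice_into_windows(image: torch.Tensor, win: int = 224, stride: int = 224) -> torch.Tensor:
    """Cut a high-resolution image (C, H, W) into uniform patches that match a
    fixed encoder input size (illustrative only)."""
    c, h, w = image.shape
    # Pad so the window grid covers the whole image
    pad_h = (-(h - win) % stride) if h > win else win - h
    pad_w = (-(w - win) % stride) if w > win else win - w
    image = F.pad(image, (0, pad_w, 0, pad_h))
    patches = image.unfold(1, win, stride).unfold(2, win, stride)  # (C, nH, nW, win, win)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, win, win)

windows = slice_into_windows(torch.randn(3, 640, 960))
print(windows.shape)  # torch.Size([15, 3, 224, 224]): a 3x5 grid of 224x224 windows
```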
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
- ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation [32.852004564832455]
Open-vocabulary semantic segmentation requires models to integrate visual representations with semantic labels.
This paper introduces ProxyCLIP, a framework designed to harmonize the strengths of Contrastive Language-Image Pre-training (CLIP) and Vision Foundation Models (VFMs).
As a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4.
arXiv Detail & Related papers (2024-08-09T06:17:00Z)
- Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation [79.66299178949257]
Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions.
Vision-language foundation models, especially CLIP, have emerged as powerful tools for acquiring open-vocabulary capabilities.
H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while only requiring updating approximately 4% of the total parameters of CLIP.
arXiv Detail & Related papers (2024-05-29T07:41:34Z)
- Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation [19.20874993309959]
Vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks.
We propose a baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP).
Our method enforces localization of patches in the self-attention of CLIP's vision transformer, a property that, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature.
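Enforcing patch localization in a ViT's self-attention can be illustrated by adding a spatial-neighbourhood prior to the attention logits; the Gaussian form and bandwidth below are generic assumptions for illustration, not the NACLIP recipe.

```python
import torch

def neighbourhood_biased_attention(q, k, v, grid_hw, sigma=2.0):
    """Self-attention over ViT patch tokens with an additive spatial prior that
    favours neighbouring patches (illustrative sketch only)."""
    h, w = grid_hw
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2) patch positions
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)    # squared patch distances
    prior = -dist2 / (2 * sigma ** 2)                                   # Gaussian bias, strongest near each patch
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5 + prior
    return torch.softmax(logits, dim=-1) @ v

# Toy usage: 14x14 patch grid, 64-dim tokens
q = k = v = torch.randn(14 * 14, 64)
out = neighbourhood_biased_attention(q, k, v, grid_hw=(14, 14))
print(out.shape)  # torch.Size([196, 64])
```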
arXiv Detail & Related papers (2024-04-12T01:08:04Z)
- Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z)
- A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation [13.013635162859108]
Inferring BEV semantic segmentation from multi-camera-view images is a popular scheme in the community because it requires only cheap devices and supports real-time processing.
We propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for inferring BEV semantic segmentation.
Our method has state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
arXiv Detail & Related papers (2023-04-07T13:52:47Z)
- VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [52.489874804051304]
VoLTA is a new vision-language pre-training paradigm that utilizes only image-caption data yet achieves fine-grained region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
arXiv Detail & Related papers (2022-10-09T01:49:58Z)
- HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones.
We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains.
Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
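Cross-modality region contrastive learning of this kind is commonly formulated as an InfoNCE-style objective between paired region and word embeddings; the generic sketch below illustrates that idea and is not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(regions: torch.Tensor, words: torch.Tensor, tau: float = 0.07):
    """Generic InfoNCE-style loss between paired region and word embeddings
    (illustrative of cross-modality region contrastive learning in general)."""
    r = F.normalize(regions, dim=-1)   # (N, D) region features
    w = F.normalize(words, dim=-1)     # (N, D) matched word/phrase features
    logits = r @ w.t() / tau           # (N, N) similarity matrix
    targets = torch.arange(r.size(0), device=r.device)
    # Symmetric cross-entropy: each region should match its own word and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = region_word_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```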
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
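Adversarial augmentation on intermediate feature embeddings can be sketched as a small gradient-based perturbation of hidden features; the FGSM-style step below is a generic illustration, not the paper's algorithm (which also includes a normalization component).

```python
import torch
import torch.nn as nn

def adversarial_feature_augmentation(features: torch.Tensor, head: nn.Module,
                                     labels: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Perturb intermediate features in the direction that increases the loss
    (a generic FGSM-style step on embeddings, for illustration only)."""
    feats = features.detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(head(feats), labels)
    grad, = torch.autograd.grad(loss, feats)
    return (features + eps * grad.sign()).detach()  # augmented embeddings for an extra training pass

# Toy usage: a linear classifier head on 128-d embeddings
head = nn.Linear(128, 10)
feats = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
aug = adversarial_feature_augmentation(feats, head, labels)
print(aug.shape)  # torch.Size([32, 128])
```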
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims to segment the foreground masks of the entities that match the description given in a natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.