Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision
- URL: http://arxiv.org/abs/2403.03707v1
- Date: Wed, 6 Mar 2024 13:43:36 GMT
- Title: Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision
- Authors: Yajie Liu, Pu Ge, Qingjie Liu, Di Huang
- Abstract summary: We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
- Score: 23.931443799102663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, learning open-vocabulary semantic segmentation from text
supervision has achieved promising downstream performance. Nevertheless,
current approaches encounter an alignment granularity gap owing to the absence
of dense annotations, wherein they learn coarse image/region-text alignment
during training yet perform group/pixel-level predictions at inference. Such
discrepancy leads to suboptimal learning efficiency and inferior zero-shot
segmentation results. In this paper, we introduce a Multi-Grained Cross-modal
Alignment (MGCA) framework, which explicitly learns pixel-level alignment along
with object- and region-level alignment to bridge the granularity gap without
any dense annotations. Specifically, MGCA ingeniously constructs pseudo
multi-granular semantic correspondences upon image-text pairs and collaborates
with hard sampling strategies to facilitate fine-grained cross-modal
contrastive learning. Further, we point out the defects of existing group and
pixel prediction units in downstream segmentation and develop an adaptive
semantic unit which effectively mitigates their dilemmas including under- and
over-segmentation. Training solely on CC3M, our method achieves significant
advancements over state-of-the-art methods, demonstrating its effectiveness and
efficiency.
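As a rough illustration of the alignment objective described in the abstract, the sketch below applies a symmetric InfoNCE-style contrastive loss at image, region, and pixel level and sums the three terms. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature, the loss weights, and the assumption that pseudo multi-granular correspondences (and hard-sampled pairs) are already arranged as row-aligned embedding matrices are all illustrative.

```python
# Hedged sketch of multi-grained cross-modal contrastive alignment (not the
# authors' code). The same symmetric InfoNCE loss is applied at image-,
# region-, and pixel-level, assuming pseudo correspondences have already been
# built so that row i of each visual tensor matches row i of its text tensor.
import torch
import torch.nn.functional as F

def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE over matched (visual, text) embedding pairs.

    visual, text: (N, D) tensors whose i-th rows form a pseudo-matched pair.
    """
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = v @ t.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_grained_loss(image_emb, image_txt,
                       region_emb, region_txt,
                       pixel_emb, pixel_txt,
                       weights=(1.0, 1.0, 1.0)):
    """Sum of image-, region-, and pixel-level alignment losses.

    The region/pixel pairs are assumed to come from pseudo multi-granular
    correspondences (and hard sampling) built from image-text pairs.
    """
    w_img, w_reg, w_pix = weights
    return (w_img * info_nce(image_emb, image_txt) +
            w_reg * info_nce(region_emb, region_txt) +
            w_pix * info_nce(pixel_emb, pixel_txt))
```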
Related papers
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
- Contextrast: Contextual Contrastive Learning for Semantic Segmentation [9.051352746190448]
We propose Contextrast, a contrastive learning-based semantic segmentation method.
Our proposed method comprises two parts: a) contextual contrastive learning (CCL) and b) boundary-aware negative sampling.
We demonstrate that our Contextrast substantially enhances the performance of semantic segmentation networks.
arXiv Detail & Related papers (2024-04-16T15:04:55Z)
- Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation [117.36746226803993]
We introduce self-supervised spatially-consistent grouping and associate it with text-supervised semantic segmentation.
Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition.
Our method achieves 59.2% mIoU and 32.4% mIoU on the Pascal VOC and Pascal Context benchmarks, respectively.
arXiv Detail & Related papers (2023-04-03T16:24:39Z)
- CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z)
- Margin Preserving Self-paced Contrastive Learning Towards Domain Adaptation for Medical Image Segmentation [51.93711960601973]
We propose a novel margin preserving self-paced contrastive Learning model for cross-modal medical image segmentation.
With the guidance of progressively refined semantic prototypes, a novel margin preserving contrastive loss is proposed to boost the discriminability of embedded representation space.
Experiments on cross-modal cardiac segmentation tasks demonstrate that MPSCL significantly improves semantic segmentation performance.
arXiv Detail & Related papers (2021-03-15T15:23:10Z)
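The margin preserving contrastive idea in the MPSCL entry above can be illustrated with a small, hedged sketch: pixel embeddings are scored against class prototypes by cosine similarity, a margin is imposed on the true-class similarity, and a cross-entropy over the scaled logits pulls each pixel toward its prototype and away from the others. The function name, the margin and scale values, and the CosFace-style formulation are illustrative assumptions, not the paper's actual loss.

```python
# Hedged sketch of a prototype-guided, margin-based contrastive objective in
# the spirit of MPSCL (not the authors' code).
import torch
import torch.nn.functional as F

def prototype_margin_loss(pixel_emb, labels, prototypes,
                          margin=0.2, scale=10.0):
    """pixel_emb: (N, D) embeddings, labels: (N,) long class ids,
    prototypes: (C, D) per-class semantic prototypes (e.g. running means)."""
    emb = F.normalize(pixel_emb, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    cos = emb @ protos.t()                     # (N, C) cosine similarities
    # subtract the margin only from the ground-truth class similarity
    one_hot = F.one_hot(labels, num_classes=protos.size(0)).float()
    logits = scale * (cos - margin * one_hot)
    return F.cross_entropy(logits, labels)
```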
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.