Related papers: MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

URL: http://arxiv.org/abs/2407.21654v1
Date: Wed, 31 Jul 2024 14:56:42 GMT
Title: MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
Authors: Anurag Das, Xinting Hu, Li Jiang, Bernt Schiele
Abstract summary: Large-scale vision-language models such as CLIP can improve semantic segmentation performance. We introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment. MTA-CLIP achieves state-of-the-art results, surpassing prior works by an average of 2.8% and 1.3% on the ADE20k and Cityscapes benchmarks, respectively.
Score: 53.235290505274676
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent approaches have shown that large-scale vision-language models such as CLIP can improve semantic segmentation performance. These methods typically aim for pixel-level vision-language alignment, but often rely on low-resolution image features from CLIP, resulting in class ambiguities along boundaries. Moreover, the global scene representations in CLIP text embeddings do not directly correlate with the local and detailed pixel-level features, making meaningful alignment more difficult. To address these limitations, we introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment. Specifically, we first propose a Mask-Text Decoder that enhances the mask representations using rich textual data with the CLIP language model. Subsequently, it aligns mask representations with text embeddings using Mask-to-Text Contrastive Learning. Furthermore, we introduce MaskText Prompt Learning, utilizing multiple context-specific prompts for text embeddings to capture diverse class representations across masks. Overall, MTA-CLIP achieves state-of-the-art performance, surpassing prior works by an average of 2.8% and 1.3% on the standard benchmark datasets ADE20k and Cityscapes, respectively.
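
Illustrative sketch (not from the paper): the abstract describes aligning mask representations with text embeddings through contrastive learning. A generic InfoNCE-style version of that idea can be written as below. This is a minimal sketch under assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of a single text embedding per class are hypothetical choices made here for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mask_text_contrastive_loss(mask_embs, text_embs, labels, temperature=0.07):
    """Generic InfoNCE-style loss between mask and class-text embeddings.

    mask_embs: list of mask embedding vectors (one per mask).
    text_embs: list of text embedding vectors (one per class; an assumption).
    labels:    class index of each mask.
    Each mask is pulled toward its own class text and pushed away from the
    text embeddings of all other classes.
    """
    if not mask_embs:
        raise ValueError("at least one mask embedding is required")
    masks = [l2_normalize(m) for m in mask_embs]
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for mask, label in zip(masks, labels):
        # Cosine similarity to every class text, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(mask, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the mask's class as the positive.
        total += log_denom - logits[label]
    return total / len(masks)


# Toy data: two masks and two class texts in a 2-D embedding space.
toy_masks = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = mask_text_contrastive_loss(toy_masks, toy_texts, [0, 1])
misaligned_loss = mask_text_contrastive_loss(toy_masks, toy_texts, [1, 0])
```

On the toy data, the loss is close to zero when each mask is labelled with the class whose text embedding it matches, and large when the labels are swapped.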

Related papers

PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning [31.386303698437214]
We propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. We replace CLIP's original text encoder with the LLM and propose a three-branch pixel-text alignment learning framework. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-11-06T17:54:12Z)
Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem. The key innovation is semantic descriptors, a new textual representation of segmentation masks. Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z)
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs [0.351124620232225]
FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation.
arXiv Detail & Related papers (2025-04-02T17:19:59Z)
Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM). AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z)
High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation [109.19165503929992]
We present MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. After low-cost fine-tuning, MaskCLIP++ significantly improves the mask classification performance on multi-domain datasets. We achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets.
arXiv Detail & Related papers (2024-12-16T05:44:45Z)
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition [56.968108142307976]
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM). Our approach introduces canonical class-aware glyph masks to suppress background and text style noise. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
arXiv Detail & Related papers (2024-02-21T09:22:45Z)
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment. We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models. An auxiliary fusion module injecting unmasked image embedding into masked text embedding is proposed.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)
CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation [104.5033800500497]
Referring image segmentation aims at localizing all pixels of the visual objects described by a natural language sentence. Previous works learn to straightforwardly align the sentence embedding and pixel-level embedding for highlighting the referred objects. We propose CoupAlign, a simple yet effective multi-level visual-semantic alignment method.
arXiv Detail & Related papers (2022-12-04T08:53:42Z)
Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [10.484851004093919]
We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts. We propose a novel Text-grounded Contrastive Learning framework that enables a model to directly learn region-text alignment.
arXiv Detail & Related papers (2022-12-01T18:59:03Z)
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [45.81698881151867]
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-
arXiv Detail & Related papers (2022-10-09T02:57:32Z)
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
Open-Vocabulary Universal Image Segmentation with MaskCLIP [24.74805434602145]
We tackle an emerging computer vision task, open-vocabulary universal image segmentation. We first build a baseline method by directly adopting pre-trained CLIP models. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual.
arXiv Detail & Related papers (2022-08-18T17:55:37Z)
CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS). CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment. Our proposed framework significantly outperforms the previous state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.