CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation
- URL: http://arxiv.org/abs/2212.01769v1
- Date: Sun, 4 Dec 2022 08:53:42 GMT
- Title: CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation
- Authors: Zicheng Zhang, Yi Zhu, Jianzhuang Liu, Xiaodan Liang, Wei Ke
- Abstract summary: Referring image segmentation aims at localizing all pixels of the visual objects described by a natural language sentence.
Previous works directly align the sentence embedding with pixel-level embeddings to highlight the referred objects.
We propose CoupAlign, a simple yet effective multi-level visual-semantic alignment method.
- Score: 104.5033800500497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims at localizing all pixels of the visual
objects described by a natural language sentence. Previous works learn to
directly align the sentence embedding with pixel-level embeddings to highlight
the referred objects, but they ignore the semantic consistency of pixels within
the same object, leading to incomplete masks and localization errors in
predictions. To tackle this problem, we propose CoupAlign, a simple
yet effective multi-level visual-semantic alignment method, to couple
sentence-mask alignment with word-pixel alignment to enforce object mask
constraint for achieving more accurate localization and segmentation.
Specifically, the Word-Pixel Alignment (WPA) module performs early fusion of
linguistic and pixel-level features in intermediate layers of the vision and
language encoders. Based on the word-pixel aligned embedding, a set of mask
proposals are generated to hypothesize possible objects. Then in the
Sentence-Mask Alignment (SMA) module, the masks are weighted by the sentence
embedding to localize the referred object, and finally projected back to
aggregate the pixels for the target. To further enhance the learning of the two
alignment modules, an auxiliary loss is designed to contrast the foreground and
background pixels. By hierarchically aligning pixels and masks with linguistic
features, our CoupAlign captures the pixel coherence at both visual and
semantic levels, thus generating more accurate predictions. Extensive
experiments on popular datasets (e.g., RefCOCO and G-Ref) show that our method
achieves consistent improvements over state-of-the-art methods, e.g., an oIoU
increase of about 2% on the validation and test sets of RefCOCO. Notably,
CoupAlign shows a remarkable ability to distinguish the target from multiple
objects of the same class.
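For a concrete picture of the data flow the abstract describes, below is a minimal PyTorch sketch of the two alignment stages and the auxiliary loss. All class names, tensor shapes, and design choices here (cross-attention for word-pixel fusion, learned mask queries for proposals, a cosine-similarity foreground/background contrast) are illustrative assumptions, not CoupAlign's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordPixelAlignment(nn.Module):
    """Sketch of a WPA-style block: early fusion of word features into
    pixel features via cross-attention (one plausible fusion choice)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pixel_feats, word_feats):
        # pixel_feats: (B, H*W, C) flattened visual tokens
        # word_feats:  (B, L, C) word embeddings from the language encoder
        fused, _ = self.cross_attn(pixel_feats, word_feats, word_feats)
        return self.norm(pixel_feats + fused)


class SentenceMaskAlignment(nn.Module):
    """Sketch of an SMA-style block: generate N mask proposals, weight
    them by similarity to the sentence embedding, and project the
    weighted proposals back onto the pixel grid."""

    def __init__(self, dim, num_masks=20):
        super().__init__()
        self.mask_queries = nn.Parameter(torch.randn(num_masks, dim))

    def forward(self, pixel_feats, sent_emb, h, w):
        # pixel_feats: (B, H*W, C); sent_emb: (B, C)
        b = pixel_feats.size(0)
        queries = self.mask_queries.unsqueeze(0).expand(b, -1, -1)  # (B, N, C)
        # Each query hypothesizes one object: dot products with pixel
        # features give a soft mask proposal over the grid.
        masks = torch.einsum("bnc,bpc->bnp", queries, pixel_feats).sigmoid()
        # Score each proposal against the sentence embedding, then
        # aggregate the weighted proposals into one segmentation map.
        weights = torch.einsum("bnc,bc->bn", queries, sent_emb).softmax(dim=-1)
        seg = torch.einsum("bn,bnp->bp", weights, masks)
        return seg.view(b, 1, h, w)


def fg_bg_contrast_loss(pixel_emb, sent_emb, gt_mask, tau=0.07):
    """Sketch of the auxiliary loss: pull foreground pixel embeddings
    toward the sentence embedding and push background pixels away."""
    # pixel_emb: (B, H*W, C); sent_emb: (B, C); gt_mask: (B, H*W) in {0, 1}
    sim = F.cosine_similarity(pixel_emb, sent_emb.unsqueeze(1), dim=-1) / tau
    return F.binary_cross_entropy_with_logits(sim, gt_mask.float())
```

In the full method these blocks would be interleaved with intermediate layers of the vision and language encoders and trained jointly with the segmentation loss; the sketch only fixes the data flow described in the abstract.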
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
- MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment [53.235290505274676]
Large-scale vision-language models such as CLIP can improve semantic segmentation performance.
We introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment.
MTA-CLIP achieves state-of-the-art results, surpassing prior works by an average of 2.8% and 1.3% on benchmark datasets.
arXiv Detail & Related papers (2024-07-31T14:56:42Z)
- LAIP: Learning Local Alignment from Image-Phrase Modeling for Text-based Person Search [16.7500024682162]
This paper proposes the Local Alignment from Image-Phrase modeling (LAIP) framework, with a Bidirectional Attention-weighted local alignment (BidirAtt) module and a Mask Phrase Modeling (MPM) module.
Experiments conducted on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate the superiority of the LAIP framework over existing methods.
arXiv Detail & Related papers (2024-06-16T08:37:24Z)
- Variance-insensitive and Target-preserving Mask Refinement for Interactive Image Segmentation [68.16510297109872]
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing.
We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement, to enhance segmentation quality with fewer user inputs.
Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.
arXiv Detail & Related papers (2023-12-22T02:31:31Z)
- LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
We also propose an auxiliary fusion module that injects the unmasked image embedding into the masked text embedding.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
- Unmasking Anomalies in Road-Scene Segmentation [18.253109627901566]
Anomaly segmentation is a critical task for driving applications.
We propose a paradigm change by shifting from per-pixel classification to mask classification.
Mask2Anomaly demonstrates the feasibility of integrating an anomaly detection method in a mask-classification architecture.
arXiv Detail & Related papers (2023-07-25T08:23:10Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- Evidential fully convolutional network for semantic segmentation [6.230751621285322]
We propose a hybrid architecture composed of a fully convolutional network (FCN) and a Dempster-Shafer layer for image semantic segmentation.
Experiments show that the proposed combination improves the accuracy and calibration of semantic segmentation by assigning confusing pixels to multi-class sets.
arXiv Detail & Related papers (2021-03-25T01:21:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.