Related papers: Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

URL: http://arxiv.org/abs/2402.08960v2
Date: Tue, 11 Jun 2024 17:01:02 GMT
Title: Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
Authors: Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu
Abstract summary: Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework. It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
Score: 87.15580604023555
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current state-of-the-art open-vocabulary segmentation methods typically rely on image-mask-text triplet annotations for supervision. However, acquiring such detailed annotations is labour-intensive and poses scalability challenges in complex real-world scenarios. While existing weakly-supervised approaches leverage image-text pairs to reduce the expensive annotation cost, the lack of mask supervision makes it difficult for the model to locate multiple instances and accurately group pixels with similar semantics, significantly hampering versatility and performance. In this paper, we introduce Unpair-Seg, a novel weakly-supervised open-vocabulary segmentation framework that learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. Unpair-Seg initially predicts a set of binary masks and generates pseudo labels by identifying confident pairs of masks and text entities. We then train a feature adapter to align region embeddings with text embeddings based on these pseudo labels, achieving open-vocabulary segmentation. However, the inherent noise in the mask-entity correspondence poses a challenge to obtaining reliable pairs. To address this, we employ a vision-language large model to re-caption the input images and extract precise entities, and we design a multi-scale matching strategy to reduce noisy mask-entity pairs. Our Unpair-Seg framework demonstrates impressive performance, achieving 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets, significantly narrowing the gap between fully-supervised and weakly-supervised methods.
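Illustration (not from the paper): the abstract describes generating pseudo labels by keeping only confident mask-entity pairs. The sketch below shows one generic way such a filter could look, assuming region and text embeddings are already available as plain vectors. The function names, the mutual-best-match rule, and the 0.8 threshold are all hypothetical choices made for this example; the paper's actual multi-scale matching strategy is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def confident_pairs(mask_embs, entity_embs, threshold=0.8):
    """Return (mask_idx, entity_idx, score) for mutual best matches above threshold.

    Assumption for illustration only: a pair counts as "confident" when the
    entity is the mask's best match, the mask is the entity's best match,
    and their similarity is at least `threshold`.
    """
    if not mask_embs or not entity_embs:
        return []
    sims = [[cosine(m, e) for e in entity_embs] for m in mask_embs]
    pairs = []
    for i, row in enumerate(sims):
        j = max(range(len(row)), key=row.__getitem__)  # best entity for mask i
        best_mask = max(range(len(sims)), key=lambda k: sims[k][j])  # best mask for entity j
        if best_mask == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy embeddings: two masks align cleanly with one entity each,
# the third mask is ambiguous and should be dropped.
mask_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
entity_embs = [[0.9, 0.1], [0.1, 0.9]]
pairs = confident_pairs(mask_embs, entity_embs)
```

In this toy input the ambiguous third mask is discarded because it is neither entity's best match, which is the kind of noise reduction the abstract motivates; a real system would operate on learned embeddings and would likely need a tuned threshold.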

Related papers

SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning [53.638998508418545]
This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning). SegCaptioning aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks.
arXiv Detail & Related papers (2025-12-01T18:33:04Z)
Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem. The key innovation is semantic descriptors, a new textual representation of segmentation masks. Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z)
SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models [6.0870128457015715]
We show that cross-attention alone provides very coarse object localization, which can nevertheless provide initial seeds. We also observe that a synthetic image guided by a simple text prompt often has a uniform background, which makes correspondences easier to find. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion.
arXiv Detail & Related papers (2025-07-26T05:44:00Z)
SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches [116.1810651297801]
SketchYourSeg establishes freehand sketches as a powerful query modality for subjective image segmentation. Our evaluations demonstrate superior performance over existing approaches across diverse benchmarks.
arXiv Detail & Related papers (2025-01-27T13:07:51Z)
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining [2.9010546489056415]
Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets. In the fashion domain, datasets often exhibit a disparity between the information conveyed in image and text. We propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text.
arXiv Detail & Related papers (2024-04-01T15:01:38Z)
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework. TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing. The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation. Our approach involves generating fine-grained patch-text pair data by mixing image patches while preserving the correspondence between patches and text. With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
arXiv Detail & Related papers (2023-08-09T09:35:16Z)
StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation. We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image. We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation [16.900404701997502]
We propose a GAN-based approach that generates images conditioned on latent masks. We show that such mask-conditioned image generation can be learned faithfully when conditioning the masks in a hierarchical manner. It also lets us generate image-mask pairs for training a segmentation network, which outperforms the state-of-the-art unsupervised segmentation methods on established benchmarks.
arXiv Detail & Related papers (2021-12-02T07:57:56Z)
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations. We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images. Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.