MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner
for Open-World Semantic Segmentation
- URL: http://arxiv.org/abs/2308.04829v2
- Date: Wed, 13 Mar 2024 03:25:32 GMT
- Authors: Kaixin Cai, Pengzhen Ren, Yi Zhu, Hang Xu, Jianzhuang Liu, Changlin
Li, Guangrun Wang, Xiaodan Liang
- Abstract summary: We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach generates fine-grained patch-text pair data by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, semantic segmentation models trained with image-level text
supervision have shown promising results in challenging open-world scenarios.
However, these models still face difficulties in learning fine-grained semantic
alignment at the pixel level and predicting accurate object masks. To address
this issue, we propose MixReorg, a novel and straightforward pre-training
paradigm for semantic segmentation that enhances a model's ability to
reorganize patches mixed across images, exploring both local visual relevance
and global semantic coherence. Our approach generates fine-grained
patch-text pair data by mixing image patches while preserving the
correspondence between patches and text. The model is then trained to minimize
the segmentation loss of the mixed images and the two contrastive losses of the
original and restored features. With MixReorg as a mask learner, conventional
text-supervised semantic segmentation models can achieve highly generalizable
pixel-semantic alignment ability, which is crucial for open-world segmentation.
After training with large-scale image-text data, MixReorg models can be applied
directly to segment visual objects of arbitrary categories, without the need
for further fine-tuning. Our proposed framework demonstrates strong performance
on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT
by significant margins of 5.0%, 6.2%, 2.5%, and 3.4% mIoU on PASCAL VOC2012,
PASCAL Context, MS COCO, and ADE20K, respectively.
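To make the mixing step concrete, the sketch below (PyTorch-style Python, not the authors' released code) shows how cross-image patch mixing with a recoverable patch-to-source mapping might look. The function name mix_patch_tokens and the uniform per-slot shuffling are illustrative assumptions; the paper mixes raw image patches before encoding, but the provenance bookkeeping is the same.

    import torch

    def mix_patch_tokens(tokens, generator=None):
        """Mix patch tokens across a batch while recording provenance.

        tokens: (B, N, D) patch embeddings, one row of N patches per image.
        Returns (mixed, src), where mixed[b, n] = tokens[src[b, n], n]: each
        patch slot keeps its spatial position but may come from another image,
        and src records the source image, so the patch-text correspondence
        is preserved exactly.
        """
        B, N, _ = tokens.shape
        src = torch.randint(0, B, (B, N), generator=generator)  # source image per slot
        pos = torch.arange(N).expand(B, N)                      # spatial position is kept
        return tokens[src, pos], src

    # Toy usage: 4 images, a 14x14 ViT patch grid, 768-dim tokens.
    tokens = torch.randn(4, 196, 768)
    mixed, src = mix_patch_tokens(tokens, generator=torch.Generator().manual_seed(0))
    assert torch.equal(mixed[0, 5], tokens[src[0, 5], 5])

During training, src is what allows the objectives described in the abstract to be constructed: the segmentation target of the mixed image (which caption each patch slot belongs to) and the contrastive pairs between original and restored features.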
Related papers
- Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals
Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global semantic categories within an image corpus without any form of annotation.
We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation.
This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with an expectation-maximization algorithm, PriMaPs-EM.
arXiv Detail & Related papers (2024-04-25T17:58:09Z)
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
- FuseNet: Self-Supervised Dual-Path Network for Medical Image Segmentation
FuseNet is a dual-stream framework for self-supervised semantic segmentation.
Its cross-modal fusion technique extends the principles of CLIP by replacing textual data with augmented images.
Experiments on skin lesion and lung segmentation datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-11-22T00:03:16Z)
- ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency
We propose Multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
- Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories.
We propose a transformer-based model for OVS, termed OVSegmentor, which exploits web-crawled image-text pairs for pre-training.
Our model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs. 134M images) for pre-training.
arXiv Detail & Related papers (2023-01-22T13:10:05Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP (a minimal version of this patch-text matching inference is sketched after this list).
Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Segmenter: Transformer for Semantic Segmentation
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
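A common thread across these text-supervised models, including MixReorg, is the inference recipe: score each patch embedding against text embeddings of candidate class names and assign every patch its best match. Below is a minimal sketch of that recipe in PyTorch-style Python; the encoders are abstracted away as precomputed embeddings, and the function name and toy sizes are illustrative assumptions, not any specific paper's released code.

    import torch
    import torch.nn.functional as F

    def zero_shot_segment(patch_embeds, text_embeds, grid_hw):
        """Assign each image patch the best-matching class name.

        patch_embeds: (N, D) patch features from an image encoder.
        text_embeds:  (K, D) embeddings of K class-name prompts.
        grid_hw:      (gh, gw) patch grid shape, with N == gh * gw.
        Returns a (gh, gw) map of class indices, one per patch.
        """
        p = F.normalize(patch_embeds, dim=-1)  # unit-norm patch features
        t = F.normalize(text_embeds, dim=-1)   # unit-norm text features
        logits = p @ t.T                       # (N, K) cosine similarities
        return logits.argmax(dim=-1).reshape(grid_hw)

    # Toy usage: a 14x14 patch grid, 512-dim embeddings, 3 candidate classes.
    seg = zero_shot_segment(torch.randn(196, 512), torch.randn(3, 512), (14, 14))
    print(seg.shape)  # torch.Size([14, 14])

Because it needs only class names at test time, this recipe is what lets such models segment arbitrary categories without further fine-tuning.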