Segment Everything Everywhere All at Once
- URL: http://arxiv.org/abs/2304.06718v4
- Date: Tue, 11 Jul 2023 18:13:14 GMT
- Title: Segment Everything Everywhere All at Once
- Authors: Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng
Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee
- Abstract summary: We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
- Score: 124.90835636901096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present SEEM, a promptable and interactive model for
segmenting everything everywhere all at once in an image, as shown in Fig.1. In
SEEM, we propose a novel decoding mechanism that enables diverse prompting for
all types of segmentation tasks, aiming at a universal segmentation interface
that behaves like large language models (LLMs). More specifically, SEEM is
designed with four desiderata: i) Versatility. We introduce a new visual prompt
to unify different spatial queries including points, boxes, scribbles and
masks, which can further generalize to a different referring image; ii)
Compositionality. We learn a joint visual-semantic space between text and
visual prompts, which facilitates the dynamic composition of two prompt types
required for various segmentation tasks; iii) Interactivity. We further
incorporate learnable memory prompts into the decoder to retain segmentation
history through mask-guided cross-attention from decoder to image features; and
iv) Semantic-awareness. We use a text encoder to encode text queries and mask
labels into the same semantic space for open-vocabulary segmentation. We
conduct a comprehensive empirical study to validate the effectiveness of SEEM
across diverse segmentation tasks. Notably, our single SEEM model achieves
competitive performance across interactive segmentation, generic segmentation,
referring segmentation, and video object segmentation on 9 datasets with
minimum 1/100 supervision. Furthermore, SEEM showcases a remarkable capacity
for generalization to novel prompts or their combinations, rendering it a
readily universal image segmentation interface.
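To make the decoding mechanism described in the abstract concrete, below is a minimal PyTorch-style sketch of a SEEM-like promptable decoder. This is not the released SEEM code: the class name PromptableDecoder, the tensor shapes, the layer sizes, and the helper names are illustrative assumptions. It only shows how encoded visual prompts (points, boxes, scribbles, masks), text prompts, and learnable memory prompts could be concatenated as extra queries to a single Transformer decoder that cross-attends to image features, with the resulting query embeddings rendered into masks and matched against text-encoded labels for open-vocabulary output.

```python
# Minimal sketch of a SEEM-style promptable decoder (illustrative names, not the official code).
import torch
import torch.nn as nn

class PromptableDecoder(nn.Module):
    """Single decoder that consumes visual, text, and memory prompts as extra queries."""

    def __init__(self, dim=256, num_mask_queries=100, num_memory=8, num_heads=8):
        super().__init__()
        self.mask_queries = nn.Parameter(torch.randn(num_mask_queries, dim))
        self.memory_prompts = nn.Parameter(torch.randn(num_memory, dim))  # retains interaction history
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.mask_head = nn.Linear(dim, dim)    # dot-product with pixel features -> masks
        self.class_head = nn.Linear(dim, dim)   # projected into the joint visual-semantic space

    def forward(self, image_feats, pixel_feats, visual_prompts=None, text_prompts=None):
        # image_feats:  (B, HW, dim)   flattened backbone features (decoder memory)
        # pixel_feats:  (B, dim, H, W) per-pixel embeddings used to render masks
        # visual_prompts / text_prompts: (B, K, dim) encoded points/boxes/scribbles/masks or text
        B = image_feats.shape[0]
        queries = [self.mask_queries.expand(B, -1, -1), self.memory_prompts.expand(B, -1, -1)]
        if visual_prompts is not None:
            queries.append(visual_prompts)
        if text_prompts is not None:
            queries.append(text_prompts)
        queries = torch.cat(queries, dim=1)                  # all prompt types share one decoder
        out = self.decoder(tgt=queries, memory=image_feats)  # cross-attention to image features
        mask_embed = self.mask_head(out)                     # (B, Q, dim)
        masks = torch.einsum("bqc,bchw->bqhw", mask_embed, pixel_feats)
        sem_embed = self.class_head(out)                     # compared against text-encoded labels
        return masks, sem_embed


# Toy usage: open-vocabulary labels come from similarity to text embeddings of class names.
if __name__ == "__main__":
    B, dim, H, W = 1, 256, 32, 32
    dec = PromptableDecoder(dim=dim)
    image_feats = torch.randn(B, H * W, dim)
    pixel_feats = torch.randn(B, dim, H, W)
    point_prompt = torch.randn(B, 1, dim)   # e.g. an encoded user click
    text_prompt = torch.randn(B, 1, dim)    # e.g. an encoded referring phrase
    masks, sem = dec(image_feats, pixel_feats, point_prompt, text_prompt)
    label_embeds = torch.randn(10, dim)     # stand-in for text-encoded class names
    logits = sem @ label_embeds.t()         # semantic-awareness: match queries to label embeddings
    print(masks.shape, logits.shape)        # torch.Size([1, 110, 32, 32]) torch.Size([1, 110, 10])
```

In the toy usage, class logits are obtained by a dot product between query embeddings and text-encoded class-name embeddings, mirroring the joint visual-semantic space described in the abstract.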
Related papers
- Text4Seg: Reimagining Image Segmentation as Text Generation [32.230379277018194]
We introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem.
Its key innovation is semantic descriptors, a new textual representation of segmentation masks in which each image patch is mapped to its corresponding text label (a minimal sketch follows this list).
We show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones.
arXiv Detail & Related papers (2024-10-13T14:28:16Z)
- Interactive Segmentation for Diverse Gesture Types Without Context [19.29886866117842]
We propose a simplified interactive segmentation task in which a user only needs to mark an image.
The input can be any gesture type, and the gesture type does not need to be specified.
We analyze numerous interactive segmentation algorithms, including ones adapted for our novel task.
arXiv Detail & Related papers (2023-07-20T01:37:32Z)
- AIMS: All-Inclusive Multi-Level Segmentation [93.5041381700744]
We propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation.
We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation.
arXiv Detail & Related papers (2023-05-28T16:28:49Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple open-vocabulary segmentation and detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose Multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense annotation effort.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims to segment the foreground masks of the entities that match the description given in a natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
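The semantic-descriptor idea from the Text4Seg entry above can be illustrated with a minimal sketch. The serialization below (space-separated per-patch class names) and the helper names mask_to_descriptors / descriptors_to_mask are an assumed, simplified format for illustration, not the paper's exact representation.

```python
# Minimal sketch of a "semantic descriptor" style text-as-mask representation
# (illustrative format; not the exact serialization used by Text4Seg).

def mask_to_descriptors(patch_labels):
    """Flatten a 2-D grid of per-patch class names into a text sequence.

    patch_labels: list of rows, each row a list of class-name strings,
    e.g. the dominant class of each image patch.
    """
    return " ".join(label for row in patch_labels for label in row)

def descriptors_to_mask(text, height, width):
    """Recover the per-patch label grid from the generated text sequence."""
    tokens = text.split()
    assert len(tokens) == height * width, "descriptor length must match the patch grid"
    return [tokens[r * width:(r + 1) * width] for r in range(height)]

if __name__ == "__main__":
    grid = [
        ["sky", "sky", "sky", "sky"],
        ["sky", "dog", "dog", "sky"],
        ["grass", "dog", "dog", "grass"],
        ["grass", "grass", "grass", "grass"],
    ]
    text = mask_to_descriptors(grid)   # an MLLM would be fine-tuned to generate this string
    print(text)
    recovered = descriptors_to_mask(text, 4, 4)
    assert recovered == grid           # segmentation becomes plain text generation
```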
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.