Prompt-Based Multi-Modal Image Segmentation
- URL: http://arxiv.org/abs/2112.10003v1
- Date: Sat, 18 Dec 2021 21:27:19 GMT
- Title: Prompt-Based Multi-Modal Image Segmentation
- Authors: Timo Lüddecke and Alexander S. Ecker
- Abstract summary: We propose a system that can generate image segmentations based on arbitrary prompts at test time.
A prompt can be either a text or an image.
We build upon the CLIP model as a backbone, which we extend with a transformer-based decoder.
- Score: 81.58378196535003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image segmentation is usually addressed by training a model for a fixed set
of object classes. Incorporating additional classes or more complex queries
later is expensive as it requires re-training the model on a dataset that
encompasses these expressions. Here we propose a system that can generate image
segmentations based on arbitrary prompts at test time. A prompt can be either a
text or an image. This approach enables us to create a unified model (trained
once) for three common segmentation tasks, which come with distinct challenges:
referring expression segmentation, zero-shot segmentation and one-shot
segmentation. We build upon the CLIP model as a backbone which we extend with a
transformer-based decoder that enables dense prediction. After training on an
extended version of the PhraseCut dataset, our system generates a binary
segmentation map for an image based on a free-text prompt or on an additional
image expressing the query. Different variants of the latter image-based
prompts are analyzed in detail. This novel hybrid input allows for dynamic
adaptation not only to the three segmentation tasks mentioned above, but to any
binary segmentation task where a text or image query can be formulated.
Finally, we find our system to adapt well to generalized queries involving
affordances or properties. Source code: https://eckerlab.org/code/clipseg
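The mechanism described in the abstract lends itself to a compact sketch. Below is a minimal PyTorch illustration, assuming illustrative sizes (512-d embeddings, 16-pixel patches, 352-pixel inputs) rather than the paper's exact configuration: a shared CLIP embedding of the prompt, whether text or image, conditions a small transformer decoder over the visual tokens, which is reshaped into a dense binary map.

```python
# Hedged sketch of prompt-conditioned dense prediction: a frozen CLIP
# backbone would supply `visual_tokens` (patch features) and `prompt_emb`
# (the embedding of a text or image prompt); a light transformer decoder
# conditions on the prompt and emits per-pixel logits. All sizes are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class PromptConditionedDecoder(nn.Module):
    def __init__(self, dim=512, depth=3, heads=8, patch=16, img=352):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.to_pixels = nn.Linear(dim, patch * patch)  # per-token logits
        self.patch, self.img = patch, img

    def forward(self, visual_tokens, prompt_emb):
        # visual_tokens: (B, N, dim) CLIP patch features, N = (img/patch)^2
        # prompt_emb:    (B, dim) CLIP embedding of the text or image query
        memory = prompt_emb.unsqueeze(1)                # (B, 1, dim)
        x = self.decoder(tgt=visual_tokens, memory=memory)
        logits = self.to_pixels(x)                      # (B, N, patch*patch)
        b, side = logits.shape[0], self.img // self.patch
        logits = logits.view(b, side, side, self.patch, self.patch)
        return logits.permute(0, 1, 3, 2, 4).reshape(b, self.img, self.img)

decoder = PromptConditionedDecoder()
vis = torch.randn(2, 484, 512)   # stand-in for CLIP visual tokens (22*22)
qry = torch.randn(2, 512)        # stand-in for a CLIP text/image embedding
mask = torch.sigmoid(decoder(vis, qry)) > 0.5   # (2, 352, 352) binary map
```

Because text and image prompts land in CLIP's joint embedding space, the same trained decoder can serve referring-expression, zero-shot, and one-shot segmentation, which is what the abstract means by a unified model trained once.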
Related papers
- IFSENet : Harnessing Sparse Iterations for Interactive Few-shot Segmentation Excellence [2.822194296769473]
Few-shot segmentation techniques reduce the number of images required to learn to segment a new class.
Interactive segmentation techniques focus on incrementally improving the segmentation of one object at a time.
We combine the two concepts to drastically reduce the effort required to train segmentation models for novel classes.
arXiv Detail & Related papers (2024-03-22T10:15:53Z)
- Unsupervised Universal Image Segmentation [59.0383635597103]
We propose an Unsupervised Universal model (U2Seg) adept at performing various image segmentation tasks.
U2Seg generates pseudo semantic labels for these segmentation tasks by leveraging self-supervised models.
We then self-train the model on these pseudo semantic labels, yielding substantial performance gains.
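A hedged sketch of the pseudo-label self-training loop that summary describes, assuming features from a frozen self-supervised backbone are clustered into K pseudo classes; the `student` network and `centroids` are illustrative stand-ins:

```python
# Hedged sketch of self-training on pseudo semantic labels. A frozen
# self-supervised backbone yields `feats`; k-means centroids over the
# dataset define K pseudo classes; the student is trained on the hard
# assignments. Names here are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels(feats, centroids):
    # feats: (B, C, H, W) frozen features; centroids: (K, C)
    b, c, h, w = feats.shape
    flat = F.normalize(feats.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    sim = flat @ F.normalize(centroids, dim=1).t()      # (B*H*W, K)
    return sim.argmax(dim=1).view(b, h, w)              # hard pseudo labels

def self_train_step(student, optimizer, images, feats, centroids):
    labels = pseudo_labels(feats, centroids)            # (B, H, W)
    logits = student(images)                            # (B, K, H, W)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```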
arXiv Detail & Related papers (2023-12-28T18:59:04Z)
- Text and Click inputs for unambiguous open vocabulary instance segmentation [21.03169732771627]
We propose a new segmentation process, Text + Click, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment.
We demonstrate that the combination of a single user-specified foreground click and a text prompt allows a model to better disambiguate overlapping or co-occurring semantic categories.
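One plausible way to feed such a click to a model, sketched under the assumption that the click is rasterized as a Gaussian heatmap channel appended to the image (the paper's exact encoding may differ):

```python
# Hedged sketch: encode a single foreground click as a Gaussian heatmap
# and stack it onto the RGB input; the text phrase would condition the
# model separately. The sigma and fusion scheme are assumptions.
import torch

def click_heatmap(h, w, y, x, sigma=10.0):
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    return torch.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))

image = torch.randn(1, 3, 224, 224)              # RGB input
click = click_heatmap(224, 224, y=120, x=96)     # user's foreground click
model_input = torch.cat([image, click.view(1, 1, 224, 224)], dim=1)  # (1, 4, H, W)
```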
arXiv Detail & Related papers (2023-11-24T19:37:57Z)
- Distilling Ensemble of Explanations for Weakly-Supervised Pre-Training of Image Segmentation Models [54.49581189337848]
We propose a method to enable end-to-end pre-training of image segmentation models on classification datasets.
The proposed method leverages a weighted segmentation learning procedure to pre-train the segmentation network en masse.
Experiment results show that, with ImageNet accompanied by PSSL as the source dataset, the proposed end-to-end pre-training strategy successfully boosts the performance of various segmentation models.
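A hedged reading of that procedure in code, assuming the distilled explanation maps act as soft per-pixel targets with confidence weights (the names and loss form are illustrative, not the paper's exact formulation):

```python
# Hedged sketch of weighted segmentation pre-training: explanation maps
# distilled from classifiers serve as soft targets, and a per-pixel
# weight down-plays unreliable regions. Purely illustrative.
import torch
import torch.nn.functional as F

def weighted_seg_loss(logits, soft_target, weight):
    # logits, soft_target, weight: (B, 1, H, W); targets/weights in [0, 1]
    bce = F.binary_cross_entropy_with_logits(logits, soft_target,
                                             reduction="none")
    return (weight * bce).sum() / weight.sum().clamp(min=1e-6)
```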
arXiv Detail & Related papers (2022-07-04T13:02:32Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and is on par with it on Pascal Context and Cityscapes.
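The core design admits a compact sketch: learnable class embeddings are processed jointly with the ViT patch tokens, and per-class mask logits fall out of a patch-class dot product. Sizes below are illustrative assumptions:

```python
# Hedged sketch of a Segmenter-style mask decoder: class embeddings are
# appended to ViT patch tokens, refined jointly, and masks come from the
# dot product between patch and class embeddings. Sizes are illustrative.
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    def __init__(self, dim=768, n_cls=150, depth=2, heads=12):
        super().__init__()
        self.cls_emb = nn.Parameter(torch.randn(1, n_cls, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) from a ViT backbone
        b, n, _ = patch_tokens.shape
        x = torch.cat([patch_tokens, self.cls_emb.expand(b, -1, -1)], dim=1)
        x = self.blocks(x)
        patches, classes = x[:, :n], x[:, n:]
        return patches @ classes.transpose(1, 2)   # (B, N, n_cls) mask logits
```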
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
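A minimal sketch of that early fusion, assuming a ResNet-style 2048-channel feature map and 768-d text states projected to a shared width; all dimensions are illustrative, and positional encodings are omitted for brevity:

```python
# Hedged sketch of early text-image fusion in the MDETR spirit: projected
# image features and text token embeddings are concatenated into one
# sequence for a joint transformer encoder. Dimensions are assumptions.
import torch
import torch.nn as nn

dim = 256
img_proj = nn.Conv2d(2048, dim, kernel_size=1)    # CNN feature map -> dim
txt_proj = nn.Linear(768, dim)                    # text encoder states -> dim
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

img_feats = torch.randn(1, 2048, 20, 20)          # backbone output
txt_feats = torch.randn(1, 12, 768)               # text token embeddings
img_seq = img_proj(img_feats).flatten(2).transpose(1, 2)   # (1, 400, dim)
seq = torch.cat([img_seq, txt_proj(txt_feats)], dim=1)     # early fusion
joint = encoder(seq)   # text and image tokens reasoned over jointly
```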
arXiv Detail & Related papers (2021-04-26T17:55:33Z)
- Semantically Meaningful Class Prototype Learning for One-Shot Image Semantic Segmentation [58.96902899546075]
One-shot semantic image segmentation aims to segment the object regions for the novel class with only one annotated image.
Recent works adopt the episodic training strategy to mimic the expected situation at testing time.
We propose to leverage multi-class label information during episodic training, encouraging the network to generate more semantically meaningful features for each category.
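The underlying one-shot matching step is standard enough to sketch: a prototype is masked-average-pooled from the annotated support image and compared to query features by cosine similarity. This is a generic sketch, not this paper's specific multi-class extension:

```python
# Hedged sketch of prototype-based one-shot matching: pool a class
# prototype from support features under the support mask, then score
# query locations by cosine similarity to it. Illustrative only.
import torch
import torch.nn.functional as F

def masked_prototype(support_feats, support_mask):
    # support_feats: (C, H, W); support_mask: (H, W) in {0, 1}
    m = support_mask.unsqueeze(0)                       # (1, H, W)
    return (support_feats * m).sum(dim=(1, 2)) / m.sum().clamp(min=1)

def similarity_map(query_feats, prototype):
    # query_feats: (C, H, W) -> (H, W) cosine similarity to the prototype
    q = F.normalize(query_feats, dim=0)
    p = F.normalize(prototype, dim=0)
    return torch.einsum("chw,c->hw", q, p)
```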
arXiv Detail & Related papers (2021-02-22T12:07:35Z)
- CRNet: Cross-Reference Networks for Few-Shot Segmentation [59.85183776573642]
Few-shot segmentation aims to learn a segmentation model that can be generalized to novel classes with only a few training images.
With a cross-reference mechanism, our network can better find the co-occurring objects in the two images.
Experiments on the PASCAL VOC 2012 dataset show that our network achieves state-of-the-art performance.
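A hedged sketch of a cross-reference-style interaction between two feature maps, where each image gates the other's channels so co-occurring objects are mutually emphasized (illustrative, not CRNet's exact module):

```python
# Hedged sketch: each image's features are reweighted by a channel gate
# computed from the other image, so channels active in both images are
# kept and the rest are suppressed. Purely illustrative.
import torch
import torch.nn.functional as F

def cross_reference(fa, fb):
    # fa, fb: (B, C, H, W) features of the support and query images
    ga = torch.sigmoid(F.adaptive_avg_pool2d(fa, 1))   # (B, C, 1, 1) gate
    gb = torch.sigmoid(F.adaptive_avg_pool2d(fb, 1))
    return fa * gb, fb * ga   # mutually reinforced feature maps
```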
arXiv Detail & Related papers (2020-03-24T04:55:43Z)