Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models
- URL: http://arxiv.org/abs/2308.16777v2
- Date: Fri, 1 Sep 2023 05:57:47 GMT
- Title: Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models
- Authors: Minheng Ni, Yabo Zhang, Kailai Feng, Xiaoming Li, Yiwen Guo, Wangmeng Zuo
- Abstract summary: We introduce a novel Referring Diffusional segmentor (Ref-Diff) for referring image segmentation.
We demonstrate that without a proposal generator, a generative model alone can achieve comparable performance to existing SOTA weakly-supervised models.
This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation.
- Score: 68.73086826874733
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Zero-shot referring image segmentation is a challenging task because it aims
to find an instance segmentation mask based on the given referring
descriptions, without training on this type of paired data. Current zero-shot
methods mainly focus on using pre-trained discriminative models (e.g., CLIP).
However, we have observed that generative models (e.g., Stable Diffusion) have
potentially understood the relationships between various visual elements and
text descriptions, a capability that is rarely investigated in this task. In
this work, we
introduce a novel Referring Diffusional segmentor (Ref-Diff) for this task,
which leverages the fine-grained multi-modal information from generative
models. We demonstrate that without a proposal generator, a generative model
alone can achieve comparable performance to existing SOTA weakly-supervised
models. When we combine both generative and discriminative models, our Ref-Diff
outperforms these competing methods by a significant margin. This indicates
that generative models are also beneficial for this task and can complement
discriminative models for better referring segmentation. Our code is publicly
available at https://github.com/kodenii/Ref-Diff.
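The abstract does not detail how the two model families are fused, so the following is a minimal sketch of the general idea only, assuming each candidate mask already carries a generative score (e.g., diffusion cross-attention strength between the referring text and the masked region) and a discriminative score (e.g., CLIP region-text similarity); the normalization and the `alpha` weight are illustrative choices, not Ref-Diff's actual formulation.

```python
# Hypothetical fusion of generative and discriminative per-mask scores.
import numpy as np

def fuse_scores(gen_scores: np.ndarray, disc_scores: np.ndarray,
                alpha: float = 0.5) -> int:
    """Return the index of the best candidate mask under a convex
    combination of min-max-normalized generative and discriminative scores."""
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    fused = alpha * norm(gen_scores) + (1.0 - alpha) * norm(disc_scores)
    return int(fused.argmax())

# Toy example: three candidate masks for "the dog on the left".
gen = np.array([0.82, 0.31, 0.45])   # hypothetical diffusion attention scores
disc = np.array([0.64, 0.70, 0.22])  # hypothetical CLIP region-text scores
print(fuse_scores(gen, disc))        # -> 0
```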
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
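As a rough illustration of such a training-free pipeline (the concrete stages below are assumptions inferred from the abstract, not FreeSeg-Diff's exact design): cluster dense features from a frozen foundation model into class-agnostic regions, then label each region against an open vocabulary with a frozen image-text model such as CLIP.

```python
# Stand-in pipeline: frozen dense features -> clusters -> open-vocab labels.
import numpy as np
from sklearn.cluster import KMeans

H, W, D = 16, 16, 8                      # toy feature-map size
feats = np.random.rand(H * W, D)         # stand-in for frozen diffusion features
masks = KMeans(n_clusters=3, n_init=10).fit_predict(feats).reshape(H, W)

def label_mask(mask_feat: np.ndarray, text_feats: np.ndarray) -> int:
    """Pick the vocabulary entry whose (stand-in) text embedding is most
    cosine-similar to the mask's pooled feature."""
    sims = text_feats @ mask_feat / (
        np.linalg.norm(text_feats, axis=1) * np.linalg.norm(mask_feat) + 1e-8)
    return int(sims.argmax())

vocab = ["cat", "grass", "sky"]
text_feats = np.random.rand(len(vocab), D)  # stand-in for CLIP text embeddings
for k in range(3):
    pooled = feats[masks.reshape(-1) == k].mean(axis=0)
    print(k, vocab[label_mask(pooled, text_feats)])
```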
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Generative Multi-modal Models are Good Class-Incremental Learners [51.5648732517187]
We propose a novel generative multi-modal model (GMM) framework for class-incremental learning.
Our approach directly generates labels for images using an adapted generative model.
Under the few-shot CIL setting, it improves accuracy by at least 14% over all current state-of-the-art methods, with significantly less forgetting.
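A minimal sketch of the label-matching step this implies, assuming the generative model emits free-form text and the prediction is the closest known class name; `difflib` is a dependency-free stand-in for the paper's text-feature comparison.

```python
# Match generated free-form labels to the current class list.
import difflib

def predict_class(generated_label: str, class_names: list) -> str:
    scores = [difflib.SequenceMatcher(None, generated_label.lower(),
                                      name.lower()).ratio()
              for name in class_names]
    return class_names[max(range(len(scores)), key=scores.__getitem__)]

classes = ["golden retriever", "tabby cat", "red fox"]
print(predict_class("a photo of a golden dog", classes))  # -> golden retriever
```

Because prediction reduces to matching against the current class list, new classes can be appended without retraining a classifier head, which is why this style of model suits class-incremental learning.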
arXiv Detail & Related papers (2024-03-27T09:21:07Z) - Diffusion Models Beat GANs on Image Classification [37.70821298392606]
Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc.
We find that the intermediate embeddings of these models are useful beyond the noise prediction task: they contain discriminative information and can also be leveraged for classification.
We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods for classification tasks.
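A hedged sketch of that recipe, with the feature extractor reduced to a stand-in (the actual approach pools intermediate denoiser activations at a chosen timestep and block, choices not reproduced here):

```python
# Pool (stand-in) diffusion features, then fit a simple linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(images: np.ndarray) -> np.ndarray:
    """Stand-in for pooling a diffusion U-Net's mid-block activations at a
    fixed noise level; here we just global-average-pool the raw pixels."""
    return images.mean(axis=(2, 3))  # (N, C, H, W) -> (N, C)

rng = np.random.default_rng(0)
images = rng.normal(size=(200, 3, 8, 8))
labels = rng.integers(0, 2, size=200)
probe = LogisticRegression(max_iter=1000).fit(extract_features(images), labels)
print(probe.score(extract_features(images), labels))  # training accuracy
```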
arXiv Detail & Related papers (2023-07-17T17:59:40Z) - DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We instead tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.
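A toy sketch of scoring retrieval generatively, assuming text and video embeddings are already computed; the single fixed noise level and the denoising-error proxy for the joint likelihood are simplifications, not DiffusionRet's training objective.

```python
# Score candidates by how well a denoiser handles the joint representation.
import torch

torch.manual_seed(0)
D = 16
denoiser = torch.nn.Sequential(torch.nn.Linear(2 * D, 64),
                               torch.nn.ReLU(),
                               torch.nn.Linear(64, 2 * D))

def joint_score(query: torch.Tensor, cand: torch.Tensor) -> float:
    """Lower denoising error on the joint vector ~ higher joint likelihood."""
    x0 = torch.cat([query, cand])
    noise = torch.randn_like(x0)
    xt = 0.7 * x0 + 0.3 * noise            # one fixed noise level for brevity
    return -torch.nn.functional.mse_loss(denoiser(xt), noise).item()

query = torch.randn(D)
cands = [torch.randn(D) for _ in range(3)]
best = max(range(3), key=lambda i: joint_score(query, cands[i]))
print("retrieved candidate:", best)
```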
arXiv Detail & Related papers (2023-03-17T10:07:19Z) - MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model
for Few-Shot Instance Segmentation [31.648523213206595]
Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task.
Conventional approaches have attempted to address the task via prototype learning, known as point estimation.
We propose a novel approach, dubbed MaskDiff, which models the underlying conditional distribution of a binary mask.
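A toy sketch of what "modeling the mask distribution" can look like, assuming a single fixed noise level and a stand-in conditioning vector; MaskDiff's actual network and noise schedule are not reproduced here. The contrast with point estimation is that sampling the trained model yields a distribution over plausible masks rather than a single prototype.

```python
# Train a conditional denoiser on noised binary masks (toy, one noise level).
import torch

torch.manual_seed(0)
H = W = 8
denoiser = torch.nn.Sequential(torch.nn.Linear(2 * H * W, 128),
                               torch.nn.ReLU(),
                               torch.nn.Linear(128, H * W))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

mask = (torch.rand(H * W) > 0.5).float()   # toy ground-truth binary mask
cond = torch.randn(H * W)                  # stand-in for image/support features
for step in range(100):
    noise = torch.randn(H * W)
    noisy = 0.6 * mask + 0.8 * noise       # fixed alpha/sigma for brevity
    pred = denoiser(torch.cat([noisy, cond]))
    loss = torch.nn.functional.mse_loss(pred, noise)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```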
arXiv Detail & Related papers (2023-03-09T08:24:02Z) - DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic, ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
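A minimal sketch of such a serialization, with the token template as an assumption; the point is that iterating the matrix in a fixed order yields a deterministic target sequence for the generative model.

```python
# Serialize a relation matrix into an ordered symbolic sequence.
def matrix_to_sequence(entities, rel) -> str:
    parts = []
    for i, head in enumerate(entities):          # fixed row-major order
        for j, tail in enumerate(entities):
            if i != j and rel[i][j] is not None:
                parts.append(f"<{head}> <{rel[i][j]}> <{tail}>")
    return " ; ".join(parts)

ents = ["Marie_Curie", "Poland", "physicist"]
rel = [[None, "born_in", "occupation"],
       [None, None, None],
       [None, None, None]]
print(matrix_to_sequence(ents, rel))
# <Marie_Curie> <born_in> <Poland> ; <Marie_Curie> <occupation> <physicist>
```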
arXiv Detail & Related papers (2022-10-28T11:18:10Z) - Generating Representative Samples for Few-Shot Classification [8.62483598990205]
Few-shot learning aims to learn new categories with a few visual samples per class.
Few-shot class representations are often biased due to data scarcity.
We generate visual samples based on semantic embeddings using a conditional variational autoencoder model.
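A minimal sketch of the sampling step, with an untrained decoder and made-up dimensions as stand-ins: latent noise, conditioned on a class's semantic embedding, is decoded into extra feature samples that are pooled with the few real shots to de-bias the class prototype.

```python
# Conditional-VAE-style sampling to augment a few-shot class prototype.
import torch

torch.manual_seed(0)
Z, S, F = 8, 16, 32                     # latent, semantic, feature dims
decoder = torch.nn.Sequential(torch.nn.Linear(Z + S, 64),
                              torch.nn.ReLU(),
                              torch.nn.Linear(64, F))

def augment_prototype(real_feats: torch.Tensor, sem: torch.Tensor,
                      n_gen: int = 20) -> torch.Tensor:
    z = torch.randn(n_gen, Z)
    gen = decoder(torch.cat([z, sem.expand(n_gen, -1)], dim=1))
    return torch.cat([real_feats, gen]).mean(dim=0)   # pooled prototype

real = torch.randn(3, F)   # 3-shot class features
sem = torch.randn(S)       # class semantic embedding (e.g., a word vector)
print(augment_prototype(real, sem).shape)             # torch.Size([32])
```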
arXiv Detail & Related papers (2022-05-05T20:58:33Z) - Decoupled Multi-task Learning with Cyclical Self-Regulation for Face
Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR) for face parsing.
Specifically, DML-CSR designs a multi-task model that comprises face parsing, binary edge detection, and category edge detection.
Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
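A minimal sketch of the multi-head layout this describes, with a placeholder backbone; DML-CSR's actual architecture and its cyclical self-regulation scheme are not reproduced here.

```python
# Shared encoder with separate parsing, binary-edge, and category-edge heads.
import torch

class MultiTaskFaceParser(torch.nn.Module):
    def __init__(self, n_classes: int = 11):
        super().__init__()
        self.backbone = torch.nn.Conv2d(3, 32, 3, padding=1)  # stand-in encoder
        self.parsing_head = torch.nn.Conv2d(32, n_classes, 1)
        self.binary_edge_head = torch.nn.Conv2d(32, 1, 1)
        self.category_edge_head = torch.nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        f = torch.relu(self.backbone(x))
        return (self.parsing_head(f),
                self.binary_edge_head(f),
                self.category_edge_head(f))

out = MultiTaskFaceParser()(torch.randn(1, 3, 64, 64))
print([o.shape for o in out])
```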
arXiv Detail & Related papers (2022-03-28T02:12:30Z) - Generative Models as a Data Source for Multiview Representation Learning [38.56447220165002]
Generative models are capable of producing realistic images that look nearly indistinguishable from the data on which they are trained.
This raises the question: if we have good enough generative models, do we still need datasets?
We investigate this question in the setting of learning general-purpose visual representations from a black-box generative model.
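A toy sketch of one way to mine multiview positives from a black-box generator, assuming nearby latents map to views of the same underlying content; the linear "generator" and the perturbation scale are illustrative stand-ins, not the paper's setup.

```python
# Sample latent-perturbed view pairs from a frozen black-box generator.
import torch

torch.manual_seed(0)
Z, D = 16, 32
generator = torch.nn.Linear(Z, D)        # stand-in for a frozen generator

def positive_pair(sigma: float = 0.1):
    z = torch.randn(Z)
    v1 = generator(z)
    v2 = generator(z + sigma * torch.randn(Z))  # nearby latent, same content
    return v1, v2

v1, v2 = positive_pair()
print(float(torch.nn.functional.cosine_similarity(v1, v2, dim=0)))
```

Such pairs can feed any standard contrastive objective, turning the generator itself into the training "dataset".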
arXiv Detail & Related papers (2021-06-09T17:54:55Z)