Unleashing Text-to-Image Diffusion Models for Visual Perception
- URL: http://arxiv.org/abs/2303.02153v1
- Date: Fri, 3 Mar 2023 18:59:47 GMT
- Title: Unleashing Text-to-Image Diffusion Models for Visual Perception
- Authors: Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu
- Abstract summary: VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks with the proposed VPD.
- Score: 84.41514649568094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models (DMs) have become the new trend of generative models and
have demonstrated a powerful ability of conditional synthesis. Among those,
text-to-image diffusion models pre-trained on large-scale image-text pairs are
highly controllable by customizable prompts. Unlike the unconditional
generative models that focus on low-level attributes and details, text-to-image
diffusion models contain more high-level knowledge thanks to the
vision-language pre-training. In this paper, we propose VPD (Visual Perception
with a pre-trained Diffusion model), a new framework that exploits the semantic
information of a pre-trained text-to-image diffusion model in visual perception
tasks. Instead of using the pre-trained denoising autoencoder in a
diffusion-based pipeline, we simply use it as a backbone and aim to study how
to take full advantage of the learned knowledge. Specifically, we prompt the
denoising decoder with proper textual inputs and refine the text features with
an adapter, leading to a better alignment to the pre-trained stage and making
the visual contents interact with the text prompts. We also propose to utilize
the cross-attention maps between the visual features and the text features to
provide explicit guidance. Compared with other pre-training methods, we show
that vision-language pre-trained diffusion models can be adapted faster to
downstream visual perception tasks using the proposed VPD. Extensive
experiments on semantic segmentation, referring image segmentation and depth
estimation demonstrate the effectiveness of our method. Notably, VPD attains
0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring
image segmentation, establishing new records on these two benchmarks. Code is
available at https://github.com/wl-zhao/VPD
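As a rough illustration of the pipeline described in the abstract, the sketch below treats the frozen VAE encoder, the CLIP-style text encoder, and the denoising UNet as given modules; all names, dimensions, the fixed timestep, and the UNet output format are assumptions made for the sketch, not the authors' implementation.
```python
import torch
import torch.nn as nn

class TextAdapter(nn.Module):
    """Small residual MLP that refines the text features before conditioning (per the abstract)."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb):
        # Residual refinement keeps the prompts close to the pre-training distribution.
        return text_emb + self.mlp(text_emb)

class VPDSketch(nn.Module):
    """Treat the denoising UNet as a feature backbone for a perception head (illustrative only)."""
    def __init__(self, vae_encoder, text_encoder, unet, task_head, dim=768):
        super().__init__()
        self.vae_encoder = vae_encoder    # image -> latent (e.g. a frozen VAE encoder); assumed callable
        self.text_encoder = text_encoder  # prompts -> embeddings (e.g. a CLIP text encoder); assumed callable
        self.unet = unet                  # assumed to return (multi-scale features, cross-attention maps)
        self.adapter = TextAdapter(dim)
        self.task_head = task_head        # segmentation / referring / depth decoder

    def forward(self, image, prompts):
        latents = self.vae_encoder(image)                    # no noise is added: a single forward pass
        text_emb = self.adapter(self.text_encoder(prompts))  # refined textual conditioning
        t = torch.zeros(latents.shape[0], dtype=torch.long, device=latents.device)  # fixed timestep (assumption)
        feats, attn_maps = self.unet(latents, t, encoder_hidden_states=text_emb)
        # The cross-attention maps provide explicit text-to-region guidance to the head.
        return self.task_head(feats, attn_maps)
```
The point of the sketch is the data flow only: a single denoising pass extracts features, the adapter aligns the prompts with the pre-training stage, and the cross-attention maps are handed to the task head as explicit guidance.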
Related papers
- Implicit and Explicit Language Guidance for Diffusion-based Visual Perception [42.71751651417168]
Text-to-image diffusion models can generate high-quality images with rich texture and reasonable structure under different text prompts.
We propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP.
Our IEDP achieves promising performance on two typical perception tasks: semantic segmentation and depth estimation.
arXiv Detail & Related papers (2024-04-11T09:39:58Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components designed on top of a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning that can alleviate the overfitting issue in few-shot learning applications.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract proper features for perception (a minimal sketch of this idea follows this list).
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
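A minimal sketch of the meta-prompt idea from the "Harnessing Diffusion Models for Visual Perception with Meta Prompts" entry above (names, prompt count, and dimensions are illustrative assumptions, not that paper's code): learnable embeddings stand in for text conditioning and are trained on the perception task while the diffusion backbone stays frozen.
```python
import torch
import torch.nn as nn

class MetaPrompts(nn.Module):
    """Learnable prompt embeddings used as the conditioning of a frozen diffusion UNet (illustrative)."""
    def __init__(self, num_prompts=64, dim=768):
        super().__init__()
        # Trained with the downstream perception loss; the diffusion backbone stays frozen.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, batch_size):
        # Broadcast the shared prompts over the batch; they play the role of
        # encoder_hidden_states in the UNet's cross-attention layers.
        return self.prompts.unsqueeze(0).expand(batch_size, -1, -1)
```
In this sketch the returned tensor would be fed to the UNet in place of text embeddings, and only the prompts (plus the task head) would receive gradients.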