Harnessing Diffusion Models for Visual Perception with Meta Prompts
- URL: http://arxiv.org/abs/2312.14733v1
- Date: Fri, 22 Dec 2023 14:40:55 GMT
- Title: Harnessing Diffusion Models for Visual Perception with Meta Prompts
- Authors: Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang
- Abstract summary: We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
- Score: 68.78938846041767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The issue of generative pretraining for vision models has persisted as a
long-standing conundrum. At present, the text-to-image (T2I) diffusion model
demonstrates remarkable proficiency in generating high-definition images
matching textual inputs, a feat made possible through its pre-training on
large-scale image-text pairs. This leads to a natural inquiry: can diffusion
models be utilized to tackle visual perception tasks? In this paper, we propose
a simple yet effective scheme to harness a diffusion model for visual
perception tasks. Our key insight is to introduce learnable embeddings (meta
prompts) to the pre-trained diffusion models to extract proper features for
perception. The effect of meta prompts is two-fold. First, as a direct
replacement for the text embeddings in the T2I models, they activate
task-relevant features during feature extraction. Second, they are used to
re-arrange the extracted features, ensuring that the model focuses on the most
pertinent features for the task at hand. Additionally, we design a recurrent
refinement training strategy that fully leverages the property of diffusion
models, thereby yielding stronger visual features. Extensive experiments across
various benchmarks validate the effectiveness of our approach. Our approach
achieves new performance records in depth estimation on NYU Depth V2 and
KITTI, and in semantic segmentation on CityScapes. Concurrently, the
proposed method attains results comparable to the current state-of-the-art in
semantic segmentation on ADE20K and pose estimation on COCO, further
exemplifying its robustness and versatility.
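The abstract describes a conditioning-and-refinement mechanism: learnable meta prompts replace the text embeddings that condition the diffusion model, re-arrange the extracted features, and are applied within a recurrent refinement loop. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the UNet cross-attention block is a toy stand-in, and all module names, shapes, and hyperparameters (MetaPromptPerception, CrossAttnBlock, num_prompts, num_refine_steps) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of meta-prompt conditioning for perception.
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """Toy stand-in for one UNet cross-attention block (image queries, prompt keys/values)."""
    def __init__(self, dim: int, ctx_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(ctx_dim, dim)
        self.to_v = nn.Linear(ctx_dim, dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.to_q(x.flatten(2).transpose(1, 2))            # (B, HW, C)
        k, v = self.to_k(ctx), self.to_v(ctx)                   # (B, N, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + self.proj(out)

class MetaPromptPerception(nn.Module):
    def __init__(self, num_prompts: int = 64, dim: int = 256, ctx_dim: int = 768,
                 num_refine_steps: int = 3):
        super().__init__()
        # Learnable meta prompts: a drop-in replacement for the T2I text embeddings.
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, ctx_dim) * 0.02)
        self.stem = nn.Conv2d(4, dim, 3, padding=1)             # latent -> features
        self.block = CrossAttnBlock(dim, ctx_dim)
        self.prompt_to_feat = nn.Linear(ctx_dim, dim)
        self.num_refine_steps = num_refine_steps
        self.head = nn.Conv2d(num_prompts, 1, 1)                # e.g. a depth head

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        b = latent.size(0)
        ctx = self.meta_prompts.unsqueeze(0).expand(b, -1, -1)  # prompts condition every image
        feats = self.stem(latent)
        # Simplified analogue of recurrent refinement: apply the same block
        # several times so the features are progressively refined.
        for _ in range(self.num_refine_steps):
            feats = self.block(feats, ctx)
        # Re-arrangement: project features onto the meta prompts so each output
        # channel corresponds to one task-relevant prompt.
        p = self.prompt_to_feat(ctx)                            # (B, N, dim)
        rearranged = torch.einsum("bnd,bdhw->bnhw", p, feats)   # (B, N, H, W)
        return self.head(rearranged)                            # (B, 1, H, W)

if __name__ == "__main__":
    model = MetaPromptPerception()
    pred = model(torch.randn(2, 4, 32, 32))                     # fake SD-style latents
    print(pred.shape)                                           # torch.Size([2, 1, 32, 32])
```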
Related papers
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR)
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z) - What Matters When Repurposing Diffusion Models for General Dense Perception Tasks? [49.84679952948808]
Recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks.
We conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors.
Our work culminates in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks.
arXiv Detail & Related papers (2024-03-10T04:23:24Z)
- Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z)
- Generative Model-based Feature Knowledge Distillation for Action Recognition [11.31068233536815]
Our paper introduces an innovative knowledge distillation framework that uses a generative model to train a lightweight student model.
The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets.
arXiv Detail & Related papers (2023-12-14T03:55:29Z)
- VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel visual-semantic fusion enhanced and Transformer GRU-based action anticipation framework.
arXiv Detail & Related papers (2023-07-08T06:49:54Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)