VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual
Grounders
- URL: http://arxiv.org/abs/2309.01141v4
- Date: Tue, 23 Jan 2024 15:51:18 GMT
- Title: VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual
Grounders
- Authors: Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang
- Abstract summary: VGDiffZero is a zero-shot visual grounding framework based on text-to-image diffusion models.
We show that VGDiffZero achieves strong performance on zero-shot visual grounding.
- Score: 31.371338262371122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale text-to-image diffusion models have shown impressive capabilities
for generative tasks by leveraging strong vision-language alignment from
pre-training. However, most vision-language discriminative tasks require
extensive fine-tuning on carefully labeled datasets to acquire such alignment,
at great cost in time and computing resources. In this work, we explore
directly applying a pre-trained generative diffusion model to the challenging
discriminative task of visual grounding without any fine-tuning or additional
training data. Specifically, we propose VGDiffZero, a simple yet effective
zero-shot visual grounding framework based on text-to-image diffusion models.
We also design a comprehensive region-scoring method considering both global
and local contexts of each isolated proposal. Extensive experiments on RefCOCO,
RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on
zero-shot visual grounding. Our code is available at
https://github.com/xuyang-liu16/VGDiffZero.
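To make the idea concrete, below is a minimal, hypothetical sketch of proposal scoring with a frozen text-to-image diffusion model, in the spirit of the abstract: each isolated proposal is paired with the referring expression, and the model's noise-prediction error serves as an image-text matching score, combining a cropped (local) and a masked (global) view of the proposal. The sketch assumes Hugging Face diffusers with a Stable Diffusion checkpoint as the backbone; the checkpoint name, the fixed timestep, the fusion weight alpha, and all function names are illustrative assumptions, not the authors' released implementation (see the repository above for that).
```python
# Hypothetical sketch of zero-shot proposal scoring with a frozen diffusion model.
# Not the official VGDiffZero code; see the repository linked above for that.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4"  # example checkpoint; any SD backbone could be used
).to(device)
vae, unet, scheduler = pipe.vae, pipe.unet, pipe.scheduler
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder

@torch.no_grad()
def denoising_error(image, expression, timestep=500, n_draws=4):
    """Noise-prediction MSE for one (preprocessed image tensor, expression) pair.
    `image` is a [1, 3, 512, 512] tensor scaled to [-1, 1]; a lower error means
    the frozen model finds the image and the text better aligned."""
    latents = vae.encode(image.to(device)).latent_dist.mean * vae.config.scaling_factor
    tokens = tokenizer(expression, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt").to(device)
    text_emb = text_encoder(tokens.input_ids)[0]
    t = torch.tensor([timestep], device=device)
    errors = []
    for _ in range(n_draws):  # average over a few noise draws for stability
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        errors.append(torch.mean((pred - noise) ** 2).item())
    return sum(errors) / len(errors)

def ground(expression, cropped_views, masked_views, alpha=0.5):
    """Score every proposal with a local (cropped) and a global (rest-of-image
    masked) view and return the index of the proposal with the lowest combined error."""
    scores = [alpha * denoising_error(c, expression)
              + (1 - alpha) * denoising_error(m, expression)
              for c, m in zip(cropped_views, masked_views)]
    return min(range(len(scores)), key=scores.__getitem__)
```
The proposal with the lowest combined error is taken as the grounding result; in practice one would batch the proposals and average over several timesteps rather than loop one pair at a time.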
Related papers
- Adaptive Masking Enhances Visual Grounding [12.793586888511978]
We propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, to enhance vocabulary grounding in low-shot learning scenarios.
We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks.
arXiv Detail & Related papers (2024-10-04T05:48:02Z)
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities [88.398085358514]
Contrastive Deepfake Embeddings (CoDE) is a novel embedding space specifically designed for deepfake detection.
CoDE is trained via contrastive learning by additionally enforcing global-local similarities.
arXiv Detail & Related papers (2024-07-29T18:00:10Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation [69.72194342962615]
We introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient?
First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch.
Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model.
Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time.
arXiv Detail & Related papers (2024-01-11T18:59:14Z)
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses the visual grounding ability of existing models trained on image-text pairs and pure object detection data.
We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z)
- Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment [47.53887941065894]
Pre-trained vision-language models have inspired much research on few-shot learning.
With only a few training images, the visual feature distributions are easily distracted by class-irrelevant information in images.
We propose a Selective Attack module that generates spatial attention maps of images to guide the attacks on class-irrelevant image areas.
arXiv Detail & Related papers (2023-05-19T05:45:17Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It is beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)