InstanceDiffusion: Instance-level Control for Image Generation
- URL: http://arxiv.org/abs/2402.03290v1
- Date: Mon, 5 Feb 2024 18:49:17 GMT
- Title: InstanceDiffusion: Instance-level Control for Image Generation
- Authors: Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar,
Ishan Misra
- Abstract summary: InstanceDiffusion adds precise instance-level control to text-to-image diffusion models.
We propose three major changes to text-to-image models that enable precise instance-level control.
- Score: 89.31908006870422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image diffusion models produce high quality images but do not offer
control over individual instances in the image. We introduce InstanceDiffusion
that adds precise instance-level control to text-to-image diffusion models.
InstanceDiffusion supports free-form language conditions per instance and
allows flexible ways to specify instance locations such as simple single
points, scribbles, bounding boxes or intricate instance segmentation masks, and
combinations thereof. We propose three major changes to text-to-image models
that enable precise instance-level control. Our UniFusion block enables
instance-level conditions for text-to-image models, the ScaleU block improves
image fidelity, and our Multi-instance Sampler improves generations for
multiple instances. InstanceDiffusion significantly surpasses specialized
state-of-the-art models for each location condition. Notably, on the COCO
dataset, we outperform previous state-of-the-art by 20.4% AP$_{50}^\text{box}$
for box inputs, and 25.4% IoU for mask inputs.
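The abstract describes per-instance conditioning as a free-form caption paired with a location given as a point, scribble, box, or segmentation mask. The sketch below is not the authors' code and uses purely hypothetical names; it only illustrates one plausible way such per-instance conditions could be represented and normalized before being passed to an instance-conditioned diffusion model.

```python
# Minimal sketch (hypothetical names, not the InstanceDiffusion API) of how
# per-instance conditions -- a caption plus a point, box, or mask location --
# might be structured and normalized for an instance-conditioned model.
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class InstanceCondition:
    caption: str  # free-form text describing this instance
    box: Optional[Tuple[float, float, float, float]] = None  # (x0, y0, x1, y1) in pixels
    point: Optional[Tuple[float, float]] = None              # single (x, y) click
    mask_polygon: Optional[Sequence[Tuple[float, float]]] = None  # scribble / mask outline


def normalize(instances: List[InstanceCondition], width: int, height: int) -> List[dict]:
    """Scale pixel coordinates into [0, 1] so the conditioning is resolution-agnostic."""
    payload = []
    for inst in instances:
        entry = {"caption": inst.caption}
        if inst.box is not None:
            x0, y0, x1, y1 = inst.box
            entry["box"] = (x0 / width, y0 / height, x1 / width, y1 / height)
        if inst.point is not None:
            x, y = inst.point
            entry["point"] = (x / width, y / height)
        if inst.mask_polygon is not None:
            entry["mask_polygon"] = [(x / width, y / height) for x, y in inst.mask_polygon]
        payload.append(entry)
    return payload


# Example: two instances, one located by a box and one by a single point.
conditions = normalize(
    [
        InstanceCondition("a corgi wearing sunglasses", box=(40, 200, 300, 480)),
        InstanceCondition("a red frisbee", point=(420, 260)),
    ],
    width=512,
    height=512,
)
print(conditions)
```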
Related papers
- A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks.
Our approach enables versatile capabilities via different inference-time sampling schemes.
Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z) - FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z) - FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Self-Supervised Open-Ended Classification with Small Visual Language Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z) - MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation [104.03166324080917]
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation.
Our method is training-free and does not rely on any label supervision.
Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models.
arXiv Detail & Related papers (2023-09-22T17:59:42Z) - DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models [10.744438740060458]
We aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual description.
We thus design a multimodal T2I diffusion model, coined as DiffBlender, by separating the channels of conditions into three types.
The unique architecture of DiffBlender facilitates adding new input modalities, pioneering a scalable framework for conditional image generation.
arXiv Detail & Related papers (2023-05-24T14:31:20Z) - LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation [46.567682868550285]
We propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works.
In this paper, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.
Our experiments show that LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by relative margins of 46.35% and 26.70% on COCO-stuff, and 44.29% and 41.82% on VG.
arXiv Detail & Related papers (2023-03-30T06:56:12Z) - Few-Shot Diffusion Models [15.828257653106537]
We present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs.
FSDM is trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information.
We empirically show that FSDM can perform few-shot generation and transfer to new datasets.
arXiv Detail & Related papers (2022-05-30T23:20:33Z) - DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models [33.79188588182528]
We present a novel DiffusionCLIP which performs text-driven image manipulation with diffusion models using Contrastive Language-Image Pre-training (CLIP) loss.
Our method achieves performance comparable to modern GAN-based image processing methods on both in-domain and out-of-domain image processing tasks.
Our method can be easily used for various novel applications, enabling image translation from an unseen domain to another unseen domain or stroke-conditioned image generation in an unseen domain.
arXiv Detail & Related papers (2021-10-06T12:59:39Z)