VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2412.11594v3
- Date: Sat, 28 Dec 2024 04:59:45 GMT
- Title: VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
- Authors: Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li, Yi-Zhe Song
- Abstract summary: We present VersaGen, a generative AI agent that enables versatile visual control in text-to-image (T2I) synthesis. We train an adaptor on top of a frozen T2I model to incorporate the visual information into the text-dominated diffusion process.
- Score: 59.12590059101254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works have attempted to incorporate multi-facet controls (text and sketch), aiming to enhance creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above, or no control at all. We train an adaptor on top of a frozen T2I model to incorporate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.
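The abstract describes the core mechanism only at a high level: a trainable adaptor sits on top of a frozen T2I backbone and feeds visual-control information into the text-dominated diffusion process. Below is a minimal, illustrative PyTorch sketch of that general pattern (not the VersaGen implementation); all module names, dimensions, and the toy noise schedule are assumptions made for illustration.

```python
# Minimal illustrative sketch (not the VersaGen code): train a small adaptor on
# top of a frozen text-conditioned denoiser so that visual-control features are
# injected into the diffusion process as extra conditioning tokens.
# All names, dimensions, and the toy noise schedule below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAdaptor(nn.Module):
    """Projects a visual-control feature vector into a few pseudo text tokens."""

    def __init__(self, visual_dim=512, text_dim=768, n_tokens=8):
        super().__init__()
        self.n_tokens, self.text_dim = n_tokens, text_dim
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim * n_tokens),
        )

    def forward(self, visual_feat):
        # (B, visual_dim) -> (B, n_tokens, text_dim)
        return self.proj(visual_feat).view(-1, self.n_tokens, self.text_dim)


class ToyDenoiser(nn.Module):
    """Stand-in for a frozen latent-diffusion UNet that conditions on `cond` tokens."""

    def __init__(self, channels=4, text_dim=768):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, t, cond):
        # Naive conditioning: modulate latents by the pooled condition tokens.
        s = self.to_scale(cond.mean(dim=1))[:, :, None, None]
        return self.conv(x * (1 + s))


def training_step(denoiser, adaptor, latents, text_emb, visual_feat, alpha_bars):
    """One denoising-loss step; only the adaptor receives gradient updates."""
    b = latents.size(0)
    t = torch.randint(0, alpha_bars.numel(), (b,), device=latents.device)
    noise = torch.randn_like(latents)
    a = alpha_bars[t].view(b, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise        # DDPM forward process
    cond = torch.cat([text_emb, adaptor(visual_feat)], dim=1)  # text + visual tokens
    pred = denoiser(noisy, t, cond)                            # frozen backbone
    return F.mse_loss(pred, noise)


if __name__ == "__main__":
    denoiser = ToyDenoiser()
    for p in denoiser.parameters():
        p.requires_grad_(False)                 # keep the base T2I model frozen
    adaptor = VisualAdaptor()
    opt = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)
    alpha_bars = torch.linspace(0.9999, 0.98, 1000).cumprod(dim=0)  # toy schedule
    latents, text_emb = torch.randn(2, 4, 32, 32), torch.randn(2, 77, 768)
    visual_feat = torch.randn(2, 512)           # e.g. an encoded sketch of a subject
    loss = training_step(denoiser, adaptor, latents, text_emb, visual_feat, alpha_bars)
    loss.backward()
    opt.step()
    print(f"adaptor training loss: {loss.item():.4f}")
```

In the actual system, the frozen backbone would be a pretrained latent-diffusion model and the visual features would come from the sketched subjects and scene background described in the paper; the sketch above only shows where a trainable adaptor sits relative to the frozen weights.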
Related papers
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
DynamiCtrl is a novel framework that explores different pose-guided structures in MM-DiT.
We propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features.
By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion.
arXiv Detail & Related papers (2025-03-27T08:07:45Z) - Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage [34.72900198337818]
We introduce a new disentangled and controllable human synthesis task.
We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement.
We propose a stage-by-stage framework that decomposes human image generation into three sequential steps.
arXiv Detail & Related papers (2025-03-25T09:23:20Z) - AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation but struggles to provide fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z) - FSViewFusion: Few-Shots View Generation of Novel Objects [75.81872204650807]
We leverage a pretrained stable diffusion model for view synthesis without explicit 3D priors.
Specifically, we base our method on a personalized text to image model, Dreambooth, given its strong ability to adapt to specific novel objects with a few shots.
We establish that the concept of a view can be disentangled and transferred to a novel object irrespective of the identity of the original object from which the views are learnt.
arXiv Detail & Related papers (2024-03-11T02:59:30Z) - Collaborative Score Distillation for Consistent Visual Synthesis [70.29294250371312]
Collaborative Score Distillation (CSD) is based on Stein Variational Gradient Descent (SVGD).
We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes.
Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
arXiv Detail & Related papers (2023-07-04T17:31:50Z) - UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z) - Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold [79.94300820221996]
DragGAN is a new way of controlling generative adversarial networks (GANs).
DragGAN allows anyone to deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc.
Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking.
arXiv Detail & Related papers (2023-05-18T13:41:25Z) - Text-driven Visual Synthesis with Latent Diffusion Prior [37.736313030226654]
We present a generic approach using latent diffusion models as powerful image priors for various visual synthesis tasks.
We demonstrate the efficacy of our approach on three different applications: text-to-3D, StyleGAN adaptation, and layered image editing.
arXiv Detail & Related papers (2023-02-16T18:59:58Z)