VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2412.11594v3
- Date: Sat, 28 Dec 2024 04:59:45 GMT
- Title: VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
- Authors: Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li, Yi-Zhe Song
- Abstract summary: We present VersaGen, a generative AI agent that enables versatile visual control in text-to-image (T2I) synthesis.
We train an adaptor on a frozen T2I model to incorporate the visual information into the text-dominated diffusion process.
- Score: 59.12590059101254
- Abstract: Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works have attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above, or no control at all. We train an adaptor on a frozen T2I model to incorporate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.
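The abstract does not detail the adaptor architecture. As a rough illustration of the stated idea (a trainable adaptor feeding visual information into a frozen, text-conditioned diffusion model), the following diffusers-style PyTorch sketch encodes a sketch image into extra conditioning tokens and trains only the adaptor with the usual denoising loss; all module names, shapes, and the token-concatenation scheme are assumptions, not VersaGen's actual design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchAdaptor(nn.Module):
    """Illustrative adaptor: turns a sketch into conditioning tokens that are
    appended to the frozen text encoder's output before cross-attention."""
    def __init__(self, in_channels=1, embed_dim=768):
        super().__init__()
        self.encoder = nn.Sequential(               # tiny conv encoder for the sketch
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(8),                 # -> (B, 128, 8, 8)
        )
        self.proj = nn.Linear(128, embed_dim)        # one token per spatial cell

    def forward(self, sketch):
        feats = self.encoder(sketch)                 # (B, 128, 8, 8)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, 64, 128)
        return self.proj(tokens)                     # (B, 64, embed_dim)

def training_step(frozen_unet, frozen_text_enc, adaptor, latents, sketch, prompt_ids, scheduler):
    """Standard epsilon-prediction loss; only `adaptor` parameters receive gradients."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    with torch.no_grad():
        text_emb = frozen_text_enc(prompt_ids)[0]            # frozen text conditioning
    cond = torch.cat([text_emb, adaptor(sketch)], dim=1)     # text tokens + visual tokens
    pred = frozen_unet(noisy, t, encoder_hidden_states=cond).sample
    return F.mse_loss(pred, noise)
```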
Related papers
- I2VControl: Disentangled and Unified Video Motion Synthesis Control [11.83645633418189]
We present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis.
Our approach partitions the video into individual motion units and represents each unit with disentangled control signals.
Our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures.
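As a purely illustrative aside, per-unit disentangled control signals of this kind could be organized as a simple container like the sketch below; none of the field names or shapes come from the paper.
```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class MotionUnit:
    """One independently controlled region of the video (fields are illustrative)."""
    mask: np.ndarray        # (H, W) binary mask selecting the unit's pixels in the input image
    trajectory: np.ndarray  # (T, 2) per-frame displacement of the unit over T frames

@dataclass
class I2VControlSignal:
    """Disentangled per-unit controls plus an optional global camera term,
    consumed by a plug-in conditioning module attached to a frozen video model."""
    units: List[MotionUnit] = field(default_factory=list)
    camera_motion: Optional[np.ndarray] = None  # (T, 6) optional global camera parameters

    def add_unit(self, mask: np.ndarray, trajectory: np.ndarray) -> None:
        self.units.append(MotionUnit(mask, trajectory))
```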
arXiv Detail & Related papers (2024-11-26T04:21:22Z)
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
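The summary only mentions a unified multi-modal embedding; one plausible (assumed, not AnyControl's actual) way to fuse several per-modality control embeddings into a fixed-length conditioning sequence is a small query-based transformer, sketched here in PyTorch:
```python
import torch
import torch.nn as nn

class MultiControlFuser(nn.Module):
    """Fuses embeddings from several control encoders (edge, depth, pose, ...)
    into one conditioning sequence via learnable query tokens. Illustrative only."""
    def __init__(self, embed_dim=768, num_queries=64, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, embed_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, control_embeddings):
        # control_embeddings: list of (B, N_i, embed_dim) tensors, one per modality
        memory = torch.cat(control_embeddings, dim=1)        # (B, sum N_i, embed_dim)
        queries = self.queries.expand(memory.shape[0], -1, -1)
        return self.decoder(queries, memory)                 # (B, num_queries, embed_dim)
```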
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- FSViewFusion: Few-Shots View Generation of Novel Objects [75.81872204650807]
We leverage a pretrained Stable Diffusion model for view synthesis without explicit 3D priors.
Specifically, we base our method on a personalized text-to-image model, DreamBooth, given its strong ability to adapt to specific novel objects with a few shots.
We establish that the concept of a view can be disentangled and transferred to a novel object, irrespective of the identity of the original object from which the views are learnt.
arXiv Detail & Related papers (2024-03-11T02:59:30Z)
- Collaborative Score Distillation for Consistent Visual Synthesis [70.29294250371312]
Collaborative Score Distillation (CSD) is based on Stein Variational Gradient Descent (SVGD).
We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes.
Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
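For reference, the generic SVGD update that CSD builds on moves a set of samples jointly along the score while a kernel term keeps them spread out and mutually consistent; the sketch below is textbook SVGD with an RBF kernel, not CSD's exact objective.
```python
import math
import torch

def svgd_update(x, score_fn, step_size=1e-2):
    """One generic SVGD step on a set of particles x: (n, d).
    score_fn(x) should return grad_x log p(x); in a CSD-like setting that score
    would come from a frozen text-to-image diffusion model."""
    n = x.shape[0]
    score = score_fn(x)                                     # (n, d)
    diff = x.unsqueeze(1) - x.unsqueeze(0)                  # (n, n, d), diff[j, i] = x_j - x_i
    sq_dist = (diff ** 2).sum(-1)                           # (n, n)
    h = sq_dist.median() / max(math.log(n + 1), 1e-8)       # median bandwidth heuristic
    k = torch.exp(-sq_dist / (h + 1e-8))                    # RBF kernel k(x_j, x_i)
    grad_k = -(2.0 / (h + 1e-8)) * k.unsqueeze(-1) * diff   # grad_{x_j} k(x_j, x_i)
    phi = (k @ score + grad_k.sum(dim=0)) / n               # driving term + repulsive term
    return x + step_size * phi
```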
arXiv Detail & Related papers (2023-07-04T17:31:50Z)
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)
- Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold [79.94300820221996]
DragGAN is a new way of controlling generative adversarial networks (GANs).
DragGAN allows anyone to deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc.
Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking.
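Point tracking in DragGAN-style editing can be illustrated by a simple nearest-neighbor search over generator features around the previous handle location; the snippet below is a generic sketch of that idea, not the paper's implementation.
```python
import torch

def track_point(feat, prev_feat_vec, prev_xy, radius=3):
    """Generic nearest-neighbor point tracking in a feature map.
    feat: (C, H, W) current feature map; prev_feat_vec: (C,) feature at the handle
    before the edit; prev_xy: previous integer (x, y). Returns the updated (x, y)."""
    C, H, W = feat.shape
    x0, y0 = prev_xy
    best, best_xy = float("inf"), prev_xy
    for y in range(max(0, y0 - radius), min(H, y0 + radius + 1)):
        for x in range(max(0, x0 - radius), min(W, x0 + radius + 1)):
            d = torch.norm(feat[:, y, x] - prev_feat_vec).item()  # feature distance
            if d < best:
                best, best_xy = d, (x, y)
    return best_xy
```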
arXiv Detail & Related papers (2023-05-18T13:41:25Z)
- Text-driven Visual Synthesis with Latent Diffusion Prior [37.736313030226654]
We present a generic approach using latent diffusion models as powerful image priors for various visual synthesis tasks.
We demonstrate the efficacy of our approach on three different applications: text-to-3D, StyleGAN adaptation, and layered image editing.
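A latent diffusion prior is typically distilled into such downstream tasks with a score-distillation-style gradient: noise the current latents, query the frozen model with and without the prompt, and push the latents toward the guided prediction. A hedged diffusers-style sketch (the timestep range, weighting, and function names are assumptions, not the paper's exact objective):
```python
import torch

def score_distillation_grad(unet, scheduler, latents, cond_emb, uncond_emb, guidance_scale=7.5):
    """One score-distillation-style gradient from a frozen latent diffusion prior.
    latents: (B, 4, h, w) differentiable latents of the current render/edit."""
    t = torch.randint(50, scheduler.config.num_train_timesteps - 50,
                      (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    with torch.no_grad():
        eps_c = unet(noisy, t, encoder_hidden_states=cond_emb).sample    # conditional
        eps_u = unet(noisy, t, encoder_hidden_states=uncond_emb).sample  # unconditional
    eps = eps_u + guidance_scale * (eps_c - eps_u)   # classifier-free guidance
    # Gradient direction pushed back onto the latents; timestep weighting omitted here.
    return eps - noise
```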
arXiv Detail & Related papers (2023-02-16T18:59:58Z)