PAI-Diffusion: Constructing and Serving a Family of Open Chinese
Diffusion Models for Text-to-image Synthesis on the Cloud
- URL: http://arxiv.org/abs/2309.05534v1
- Date: Mon, 11 Sep 2023 15:18:28 GMT
- Authors: Chengyu Wang, Zhongjie Duan, Bingyan Liu, Xinyi Zou, Cen Chen, Kui
Jia, Jun Huang
- Abstract summary: This paper introduces PAI-Diffusion, a comprehensive framework for Chinese text-to-image synthesis.
It incorporates both general and domain-specific Chinese diffusion models, enabling the generation of contextually relevant images.
It seamlessly integrates with Alibaba Cloud's Machine Learning Platform for AI, providing accessible and scalable solutions.
- Score: 54.046884854230555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis for the Chinese language poses unique challenges due
to its large vocabulary size and intricate character relationships. While
existing diffusion models have shown promise in generating images from textual
descriptions, they often neglect domain-specific contexts and lack robustness
in handling the Chinese language. This paper introduces PAI-Diffusion, a
comprehensive framework that addresses these limitations. PAI-Diffusion
incorporates both general and domain-specific Chinese diffusion models,
enabling the generation of contextually relevant images. It explores the
potential of using LoRA and ControlNet for fine-grained image style transfer
and image editing, empowering users with enhanced control over image
generation. Moreover, PAI-Diffusion seamlessly integrates with Alibaba Cloud's
Machine Learning Platform for AI, providing accessible and scalable solutions.
All the Chinese diffusion model checkpoints, LoRAs, and ControlNets, including
domain-specific ones, are publicly available. A user-friendly Chinese WebUI and
the diffusers-api elastic inference toolkit, also open-sourced, further
facilitate the easy deployment of PAI-Diffusion models in various environments,
making it a valuable resource for Chinese text-to-image synthesis.
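Since the released checkpoints are described as working with the open-sourced diffusers-api toolkit, a standard Hugging Face diffusers workflow should apply. The following is a minimal sketch under that assumption; the repository IDs and the LoRA weight name are illustrative placeholders, not names confirmed by the paper.

```python
# Minimal sketch (assumed workflow, not the paper's official example):
# loading a Chinese diffusion checkpoint with Hugging Face diffusers and
# optionally attaching a LoRA for fine-grained style transfer.
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical repository ID for a general-purpose Chinese checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "alibaba-pai/pai-diffusion-general-large-zh",
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical LoRA weights; the loading call itself is standard diffusers API.
pipe.load_lora_weights("alibaba-pai/pai-diffusion-style-lora-zh")

# Chinese prompt: "an ink-wash landscape painting, mountains and flowing water"
image = pipe("水墨山水画，高山流水", num_inference_steps=50).images[0]
image.save("landscape.png")
```

ControlNet-guided editing would follow the same pattern via diffusers' StableDiffusionControlNetPipeline, with the Chinese checkpoint as the base model and a ControlNet checkpoint supplying the spatial condition.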
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on an additional reference image that provides visual guidance for the particular subjects to be generated.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins achieve superior results to existing methods on all tasks, each containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z) - Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z) - AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the Translate AnyText in the Image (TATI) task.
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z) - Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support [35.17427411750043]
We present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model.
We extend the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training.
Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval.
arXiv Detail & Related papers (2024-01-26T07:17:50Z) - ZRIGF: An Innovative Multimodal Framework for Zero-Resource
Image-Grounded Dialogue Generation [17.310200022696016]
ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training.
Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses.
arXiv Detail & Related papers (2023-08-01T09:28:36Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal
Image Generation [21.455774034659978]
MultiFusion allows one to express complex concepts with arbitrarily interleaved inputs of multiple modalities and languages.
MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system.
arXiv Detail & Related papers (2023-05-24T16:22:18Z) - Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with
Images as Pivots [80.32906566894171]
We propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese.
IAP efficiently establishes connections among Chinese, English, and visual semantics in CLIP's embedding space.
Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%-10% of the training data (a toy sketch of the images-as-pivots idea appears after this list).
arXiv Detail & Related papers (2023-05-19T09:20:27Z) - SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.