AI Illustrator: Translating Raw Descriptions into Images by Prompt-based
Cross-Modal Generation
- URL: http://arxiv.org/abs/2209.03160v2
- Date: Thu, 8 Sep 2022 04:24:35 GMT
- Title: AI Illustrator: Translating Raw Descriptions into Images by Prompt-based
Cross-Modal Generation
- Authors: Yiyang Ma, Huan Yang, Bei Liu, Jianlong Fu, Jiaying Liu
- Abstract summary: We propose a framework for translating raw descriptions with complex semantics into semantically corresponding images.
Our framework consists of two components: a projection module from Text Embeddings to Image Embeddings based on prompts, and an adapted image generation module built on StyleGAN.
Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training.
- Score: 61.77946020543875
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: AI illustrator aims to automatically design visually appealing images for
books to provoke rich thoughts and emotions. To achieve this goal, we propose a
framework for translating raw descriptions with complex semantics into
semantically corresponding images. The main challenge lies in the complexity of
the semantics of raw descriptions, which can be hard to visualize (e.g.,
"gloomy" or "Asian") and which existing methods typically struggle to handle.
To address this issue, we propose a Prompt-based
Cross-Modal Generation Framework (PCM-Frame) to leverage two powerful
pre-trained models, including CLIP and StyleGAN. Our framework consists of two
components: a projection module from Text Embeddings to Image Embeddings based
on prompts, and an adapted image generation module built on StyleGAN which
takes Image Embeddings as inputs and is trained by combined semantic
consistency losses. To bridge the gap between realistic images and illustration
designs, we further adopt a stylization model as post-processing in our
framework for better visual effects. Benefiting from the pre-trained models,
our method can handle complex descriptions and does not require external paired
data for training. Furthermore, we have built a benchmark that consists of 200
raw descriptions. We conduct a user study to demonstrate the superiority of our
method over competing methods on complicated texts. We release our code at
https://github.com/researchmm/AI_Illustrator.
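As an illustration of the pipeline described in the abstract, below is a minimal PyTorch sketch of a simplified stand-in for the projection step: an MLP maps a CLIP-style text embedding into the image-embedding space and is trained with a cosine semantic-consistency loss; the projected embedding would then condition a StyleGAN-based generator (not shown). The prompt-based mechanism, the exact loss combination, and all names and dimensions here are assumptions for illustration only; the authors' released code is at the repository linked above.
```python
# Minimal sketch (not the authors' code): an MLP projects a CLIP-style text
# embedding into the image-embedding space and is trained with a cosine
# semantic-consistency loss. Dimensions and architecture are assumptions;
# the paper's prompt-based projection and StyleGAN generator are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToImageEmbeddingProjector(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Keep outputs on the unit sphere, as CLIP embeddings conventionally are.
        return F.normalize(self.net(text_emb), dim=-1)

def semantic_consistency_loss(pred_img_emb: torch.Tensor, ref_img_emb: torch.Tensor) -> torch.Tensor:
    # Encourage the projected embedding to align with the embedding CLIP
    # assigns to the corresponding image.
    return 1.0 - F.cosine_similarity(pred_img_emb, ref_img_emb, dim=-1).mean()

# Toy usage with random tensors standing in for real CLIP embeddings.
projector = TextToImageEmbeddingProjector()
text_emb = F.normalize(torch.randn(4, 512), dim=-1)
ref_img_emb = F.normalize(torch.randn(4, 512), dim=-1)
loss = semantic_consistency_loss(projector(text_emb), ref_img_emb)
loss.backward()
```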
Related papers
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
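For illustration only, here is a minimal PyTorch sketch of the second stage described above: text features from a VLLM-generated emotion description are fused with image features in a small transformer classifier. The encoders, feature dimensions, and number of emotion classes are placeholder assumptions, not details from the paper.
```python
# Hypothetical sketch: fuse tokens of a VLLM-generated emotion description
# with visual tokens of the image in a transformer encoder, then classify.
import torch
import torch.nn as nn

class ContextFusionClassifier(nn.Module):
    def __init__(self, dim: int = 512, num_emotions: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned classification token
        self.head = nn.Linear(dim, num_emotions)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, T, D) tokens of the generated description
        # image_feats: (B, V, D) visual tokens (e.g., patch features of the image)
        b = text_feats.size(0)
        tokens = torch.cat([self.cls.expand(b, -1, -1), text_feats, image_feats], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])  # logits over emotion classes

# Toy usage with random tensors standing in for real encoder outputs.
model = ContextFusionClassifier()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 49, 512))
print(logits.shape)  # torch.Size([2, 8])
```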
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- DiffMorph: Text-less Image Morphing with Diffusion Models [0.0]
DiffMorph synthesizes images that mix concepts without the use of textual prompts.
DiffMorph takes an initial image with conditioning artist-drawn sketches to generate a morphed image.
We employ a pre-trained text-to-image diffusion model and fine-tune it to reconstruct each image faithfully.
arXiv Detail & Related papers (2024-01-01T12:42:32Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
- FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion [16.583537785874604]
We propose a novel text-conditioned editing model, called FICE, capable of handling a wide variety of diverse text descriptions.
FICE generates highly realistic fashion images and leads to stronger editing performance than existing competing approaches.
arXiv Detail & Related papers (2023-01-05T15:33:23Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
The visual-linguistic similarity module learns text-image matching by mapping the image and the text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
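For illustration only, here is a minimal PyTorch sketch of the visual-linguistic similarity idea in the TediGAN entry above: image and text features are projected into a common embedding space and trained with a symmetric matching loss. The encoders, dimensions, and temperature are assumptions, not the paper's implementation.
```python
# Hypothetical sketch: project image and text features into a shared space and
# train them with a symmetric contrastive matching loss (matched pairs on the
# diagonal of the similarity matrix).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceProjector(nn.Module):
    def __init__(self, image_dim: int = 2048, text_dim: int = 768, embed_dim: int = 256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt

def matching_loss(img: torch.Tensor, txt: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Similarity between every image and every text in the batch.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for real encoder outputs.
proj = CommonSpaceProjector()
img, txt = proj(torch.randn(4, 2048), torch.randn(4, 768))
loss = matching_loss(img, txt)
loss.backward()
```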