LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators
- URL: http://arxiv.org/abs/2311.03716v1
- Date: Tue, 7 Nov 2023 04:44:40 GMT
- Title: LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators
- Authors: Allen Roush, Emil Zakirov, Artemiy Shirokov, Polina Lunina, Jack Gane, Alexander Duffy, Charlie Basil, Aber Whitcomb, Jim Benedetto, Chris DeWolfe
- Abstract summary: We describe the techniques that can be used to make Large Language Models (LLMs) act as Art Directors that enhance image and video generation.
We explore how LaDi integrates multiple techniques for augmenting the capabilities of text-to-image generators (T2Is) and text-to-video generators (T2Vs).
- Score: 33.7054351451505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in text-to-image generation have revolutionized numerous
fields, including art and cinema, by automating the generation of high-quality,
context-aware images and video. However, the utility of these technologies is
often limited by the inadequacy of text prompts in guiding the generator to
produce artistically coherent and subject-relevant images. In this paper, we
describe the techniques that can be used to make Large Language Models (LLMs)
act as Art Directors that enhance image and video generation. We describe our
unified system for this called "LaDi". We explore how LaDi integrates multiple
techniques for augmenting the capabilities of text-to-image generators (T2Is)
and text-to-video generators (T2Vs), with a focus on constrained decoding,
intelligent prompting, fine-tuning, and retrieval. LaDi and these techniques
are being used today in apps and platforms developed by Plai Labs.
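The abstract does not include implementation details, but the "intelligent prompting" technique it names can be illustrated with a minimal sketch: an LLM is instructed to act as an art director and expand a terse user prompt before it is handed to a T2I or T2V model. Everything below (the `llm_complete` placeholder, the system instruction, `enhance_prompt`) is a hypothetical illustration, not LaDi's actual code.

```python
# Minimal sketch of LLM-as-art-director prompt enhancement.
# `llm_complete` stands in for any chat-completion API; it, the system
# instruction, and `enhance_prompt` are assumptions, not LaDi's code.

ART_DIRECTOR_SYSTEM = (
    "You are an art director. Rewrite the user's idea as one detailed "
    "text-to-image prompt: specify subject, composition, lighting, "
    "color palette, medium, and style. Return only the prompt."
)

def llm_complete(system: str, user: str) -> str:
    """Placeholder for a real LLM call (e.g., an OpenAI-style chat API)."""
    raise NotImplementedError("wire up an LLM provider here")

def enhance_prompt(user_idea: str) -> str:
    """Ask the LLM to act as an art director and enrich a terse prompt."""
    return llm_complete(ART_DIRECTOR_SYSTEM, user_idea)

# Usage: the enriched prompt is then passed to any T2I/T2V generator,
# e.g. image = t2i_model.generate(enhance_prompt("a knight in a forest")),
# where `t2i_model` is likewise hypothetical.
```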
Related papers
- Text-Animator: Controllable Visual Text Video Generation [149.940821790235]
We propose an innovative approach termed Text-Animator for visual text video generation.
Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos.
We also develop a camera control module and a text refinement module to improve the stability of generated visual text.
arXiv Detail & Related papers (2024-06-25T17:59:41Z)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
- A Survey of AI Text-to-Image and AI Text-to-Video Generators [0.4662017507844857]
Text-to-Image and Text-to-Video AI generation models are revolutionary technologies that use deep learning and natural language processing (NLP) techniques to create images and videos from textual descriptions.
This paper investigates cutting-edge approaches in the discipline of Text-to-Image and Text-to-Video AI generations.
arXiv Detail & Related papers (2023-11-10T17:33:58Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with only 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
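LAMP's exact temporal-spatial motion learning layers are not reproduced in the abstract above; as a sketch of the general pattern it alludes to, the block below inflates a pretrained 2D convolution into a (2+1)D block by adding an identity-initialized 1D temporal convolution, a common way to extend T2I backbones to video. The tensor layout and names are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSpatial2DPlus1D(nn.Module):
    """Generic (2+1)D block: a pretrained spatial 2D conv followed by a
    new 1D temporal conv. A sketch of the inflation pattern only, not
    LAMP's actual temporal-spatial motion learning layers."""

    def __init__(self, spatial_conv: nn.Conv2d):
        super().__init__()
        self.spatial = spatial_conv  # pretrained T2I weights
        ch = spatial_conv.out_channels
        # Identity-initialized temporal conv, so training starts from
        # the image model's per-frame behavior.
        self.temporal = nn.Conv1d(ch, ch, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        y = self.spatial(x.reshape(b * f, c, h, w))             # per-frame 2D conv
        _, c2, h2, w2 = y.shape
        y = y.reshape(b, f, c2, h2, w2).permute(0, 3, 4, 2, 1)  # (b, h, w, c, f)
        y = self.temporal(y.reshape(b * h2 * w2, c2, f))        # mix across frames
        y = y.reshape(b, h2, w2, c2, f).permute(0, 4, 3, 1, 2)
        return y  # (batch, frames, channels, height, width)
```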
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in communicating effectively with T2I models, such as Stable Diffusion, through natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems in an effort to align them with human intent, and introduce a new task: interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
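As a rough illustration of the interactive text-to-image (iT2I) task, the loop below keeps a running conversation history and asks an LLM for a fresh, self-contained T2I prompt on every turn. The `llm_complete` and `t2i_generate` callables are placeholders for any chat LLM and any off-the-shelf T2I model; this is not Mini-DALLE3's actual interface.

```python
# Hypothetical sketch of an interactive text-to-image (iT2I) session.
# `llm_complete` and `t2i_generate` are placeholders for any chat LLM
# and any off-the-shelf T2I model; they are not Mini-DALLE3's API.

ITI_SYSTEM = (
    "You help a user iteratively refine an image. Given the conversation "
    "so far, output one self-contained text-to-image prompt reflecting "
    "the latest request and all earlier constraints."
)

def interactive_t2i(llm_complete, t2i_generate) -> None:
    history = []
    while True:
        request = input("describe or refine the image (empty to quit): ")
        if not request:
            break
        history.append(f"User: {request}")
        prompt = llm_complete(ITI_SYSTEM, "\n".join(history))
        history.append(f"Prompt: {prompt}")
        image = t2i_generate(prompt)  # e.g., a Stable Diffusion pipeline
        print("generated image for prompt:", prompt)  # display/save `image`
```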
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Language Models Can See: Plugging Visual Controls in Text Generation [48.05127160095048]
We propose a training-free framework, called MAGIC, for plugging in visual controls in the generation process.
MAGIC is a plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation.
On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup.
arXiv Detail & Related papers (2022-05-05T13:56:18Z)
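In simplified form, MAGIC-style decoding can be seen as re-ranking the language model's top-k next-token candidates with an added CLIP image-relevance term. The sketch below shows only that scoring rule; it omits the degeneration penalty of the full method, and `lm_topk` and `clip_similarity` are assumed placeholder interfaces.

```python
# Simplified sketch of image-guided decoding in the spirit of MAGIC:
# rescore the language model's top-k candidate tokens by how well the
# extended text matches the image under CLIP. Omits MAGIC's
# degeneration penalty; `lm_topk` and `clip_similarity` are placeholders.

def magic_style_step(prefix, image, lm_topk, clip_similarity, k=5, beta=0.5):
    """Pick the next token by LM probability plus CLIP image relevance."""
    candidates = lm_topk(prefix, k)  # [(token_string, lm_prob), ...]
    best_token, best_score = None, float("-inf")
    for token, lm_prob in candidates:
        relevance = clip_similarity(prefix + token, image)  # in [-1, 1]
        score = lm_prob + beta * relevance
        if score > best_score:
            best_token, best_score = token, score
    return best_token

def caption(image, lm_topk, clip_similarity, max_len=30):
    """Greedy image-guided caption; no stop-token handling for brevity."""
    text = ""
    for _ in range(max_len):
        text += magic_style_step(text, image, lm_topk, clip_similarity)
    return text
```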
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
- TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
The step-by-step learning process helps stabilize training and enables the creation of high-resolution videos conditioned on text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)
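As a schematic rendering of TiVGAN's step-by-step idea, the loop below starts from single-frame text-to-image training and doubles the clip length each stage. The doubling schedule and the `train_stage` interface are assumptions, not the paper's training code.

```python
# Conceptual sketch of TiVGAN-style step-by-step training: begin with a
# single text-conditioned frame, then grow the clip length stage by
# stage. The schedule and `train_stage` interface are assumptions.

def progressive_video_training(train_stage, max_frames: int = 16) -> None:
    frames = 1
    while frames <= max_frames:
        # Train the generator/discriminator pair at the current clip
        # length until stable, then move on to longer clips.
        train_stage(num_frames=frames)
        frames *= 2
```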
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.