LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators
- URL: http://arxiv.org/abs/2311.03716v1
- Date: Tue, 7 Nov 2023 04:44:40 GMT
- Title: LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators
- Authors: Allen Roush, Emil Zakirov, Artemiy Shirokov, Polina Lunina, Jack Gane, Alexander Duffy, Charlie Basil, Aber Whitcomb, Jim Benedetto, Chris DeWolfe
- Abstract summary: We describe the techniques that can be used to make Large Language Models (LLMs) act as Art Directors that enhance image and video generation.
We explore how LaDi integrates multiple techniques for augmenting the capabilities of text-to-image generators (T2Is) and text-to-video generators (T2Vs).
- Score: 33.7054351451505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in text-to-image generation have revolutionized numerous
fields, including art and cinema, by automating the generation of high-quality,
context-aware images and video. However, the utility of these technologies is
often limited by the inadequacy of text prompts in guiding the generator to
produce artistically coherent and subject-relevant images. In this paper, we
describe the techniques that can be used to make Large Language Models (LLMs)
act as Art Directors that enhance image and video generation. We describe our
unified system for this called "LaDi". We explore how LaDi integrates multiple
techniques for augmenting the capabilities of text-to-image generators (T2Is)
and text-to-video generators (T2Vs), with a focus on constrained decoding,
intelligent prompting, fine-tuning, and retrieval. LaDi and these techniques
are being used today in apps and platforms developed by Plai Labs.
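The abstract does not include implementation details, but the "intelligent prompting" technique it names can be illustrated with a minimal sketch: an LLM is instructed to act as an art director and expand a terse user prompt before it is handed to a T2I or T2V model. Everything below (the `llm_complete` placeholder, the system instruction, `enhance_prompt`) is a hypothetical illustration, not LaDi's actual code.

```python
# Minimal sketch of LLM-as-art-director prompt enhancement.
# `llm_complete` stands in for any chat-completion API; it, the system
# instruction, and `enhance_prompt` are assumptions, not LaDi's code.

ART_DIRECTOR_SYSTEM = (
    "You are an art director. Rewrite the user's idea as one detailed "
    "text-to-image prompt: specify subject, composition, lighting, "
    "color palette, medium, and style. Return only the prompt."
)

def llm_complete(system: str, user: str) -> str:
    """Placeholder for a real LLM call (e.g., an OpenAI-style chat API)."""
    raise NotImplementedError("wire up an LLM provider here")

def enhance_prompt(user_idea: str) -> str:
    """Ask the LLM to act as an art director and enrich a terse prompt."""
    return llm_complete(ART_DIRECTOR_SYSTEM, user_idea)

# Usage: the enriched prompt is then passed to any T2I/T2V generator,
# e.g. image = t2i_model.generate(enhance_prompt("a knight in a forest")),
# where `t2i_model` is likewise hypothetical.
```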
Related papers
- Text-Animator: Controllable Visual Text Video Generation [149.940821790235]
We propose an innovative approach termed Text-Animator for visual text video generation.
Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos.
We also develop a camera control module and a text refinement module to improve the stability of generated visual text.
arXiv Detail & Related papers (2024-06-25T17:59:41Z)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
- A Survey of AI Text-to-Image and AI Text-to-Video Generators [0.4662017507844857]
Text-to-Image and Text-to-Video AI generation models are revolutionary technologies that use deep learning and natural language processing (NLP) techniques to create images and videos from textual descriptions.
This paper investigates cutting-edge approaches in the discipline of Text-to-Image and Text-to-Video AI generations.
arXiv Detail & Related papers (2023-11-10T17:33:58Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with only 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
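LAMP's exact temporal-spatial motion learning layers are not reproduced in the abstract above; as a sketch of the general pattern it alludes to, the block below inflates a pretrained 2D convolution into a (2+1)D block by adding an identity-initialized 1D temporal convolution, a common way to extend T2I backbones to video. The tensor layout and names are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSpatial2DPlus1D(nn.Module):
    """Generic (2+1)D block: a pretrained spatial 2D conv followed by a
    new 1D temporal conv. A sketch of the inflation pattern only, not
    LAMP's actual temporal-spatial motion learning layers."""

    def __init__(self, spatial_conv: nn.Conv2d):
        super().__init__()
        self.spatial = spatial_conv  # pretrained T2I weights
        ch = spatial_conv.out_channels
        # Identity-initialized temporal conv, so training starts from
        # the image model's per-frame behavior.
        self.temporal = nn.Conv1d(ch, ch, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        y = self.spatial(x.reshape(b * f, c, h, w))             # per-frame 2D conv
        _, c2, h2, w2 = y.shape
        y = y.reshape(b, f, c2, h2, w2).permute(0, 3, 4, 2, 1)  # (b, h, w, c, f)
        y = self.temporal(y.reshape(b * h2 * w2, c2, f))        # mix across frames
        y = y.reshape(b, h2, w2, c2, f).permute(0, 4, 3, 1, 2)
        return y  # (batch, frames, channels, height, width)
```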
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in communicating effectively with T2I models, such as Stable Diffusion, through natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems in an effort to align them with human intent, and introduce a new task: interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
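As a rough illustration of the interactive text-to-image (iT2I) task, the loop below keeps a running conversation history and asks an LLM for a fresh, self-contained T2I prompt on every turn. The `llm_complete` and `t2i_generate` callables are placeholders for any chat LLM and any off-the-shelf T2I model; this is not Mini-DALLE3's actual interface.

```python
# Hypothetical sketch of an interactive text-to-image (iT2I) session.
# `llm_complete` and `t2i_generate` are placeholders for any chat LLM
# and any off-the-shelf T2I model; they are not Mini-DALLE3's API.

ITI_SYSTEM = (
    "You help a user iteratively refine an image. Given the conversation "
    "so far, output one self-contained text-to-image prompt reflecting "
    "the latest request and all earlier constraints."
)

def interactive_t2i(llm_complete, t2i_generate) -> None:
    history = []
    while True:
        request = input("describe or refine the image (empty to quit): ")
        if not request:
            break
        history.append(f"User: {request}")
        prompt = llm_complete(ITI_SYSTEM, "\n".join(history))
        history.append(f"Prompt: {prompt}")
        image = t2i_generate(prompt)  # e.g., a Stable Diffusion pipeline
        print("generated image for prompt:", prompt)  # display/save `image`
```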
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Language Models Can See: Plugging Visual Controls in Text Generation [48.05127160095048]
We propose a training-free framework, called MAGIC, for plugging in visual controls in the generation process.
MAGIC is a plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation.
On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup.
arXiv Detail & Related papers (2022-05-05T13:56:18Z)
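In simplified form, MAGIC-style decoding can be seen as re-ranking the language model's top-k next-token candidates with an added CLIP image-relevance term. The sketch below shows only that scoring rule; it omits the degeneration penalty of the full method, and `lm_topk` and `clip_similarity` are assumed placeholder interfaces.

```python
# Simplified sketch of image-guided decoding in the spirit of MAGIC:
# rescore the language model's top-k candidate tokens by how well the
# extended text matches the image under CLIP. Omits MAGIC's
# degeneration penalty; `lm_topk` and `clip_similarity` are placeholders.

def magic_style_step(prefix, image, lm_topk, clip_similarity, k=5, beta=0.5):
    """Pick the next token by LM probability plus CLIP image relevance."""
    candidates = lm_topk(prefix, k)  # [(token_string, lm_prob), ...]
    best_token, best_score = None, float("-inf")
    for token, lm_prob in candidates:
        relevance = clip_similarity(prefix + token, image)  # in [-1, 1]
        score = lm_prob + beta * relevance
        if score > best_score:
            best_token, best_score = token, score
    return best_token

def caption(image, lm_topk, clip_similarity, max_len=30):
    """Greedy image-guided caption; no stop-token handling for brevity."""
    text = ""
    for _ in range(max_len):
        text += magic_style_step(text, image, lm_topk, clip_similarity)
    return text
```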
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
- TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
The step-by-step learning process helps stabilize training and enables the creation of high-resolution videos conditioned on text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)
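As a schematic rendering of TiVGAN's step-by-step idea, the loop below starts from single-frame text-to-image training and doubles the clip length each stage. The doubling schedule and the `train_stage` interface are assumptions, not the paper's training code.

```python
# Conceptual sketch of TiVGAN-style step-by-step training: begin with a
# single text-conditioned frame, then grow the clip length stage by
# stage. The schedule and `train_stage` interface are assumptions.

def progressive_video_training(train_stage, max_frames: int = 16) -> None:
    frames = 1
    while frames <= max_frames:
        # Train the generator/discriminator pair at the current clip
        # length until stable, then move on to longer clips.
        train_stage(num_frames=frames)
        frames *= 2
```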
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.