Language Models Can See: Plugging Visual Controls in Text Generation
- URL: http://arxiv.org/abs/2205.02655v1
- Date: Thu, 5 May 2022 13:56:18 GMT
- Title: Language Models Can See: Plugging Visual Controls in Text Generation
- Authors: Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, Nigel Collier
- Abstract summary: We propose MAGIC, a training-free framework for plugging visual controls into the generation process.
MAGIC is a plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation.
On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup.
- Score: 48.05127160095048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative language models (LMs) such as GPT-2/3 can be prompted to generate
text with remarkable quality. While they are designed for text-prompted
generation, it remains an open question how the generation process could be
guided by modalities beyond text such as images. In this work, we propose a
training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP),
for plugging visual controls into the generation process and enabling LMs to
perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC
is a simple yet efficient plug-and-play framework, which directly combines an
off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP)
for image-grounded text generation. During decoding, MAGIC influences the
generation of the LM by introducing a CLIP-induced score, called magic score,
which regularizes the generated result to be semantically related to a given
image while remaining coherent with the previously generated context. Notably,
the proposed decoding scheme does not involve any gradient update operation and
is therefore computationally efficient. On the challenging task of zero-shot
image captioning, MAGIC outperforms the state-of-the-art method by notable
margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework
and is theoretically compatible with any text generation task that incorporates
image grounding. In our experiments, we show that it is also capable of
performing visually grounded story generation given both an image and a text
prompt.
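To make the decoding idea concrete, below is a minimal Python sketch of a MAGIC-style step: the LM's top-k next-token candidates are re-ranked by a weighted combination of the LM probability and a CLIP image-text similarity term, with no gradient updates. This is not the authors' released implementation; the checkpoints, the interpolation weight alpha, and the way candidate continuations are scored are illustrative assumptions, and only the two components named in the abstract (LM confidence and the CLIP-induced score) are modelled.

```python
# Minimal sketch of MAGIC-style, image-grounded decoding (not the official code).
# At each step, re-rank GPT-2's top-k candidates with a CLIP image-text score.
import torch
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def magic_style_decode(prompt, image_path, max_len=16, k=5, alpha=0.5):
    # Encode the conditioning image once with CLIP.
    image = Image.open(image_path).convert("RGB")
    img_feat = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_len):
        # LM confidence for the top-k next-token candidates.
        probs = torch.softmax(lm(ids).logits[0, -1], dim=-1)
        topk_probs, topk_ids = probs.topk(k)

        # CLIP similarity between the image and each candidate continuation
        # (a "magic"-like score, normalised over the k candidates).
        cand_texts = [lm_tok.decode(torch.cat([ids[0], t.view(1)])) for t in topk_ids]
        txt_inputs = clip_proc(text=cand_texts, return_tensors="pt",
                               padding=True, truncation=True)
        txt_feat = clip.get_text_features(**txt_inputs)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        clip_score = torch.softmax((txt_feat @ img_feat.T).squeeze(-1), dim=-1)

        # Plug-and-play combination: no gradient updates, just re-ranking.
        score = (1 - alpha) * topk_probs + alpha * clip_score
        next_id = topk_ids[score.argmax()].view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return lm_tok.decode(ids[0], skip_special_tokens=True)

# Example (hypothetical image path and prompt):
# print(magic_style_decode("A photo of", "example.jpg"))
```

The extra per-step cost is a single CLIP text-encoder pass over k candidates, which is why this kind of gradient-free re-ranking stays computationally cheap.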
Related papers
- LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation [62.232361821779335]
We introduce LASER, a tuning-free attention control framework organized as a progressive process of prompt-aware editing and stable animation generation.
We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity.
Our meticulous control over spatial features and self-attention ensures structural consistency in the images.
arXiv Detail & Related papers (2024-04-21T07:13:56Z)
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)