Instruct-Imagen: Image Generation with Multi-modal Instruction
- URL: http://arxiv.org/abs/2401.01952v1
- Date: Wed, 3 Jan 2024 19:31:58 GMT
- Title: Instruct-Imagen: Image Generation with Multi-modal Instruction
- Authors: Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li,
Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang,
Xuhui Jia
- Abstract summary: instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
- Score: 90.04481955523514
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents instruct-imagen, a model that tackles heterogeneous image
generation tasks and generalizes across unseen tasks. We introduce *multi-modal
instruction* for image generation, a task representation articulating a range
of generation intents with precision. It uses natural language to amalgamate
disparate modalities (e.g., text, edge, style, subject, etc.), such that
abundant generation intents can be standardized in a uniform format.
We then build instruct-imagen by fine-tuning a pre-trained text-to-image
diffusion model with a two-stage framework. First, we adapt the model using the
retrieval-augmented training, to enhance model's capabilities to ground its
generation on external multimodal context. Subsequently, we fine-tune the
adapted model on diverse image generation tasks that requires vision-language
understanding (e.g., subject-driven generation, etc.), each paired with a
multi-modal instruction encapsulating the task's essence. Human evaluation on
various image generation datasets reveals that instruct-imagen matches or
surpasses prior task-specific models in-domain and demonstrates promising
generalization to unseen and more complex tasks.
Related papers
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z) - Apollo: Zero-shot MultiModal Reasoning with Multiple Experts [14.359111652624899]
We propose a modular framework that leverages the expertise of different foundation models over different modalities and domains.
Our approach enables decentralized command execution and allows each model to both contribute and benefit from the expertise of the other models.
We demonstrate this method on a novel task, audio-aware image captioning, in which an image and audio are given and the task is to generate text that describes the image within the context of the provided audio.
arXiv Detail & Related papers (2023-10-25T22:36:40Z) - InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for
Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.