InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following
- URL: http://arxiv.org/abs/2312.06738v4
- Date: Thu, 17 Oct 2024 01:30:33 GMT
- Title: InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following
- Authors: Shufan Li, Harkanwar Singh, Aditya Grover
- Abstract summary: InstructAny2Pix is a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text.
We demonstrate that our system can perform a series of novel instruction-guided editing tasks.
- Score: 26.457571615782985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions on the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which can be used by the diffusion decoder. Additionally, to facilitate training efficiency and improve generation quality, we include an additional refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git
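The abstract describes a three-part architecture (multi-modal encoder, multi-modal LLM, diffusion decoder) plus a refinement prior that polishes the LLM's output embedding. Below is a minimal PyTorch sketch of that data flow. All class names, dimensions, and signatures here (MultimodalEncoder, MultimodalLLM, RefinementPrior, edit_image) are illustrative assumptions for exposition, not the actual API of the InstructAny2Pix repository.

```python
# Minimal sketch of the pipeline described in the abstract; placeholder modules only.
import torch
import torch.nn as nn

LATENT_DIM = 768  # assumed size of the shared multi-modal latent space


class MultimodalEncoder(nn.Module):
    """Projects image and audio features into one unified latent space (placeholder)."""

    def __init__(self, image_dim=1024, audio_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, LATENT_DIM)
        self.audio_proj = nn.Linear(audio_dim, LATENT_DIM)

    def forward(self, image_feats=None, audio_feats=None):
        tokens = []
        if image_feats is not None:
            tokens.append(self.image_proj(image_feats))
        if audio_feats is not None:
            tokens.append(self.audio_proj(audio_feats))
        return torch.cat(tokens, dim=1)  # (batch, n_tokens, LATENT_DIM)


class MultimodalLLM(nn.Module):
    """Stand-in for the multi-modal LLM that reads the instruction plus encoded
    media and emits a conditional embedding for the diffusion decoder."""

    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(LATENT_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, instruction_tokens, media_tokens):
        seq = torch.cat([instruction_tokens, media_tokens], dim=1)
        hidden = self.backbone(seq)
        return hidden[:, -1:, :]  # last hidden state used as the conditional embedding


class RefinementPrior(nn.Module):
    """Refines the LLM output embedding to improve visual quality (placeholder)."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM, LATENT_DIM), nn.GELU(),
            nn.Linear(LATENT_DIM, LATENT_DIM),
        )

    def forward(self, cond):
        return cond + self.mlp(cond)  # residual refinement of the embedding


def edit_image(instruction_tokens, image_feats, audio_feats,
               encoder, llm, prior, diffusion_decoder):
    """End-to-end flow: encode media -> instruct -> refine -> diffusion decode."""
    media = encoder(image_feats, audio_feats)
    cond = prior(llm(instruction_tokens, media))
    return diffusion_decoder(cond)  # diffusion model conditioned on the embedding
```

In this sketch the refinement prior sits between the LLM and the diffusion decoder, mirroring the role the abstract assigns to it: the diffusion model never sees the raw LLM output, only the refined conditional embedding.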
Related papers
- X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation [7.61087111021017]
We propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities.
X2I shows a performance degradation of less than 1% while gaining various multimodal understanding abilities.
arXiv Detail & Related papers (2025-03-08T09:07:45Z)
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
We focus on improving visual understanding to boost vision-language models.
We propose Arcana, a multimodal language model that introduces two crucial techniques.
arXiv Detail & Related papers (2024-10-17T16:36:38Z)
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.
To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.
This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation [59.24938416319019]
InstructSeq is an instruction-conditioned multi-modal modeling framework.
It unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data.
arXiv Detail & Related papers (2023-11-30T18:59:51Z)
- Apollo: Zero-shot MultiModal Reasoning with Multiple Experts [14.359111652624899]
We propose a modular framework that leverages the expertise of different foundation models over different modalities and domains.
Our approach enables decentralized command execution and allows each model to both contribute and benefit from the expertise of the other models.
We demonstrate this method on a novel task, audio-aware image captioning, in which an image and audio are given and the task is to generate text that describes the image within the context of the provided audio.
arXiv Detail & Related papers (2023-10-25T22:36:40Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)