VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model
- URL: http://arxiv.org/abs/2406.01059v2
- Date: Sat, 3 Aug 2024 08:52:32 GMT
- Title: VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model
- Authors: Jinze Yang, Haoran Wang, Zining Zhu, Chenglong Liu, Meng Wymond Wu, Zeke Xie, Zhong Ji, Jungong Han, Mingming Sun
- Abstract summary: This work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users.
We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked parts of a given image.
In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to further enhance the interaction between specific spatial regions of the image and the corresponding parts of the text prompts.
- Score: 76.02314305164595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on resolving the problem of image outpainting, which aims to extrapolate the surrounding parts given the center contents of an image. Although recent works have achieved promising performance, the lack of versatility and customization hinders their practical applications in broader scenarios. Therefore, this work presents a novel image outpainting framework that is capable of customizing the results according to the requirements of users. First of all, we take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked parts of a given image. Accordingly, the obtained text prompts are introduced to endow our model with the capacity to customize the outpainting results. In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to further enhance the interaction between specific spatial regions of the image and the corresponding parts of the text prompts. Note that unlike most existing methods, our approach is very resource-efficient since it is only slightly fine-tuned from the off-the-shelf Stable Diffusion (SD) model rather than being trained from scratch. Finally, the experimental results on three commonly used datasets, i.e., Scenery, Building, and WikiArt, demonstrate that our model significantly surpasses the SoTA methods. Moreover, versatile outpainting results are presented to demonstrate its customization ability.
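The CTS cross-attention design is only described at a high level above. As a purely illustrative reading, the sketch below shows one way a Center-Total-Surrounding style cross-attention could route image tokens to region-specific text prompts in PyTorch: center tokens attend to the center description, surrounding tokens to the surrounding description, and all tokens additionally attend to the whole-image description. Every name here (CTSCrossAttention, center_mask, etc.) is hypothetical and not taken from the paper.

```python
# Minimal sketch, not the paper's implementation: a possible CTS-style
# cross-attention where image tokens are routed to region-specific prompts.
import torch
import torch.nn as nn


class CTSCrossAttention(nn.Module):
    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per prompt group: center, total, surrounding.
        self.attn = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, num_heads, kdim=text_dim,
                                        vdim=text_dim, batch_first=True)
            for name in ("center", "total", "surrounding")
        })

    def forward(self, img_tokens, center_mask, text_center, text_total, text_surround):
        # img_tokens:  (B, N, dim)      flattened latent/image tokens
        # center_mask: (B, N) bool      True for tokens inside the given center region
        # text_*:      (B, L, text_dim) text embeddings of each prompt part
        c, _ = self.attn["center"](img_tokens, text_center, text_center)
        s, _ = self.attn["surrounding"](img_tokens, text_surround, text_surround)
        # Center tokens take the center branch, the rest take the surrounding branch.
        routed = torch.where(center_mask.unsqueeze(-1), c, s)
        # Every token also attends to the prompt describing the total image.
        t, _ = self.attn["total"](img_tokens, text_total, text_total)
        return routed + t


if __name__ == "__main__":
    B, N, L, dim, text_dim = 2, 64, 77, 320, 768
    module = CTSCrossAttention(dim, text_dim)
    img = torch.randn(B, N, dim)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask[:, : N // 2] = True                          # pretend the first half is the center
    prompts = [torch.randn(B, L, text_dim) for _ in range(3)]
    print(module(img, mask, *prompts).shape)          # torch.Size([2, 64, 320])
```

In such a setup the three text prompts would come from the MLLM described above (descriptions of the unmasked center, the full image, and the masked surroundings), and the module would complement the standard text cross-attention inside a fine-tuned SD UNet block; how the paper actually fuses the three branches is not specified here.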
Related papers
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models [71.40705814904898]
We introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding.
Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space.
arXiv Detail & Related papers (2024-08-09T03:25:42Z) - FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
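FUSE's actual construction is not described in this summary; purely to illustrate the general idea of an adapter that maps one text encoder's embedding space into another's, the sketch below fits a linear map by least squares on paired embeddings of shared anchor texts. The names, the linear form, and the fitting procedure are assumptions for illustration, not the FUSE method.

```python
# Generic embedding-space adapter sketch (not the FUSE algorithm): fit a linear
# map W so that source-encoder embeddings projected by W approximate the
# target encoder's embeddings of the same anchor texts.
import torch


def fit_linear_adapter(src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> torch.Tensor:
    """Least-squares fit of W minimizing ||src_emb @ W - tgt_emb||_F.

    src_emb: (N, d_src) anchor-text embeddings from the source encoder.
    tgt_emb: (N, d_tgt) embeddings of the same texts from the target encoder.
    """
    return torch.linalg.lstsq(src_emb, tgt_emb).solution  # shape (d_src, d_tgt)


if __name__ == "__main__":
    torch.manual_seed(0)
    n, d_src, d_tgt = 1000, 384, 768
    src = torch.randn(n, d_src)                          # stand-in source embeddings
    true_map = torch.randn(d_src, d_tgt) / d_src ** 0.5
    tgt = src @ true_map + 0.01 * torch.randn(n, d_tgt)  # synthetic paired targets

    W = fit_linear_adapter(src, tgt)
    unseen = torch.randn(5, d_src)
    print((unseen @ W).shape)                            # torch.Size([5, 768])
```

A real cross-tokenizer adapter would be fit on embeddings produced by the two actual encoders and could be nonlinear or trained end-to-end; the point is only that the adapter maps between embedding spaces without retraining either model.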
arXiv Detail & Related papers (2024-08-09T02:16:37Z) - Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting [28.65445105418749]
We introduce Anywhere, a pioneering multi-agent framework designed to address challenges in inpainting foreground images.
Anywhere employs various agents, including a visual language model, a large language model, and image generation models.
It excels in foreground-conditioned image inpainting, mitigating "over-imagination", resolving foreground-background discrepancies, and enhancing diversity.
arXiv Detail & Related papers (2024-04-29T11:13:37Z) - BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion [61.90969199199739]
BrushNet is a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM.
Experiments demonstrate BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.
arXiv Detail & Related papers (2024-03-11T17:59:31Z) - UMAIR-FPS: User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style [5.441554441737648]
We propose the User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style (UMAIR-FPS)
In the feature extraction phase, for image features, we are the first to combine image painting-style features with semantic features to construct a dual-output image encoder.
For text features, we obtain text embeddings by fine-tuning Sentence-Transformers.
In the multi-modal fusion phase, we propose a novel user-aware multi-modal contribution measurement mechanism.
arXiv Detail & Related papers (2024-02-16T00:25:53Z) - Towards Language-Driven Video Inpainting via Multimodal Large Language Models [116.22805434658567]
We introduce a new task -- language-driven video inpainting.
It uses natural language instructions to guide the inpainting process.
We present the Remove Objects from Videos by Instructions dataset.
arXiv Detail & Related papers (2024-01-18T18:59:13Z) - PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning [35.39822183728463]
We present a novel Prompt-IML framework for detecting tampered images.
Humans tend to discern the authenticity of an image based on semantic and high-frequency information.
Our model achieves better performance on eight typical fake-image datasets.
arXiv Detail & Related papers (2024-01-01T03:45:07Z) - Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model [19.800236358666123]
We propose Uni-paint, a unified framework for multimodal inpainting.
Uni-paint offers various modes of guidance, including text-driven, stroke-driven, and exemplar-driven inpainting.
Our approach achieves comparable results to existing single-modal methods.
arXiv Detail & Related papers (2023-10-11T06:11:42Z) - Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion [34.662798793560995]
We present a simple yet highly effective approach to personalization using a highly personalized (PerHi) text embedding.
Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text.
arXiv Detail & Related papers (2023-03-15T17:07:45Z) - MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.