Related papers: Locate, Assign, Refine: Taming Customized Promptable Image Inpainting

URL: http://arxiv.org/abs/2403.19534v2
Date: Wed, 22 Jan 2025 15:37:39 GMT
Title: Locate, Assign, Refine: Taming Customized Promptable Image Inpainting
Authors: Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang, Xiangteng He,
Abstract summary: We introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting. We propose LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of a specific region in images corresponding to the mask prompt. LAR-Gen works in a coarse-to-fine manner to ensure context consistency with the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency.
Score: 22.163855501668206
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prior studies have made significant progress in image inpainting guided by either a text description or a subject image. However, research on inpainting with flexible guidance or control, i.e., text-only, image-only, and their combination, is still at an early stage. Therefore, in this paper, we introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting. We propose LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of a specific region in images corresponding to the mask prompt, incorporating both the text prompt and the image prompt. LAR-Gen works in a coarse-to-fine manner to ensure context consistency with the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency. It consists of three mechanisms: (i) Locate mechanism: concatenating the noise with the masked scene image to achieve precise regional editing, (ii) Assign mechanism: employing a decoupled cross-attention mechanism to accommodate multi-modal guidance, and (iii) Refine mechanism: using a novel RefineNet to supplement subject details. Additionally, to address the scarcity of training data, we introduce a novel data engine that automatically extracts a substantial number of data pairs, each consisting of a local text prompt and the corresponding visual instance, from vast image data, leveraging publicly available pre-trained large models. Extensive experiments and various application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency.
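The Locate and Assign mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, the extra mask channel, the mask convention (1 marks the region to inpaint), and the `image_scale` weight are all assumptions made for illustration, and the Refine step (RefineNet) is omitted because the abstract gives no detail on it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def locate_input(noisy_latent, masked_scene_latent, mask):
    # "Locate" (assumed form): channel-wise concatenation of the noise with
    # the masked scene image; appending the mask itself is an assumption.
    # Shapes: (C,H,W) + (C,H,W) + (1,H,W) -> (2C+1,H,W)
    return np.concatenate([noisy_latent, masked_scene_latent, mask], axis=0)

def attention(q, k, v):
    # Plain scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(x, text_tokens, image_tokens, w, image_scale=1.0):
    # "Assign" (assumed form): one shared query, separate key/value
    # projections per modality, and the two attention outputs summed.
    q = x @ w["q"]
    text_out = attention(q, text_tokens @ w["k_text"], text_tokens @ w["v_text"])
    image_out = attention(q, image_tokens @ w["k_image"], image_tokens @ w["v_image"])
    return text_out + image_scale * image_out

rng = np.random.default_rng(0)
C, H, W, d = 4, 8, 8, 16  # hypothetical latent and feature sizes

noisy = rng.normal(size=(C, H, W))
scene = rng.normal(size=(C, H, W))
mask = np.zeros((1, H, W))
mask[:, 2:6, 2:6] = 1.0  # assumed convention: 1 marks the region to inpaint
unet_input = locate_input(noisy, scene * (1.0 - mask), mask)

names = ["q", "k_text", "v_text", "k_image", "v_image"]
w = {name: rng.normal(size=(d, d)) * 0.1 for name in names}
x = rng.normal(size=(H * W, d))        # flattened spatial features
text_tokens = rng.normal(size=(5, d))  # stand-in for text prompt embeddings
image_tokens = rng.normal(size=(3, d)) # stand-in for subject image embeddings
out = decoupled_cross_attention(x, text_tokens, image_tokens, w, image_scale=0.5)
print(unet_input.shape, out.shape)
```

Setting `image_scale` to 0 in this sketch reduces the output to text-only cross-attention, which is one plausible way a single model could serve the text-only, image-only, and combined guidance settings the abstract mentions.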

Related papers

COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization [32.26473230517668]
COCOInpaint is a benchmark specifically designed for inpainting detection. It provides high-quality inpainting samples generated by six state-of-the-art inpainting models, with large-scale coverage of 258,266 inpainted images and rich semantic diversity.
arXiv Detail & Related papers (2025-04-25T14:04:36Z)
SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches [116.1810651297801]
SketchYourSeg establishes freehand sketches as a powerful query modality for subjective image segmentation. Our evaluations demonstrate superior performance over existing approaches across diverse benchmarks.
arXiv Detail & Related papers (2025-01-27T13:07:51Z)
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator [44.620847608977776]
Diptych Prompting is a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment. Our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing.
arXiv Detail & Related papers (2024-11-23T06:17:43Z)
Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild. The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models. Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z)
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language. Our method outperforms the existing method in both the accuracy of text recognition and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models [37.133727797607676]
This study introduces Text-Guided Subject-Driven Image Inpainting. We compute dense subject features to ensure accurate subject replication. We employ a discriminative token selection module to eliminate redundant subject details.
arXiv Detail & Related papers (2023-12-05T22:23:19Z)
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD). In our experiments, we apply PhD to subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences. To be more specific, both input texts and images are encoded into one unified multi-modal latent space. Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving the background and style of the original text. Previous methods that edit the whole image have to learn different translation rules for background and text regions simultaneously. We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world. The task imposes three basic requirements, (1) to edit the entity consistent with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally. Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align the relationship between the vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy. We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
Context-Aware Image Inpainting with Learned Semantic Priors [100.99543516733341]
We introduce pretext tasks that are semantically meaningful for estimating the missing contents. We propose a context-aware image inpainting model, which adaptively integrates global semantics and local features.
arXiv Detail & Related papers (2021-06-14T08:09:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.