MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting
- URL: http://arxiv.org/abs/2506.23482v1
- Date: Mon, 30 Jun 2025 03:06:54 GMT
- Title: MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting
- Authors: Jun Huang, Ting Liu, Yihang Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu
- Abstract summary: We present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix.
- Score: 24.950822394526554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
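The style-consistency loss described in the abstract matches Gram matrices of VGG features between the inpainted output and a reference. As a minimal sketch of the Gram-matrix idea only (not the paper's implementation): the features here are plain nested lists standing in for a C-channel feature map with flattened spatial positions, and the function names are illustrative. In MTADiffusion the feature maps would come from a pre-trained VGG network.

```python
def gram_matrix(features):
    """Gram matrix G[i][j] = sum_k F[i][k] * F[j][k] of a feature map,
    given as C rows (channels) of N flattened spatial activations,
    normalized by C * N. The Gram matrix captures channel correlations,
    which correlate with image style rather than content."""
    c = len(features)
    n = len(features[0])
    gram = [[0.0] * c for _ in range(c)]
    for i in range(c):
        for j in range(c):
            gram[i][j] = sum(features[i][k] * features[j][k]
                             for k in range(n)) / (c * n)
    return gram

def style_loss(feat_generated, feat_reference):
    """Squared Frobenius distance between the Gram matrices of the
    generated and reference feature maps (a per-layer style loss)."""
    g1 = gram_matrix(feat_generated)
    g2 = gram_matrix(feat_reference)
    return sum((a - b) ** 2
               for row1, row2 in zip(g1, g2)
               for a, b in zip(row1, row2))
```

Identical feature maps yield zero loss, and the loss grows as the channel correlation statistics (i.e. the style) of the two images diverge.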
Related papers
- BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control [4.984382582612786]
PainterNet is a plugin that can be flexibly embedded into various diffusion models. We propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.
arXiv Detail & Related papers (2024-12-02T07:40:47Z)
- I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting [8.94249680213101]
Inpainting focuses on filling missing or corrupted regions of an image so that they blend seamlessly with the surrounding content and style. We introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Our pipeline delivers creative and accurate inpainting results.
arXiv Detail & Related papers (2024-11-28T10:55:09Z)
- DiffSTR: Controlled Diffusion Models for Scene Text Removal [5.790630195329777]
Scene Text Removal (STR) aims to prevent unauthorized use of text in images.
STR faces several challenges, including boundary artifacts, inconsistent texture and color, and preserving correct shadows.
We introduce a ControlNet diffusion model, treating STR as an inpainting task.
We develop a mask pretraining pipeline to condition our diffusion model.
arXiv Detail & Related papers (2024-10-29T04:20:21Z)
- Improving Text-guided Object Inpainting with Semantic Pre-inpainting [95.17396565347936]
We decompose the typical single-stage object inpainting into two cascaded processes: semantic pre-inpainting and high-fidelity object generation.
To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion framework.
arXiv Detail & Related papers (2024-09-12T17:55:37Z)
- Paint by Inpaint: Learning to Add Image Objects by Removing Them First [8.399234415641319]
We train a diffusion model to invert the inpainting process, effectively adding objects into images. Our results show that the trained model surpasses existing models in both object addition and general editing tasks.
arXiv Detail & Related papers (2024-04-28T15:07:53Z)
- BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion [61.90969199199739]
BrushNet is a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM.
Experiments demonstrate BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.
arXiv Detail & Related papers (2024-03-11T17:59:31Z)
- Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency [78.0488707697235]
ASUKA (Aligned Stable inpainting with UnKnown Areas prior) is a post-processing approach that improves inpainting models. A Masked Auto-Encoder (MAE) provides reconstruction-based priors that mitigate object hallucination, and a specialized VAE decoder treats latent-to-image decoding as a local task.
arXiv Detail & Related papers (2023-12-08T05:08:06Z)
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
- SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model [27.91089554671927]
Generic image inpainting aims to complete a corrupted image by borrowing surrounding information.
By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content.
We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance.
arXiv Detail & Related papers (2022-12-09T18:36:13Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
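Several of the papers above (the MAE line of work, StrucTexTv2) share the masked-modeling recipe: hide a fixed fraction of image patches and supervise the model only on the hidden ones. As a minimal sketch of just the random masking step, with an illustrative function name (none of these papers prescribe this exact interface):

```python
import random

def mask_patches(num_patches, mask_ratio, seed=0):
    """Randomly partition patch indices into visible and masked sets,
    as in MAE-style masked image modeling: a fixed fraction of patches
    is hidden, and the model predicts targets only for those patches."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked
```

For a 14x14 patch grid (196 patches) at the MAE-typical ratio of 0.75, this hides 147 patches and leaves 49 visible; the encoder processes only the visible set, which is what makes the pre-training cheap.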
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.