Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models
- URL: http://arxiv.org/abs/2509.23919v2
- Date: Sun, 09 Nov 2025 08:46:20 GMT
- Title: Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models
- Authors: Longtao Jiang, Jie Huang, Mingfei Han, Lei Chen, Yongqiang Yu, Feng Zhao, Xiaojun Chang, Zhihui Li
- Abstract summary: We develop a training-free text-guided image inpainting method based on Mask AutoRegressive (MAR) models. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in the frequency domain to produce novel guidance tokens, and (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens.
- Score: 49.15136755850853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to masked regions, enabling better local controllability without altering the background. However, directly applying MAR to this task causes the inpainted content either to ignore the prompt or to be disharmonious with the background context. Through analysis of attention maps from inpainted images, we identify the impact of background tokens on text tokens during MAR generation, and leverage this to design Token Painter, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses semantic and context information from the text and background in the frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while keeping it harmonious with the background context; and (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further improve alignment with prompt details and the visual quality of the content. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics. Code: https://github.com/longtaojiang/Token-Painter.
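The abstract describes DEIF and ADAE only at a high level. As a rough illustration of one plausible reading, the minimal PyTorch sketch below blends text-derived and background-derived tokens in the frequency domain (`deif_fuse`) and scales pre-softmax attention scores on guidance and inpainting tokens (`adae_enhance`). The function names, the 1-D FFT over the feature dimension, the low-/high-frequency split, and the uniform `gamma` factor are all illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch


def deif_fuse(text_tokens: torch.Tensor, bg_tokens: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of Dual-Stream Encoder Information Fusion (DEIF).

    Fuses semantic information from text-conditioned tokens with context
    information from background tokens in the frequency domain. The 1-D FFT
    along the feature axis and the low-/high-frequency split are assumptions.
    """
    # Move both token streams to the frequency domain along the feature dim.
    text_freq = torch.fft.rfft(text_tokens, dim=-1)
    bg_freq = torch.fft.rfft(bg_tokens, dim=-1)

    # Assumed split: low frequencies carry global background context,
    # high frequencies carry fine-grained textual semantics.
    cutoff = text_freq.shape[-1] // 2
    fused = torch.cat(
        [
            alpha * bg_freq[..., :cutoff] + (1 - alpha) * text_freq[..., :cutoff],
            text_freq[..., cutoff:],
        ],
        dim=-1,
    )
    # Back to the token domain: these play the role of the guidance tokens.
    return torch.fft.irfft(fused, n=text_tokens.shape[-1], dim=-1)


def adae_enhance(attn_scores: torch.Tensor, guidance_idx: torch.Tensor,
                 inpaint_idx: torch.Tensor, gamma: float = 1.5) -> torch.Tensor:
    """Hypothetical sketch of Adaptive Decoder Attention Score Enhancing (ADAE).

    Scales pre-softmax attention scores on guidance and inpainting token
    positions; the uniform factor `gamma` stands in for whatever adaptive
    rule the paper actually uses.
    """
    enhanced = attn_scores.clone()
    enhanced[..., guidance_idx] *= gamma
    enhanced[..., inpaint_idx] *= gamma
    return torch.softmax(enhanced, dim=-1)
```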
Related papers
- Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into images, implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image.
arXiv Detail & Related papers (2025-09-22T17:59:54Z) - MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting [24.950822394526554]
We present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. Based on MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix.
arXiv Detail & Related papers (2025-06-30T03:06:54Z) - DiffSTR: Controlled Diffusion Models for Scene Text Removal [5.790630195329777]
Scene Text Removal (STR) aims to prevent unauthorized use of text in images.
STR faces several challenges, including boundary artifacts, inconsistent texture and color, and preserving correct shadows.
We introduce a ControlNet diffusion model, treating STR as an inpainting task.
We develop a mask pretraining pipeline to condition our diffusion model.
arXiv Detail & Related papers (2024-10-29T04:20:21Z) - Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis [63.757624792753205]
We present Zero-Painter, a framework for layout-conditional text-to-image synthesis.
Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity.
arXiv Detail & Related papers (2024-06-06T13:02:00Z) - Locate, Assign, Refine: Taming Customized Promptable Image Inpainting [22.163855501668206]
We introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting. We propose LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of specific regions in images corresponding to the mask prompt. Our LAR-Gen adopts a coarse-to-fine manner to ensure context consistency with the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z) - BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification [2.3931689873603603]
Text-based person re-identification (TBPReID) aims to retrieve person images represented by a given textual query.
How to effectively align images and texts globally and locally is a crucial challenge.
We introduce the Bidirectional Local-Matching (BiLMa) framework, which jointly optimizes Masked Image Modeling (MIM) in TBPReID model training.
arXiv Detail & Related papers (2023-09-09T04:01:24Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - PaintSeg: Training-free Segmentation via Painting [50.17936803209125]
PaintSeg is a new unsupervised method for segmenting objects without any training.
Inpainting and outpainting are alternated, with the former masking the foreground and filling in the background, and the latter masking the background while recovering the missing part of the foreground object.
Our experimental results demonstrate that PaintSeg outperforms existing approaches in coarse mask-prompt, box-prompt, and point-prompt segmentation tasks.
arXiv Detail & Related papers (2023-05-30T20:43:42Z) - StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z) - SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model [27.91089554671927]
Generic image inpainting aims to complete a corrupted image by borrowing surrounding information.
By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content.
We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape-guidance.
arXiv Detail & Related papers (2022-12-09T18:36:13Z) - Context-Aware Image Inpainting with Learned Semantic Priors [100.99543516733341]
We introduce pretext tasks that are semantically meaningful for estimating the missing contents.
We propose a context-aware image inpainting model, which adaptively integrates global semantics and local features.
arXiv Detail & Related papers (2021-06-14T08:09:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.