Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion
- URL: http://arxiv.org/abs/2503.07047v1
- Date: Mon, 10 Mar 2025 08:34:31 GMT
- Title: Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion
- Authors: Yongle Zhang, Yimin Liu, Qiang Wu
- Abstract summary: Diffusion models have become widely adopted in image completion tasks. A persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. We propose supplementing text-based guidance with a novel visual aid: a casual sketch. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background.
- Score: 13.846868357952419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have become widely adopted in image completion tasks, with text prompts commonly employed to ensure semantic coherence by providing high-level guidance. However, a persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. While text prompts offer semantic direction, they often fail to precisely recover fine-grained structural details, such as the object's overall posture, ensuring alignment with the visible object information in the background. This limitation stems from the inability of text prompts to provide pixel-level specificity. To address this, we propose supplementing text-based guidance with a novel visual aid: a casual sketch, which can be roughly drawn by anyone based on visible object parts. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background. We introduce the Visual Sketch Self-Aware (VSSA) model, which integrates the casual sketch into each iterative step of the diffusion process, offering distinct advantages for partially corrupted scenarios. By blending sketch-derived features with those of the corrupted image, and leveraging text prompt guidance, the VSSA assists the diffusion model in generating images that preserve both the intended object semantics and structural consistency across the restored objects and original regions. To support this research, we created two datasets, CUB-sketch and MSCOCO-sketch, each combining images, sketches, and text. Extensive qualitative and quantitative experiments demonstrate that our approach outperforms several state-of-the-art methods.
Related papers
- SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches [116.1810651297801]
SketchYourSeg establishes freehand sketches as a powerful query modality for subjective image segmentation.
Our evaluations demonstrate superior performance over existing approaches across diverse benchmarks.
arXiv Detail & Related papers (2025-01-27T13:07:51Z)
- Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models [78.90023746996302]
Add-it is a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources.
Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement.
Human evaluations show that Add-it is preferred in over 80% of cases.
arXiv Detail & Related papers (2024-11-11T18:50:09Z)
- Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions [30.148969711689773]
We present a novel approach designed to address the complexities posed by challenging, out-of-distribution data in the single-image depth estimation task.
We systematically generate new, user-defined scenes with a comprehensive set of challenges and associated depth information.
This is achieved by leveraging cutting-edge text-to-image diffusion models with depth-aware control.
arXiv Detail & Related papers (2024-07-23T17:59:59Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [5.452759083801634]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [12.057465578064345]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- SIEDOB: Semantic Image Editing by Disentangling Object and Background [5.149242555705579]
We propose a novel paradigm for semantic image editing, SIEDOB, the core idea of which is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds.
We conduct extensive experiments on the Cityscapes and ADE20K-Room datasets and show that our method remarkably outperforms the baselines.
arXiv Detail & Related papers (2023-03-23T06:17:23Z)
- Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that this disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z)