Text-Guided Neural Image Inpainting
- URL: http://arxiv.org/abs/2004.03212v4
- Date: Mon, 22 Mar 2021 08:12:02 GMT
- Title: Text-Guided Neural Image Inpainting
- Authors: Lisai Zhang, Qingcai Chen, Baotian Hu, and Shuoran Jiang
- Abstract summary: The image inpainting task requires filling a corrupted image with content coherent with its context.
The goal of this paper is to fill in the semantic information of corrupted images according to the provided descriptive text.
We propose a novel inpainting model named the Text-Guided Dual Attention Inpainting Network (TDANet).
- Score: 20.551488941041256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The image inpainting task requires filling a corrupted image with content coherent with the context. This research field has achieved promising progress through neural image inpainting methods. Nevertheless, a critical challenge remains in guessing the missing content from only the context pixels. The goal of this paper is to fill in the semantic information of corrupted images according to the provided descriptive text. Unlike existing text-guided image generation works, inpainting models are required to compare the semantic content of the given text with the remaining part of the image, and then identify the semantic content that should be filled in for the missing part. To fulfill such a task, we propose a novel inpainting model named the Text-Guided Dual Attention Inpainting Network (TDANet). Firstly, a dual multimodal attention mechanism is designed to extract explicit semantic information about the corrupted regions by comparing the descriptive text and the complementary image areas through reciprocal attention. Secondly, an image-text matching loss is applied to maximize the semantic similarity between the generated image and the text. Experiments are conducted on two open datasets. Results show that the proposed TDANet model achieves a new state of the art on both quantitative and qualitative measures. Result analysis suggests that the generated images are consistent with the guidance text, enabling the generation of varied results by providing different descriptions. Code is available at https://github.com/idealwhite/TDANet
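To make the two components in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of (1) a reciprocal attention step that lets word features and features of the remaining (uncorrupted) image regions attend to each other, and (2) a generic hinge-based image-text matching loss. The module names, dimensions, pooling, and margin formulation are illustrative assumptions rather than the authors' implementation; the official code at https://github.com/idealwhite/TDANet is the reference.

```python
# Illustrative sketch only; not the official TDANet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReciprocalAttention(nn.Module):
    """Hypothetical dual (reciprocal) attention between text tokens and image regions."""

    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats:  (B, T, text_dim)  word-level features of the guidance text
        # image_feats: (B, R, image_dim) region features from the uncorrupted context
        t = self.text_proj(text_feats)         # (B, T, H)
        v = self.image_proj(image_feats)       # (B, R, H)
        sim = torch.bmm(t, v.transpose(1, 2))  # (B, T, R) word-region similarities
        # Text-to-image attention: each word gathers the image regions it refers to.
        text_attended = torch.bmm(F.softmax(sim, dim=2), v)                   # (B, T, H)
        # Image-to-text attention: each region gathers the words that describe it.
        image_attended = torch.bmm(F.softmax(sim, dim=1).transpose(1, 2), t)  # (B, R, H)
        return text_attended, image_attended


def image_text_matching_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Generic in-batch hinge loss that pulls a generated image toward its guidance text.

    image_emb, text_emb: (B, D) pooled embeddings; non-matching pairs in the
    batch act as negatives. The margin value is an assumption for illustration.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = image_emb @ text_emb.t()              # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)               # matching pairs on the diagonal
    cost = (margin + scores - pos).clamp(min=0)    # hinge over in-batch negatives
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost.masked_fill(mask, 0.0).mean()


if __name__ == "__main__":
    attn = ReciprocalAttention(text_dim=300, image_dim=512)
    words = torch.randn(2, 12, 300)    # 2 captions, 12 word embeddings each
    regions = torch.randn(2, 49, 512)  # 7x7 grid of context-region features
    t_att, v_att = attn(words, regions)
    loss = image_text_matching_loss(v_att.mean(dim=1), t_att.mean(dim=1))
    print(t_att.shape, v_att.shape, loss.item())
```

The in-batch hinge formulation above is only one way to realize an image-text matching objective; a softmax-based contrastive loss would serve the same purpose of keeping the inpainted result consistent with the description.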
Related papers
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- CLII: Visual-Text Inpainting via Cross-Modal Predictive Interaction [23.683636588751753]
State-of-the-art inpainting methods are mainly designed for natural images and cannot correctly recover text within scene text images.
We identify the visual-text inpainting task to achieve high-quality scene text image restoration and text completion.
arXiv Detail & Related papers (2024-07-23T06:12:19Z)
- Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving the background and style of the original text.
Previous methods of editing the whole image have to learn different translation rules for background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text images.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state-of-the-art by 3.72% mAP and 5.39% mAP, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- CIGLI: Conditional Image Generation from Language & Image [5.159265382427163]
We propose a new task called CIGLI: Conditional Image Generation from Language and Image.
Instead of generating an image based on text as in text-image generation, this task requires the generation of an image from a textual description and an image prompt.
arXiv Detail & Related papers (2021-08-20T00:58:42Z)
- Context-Aware Image Inpainting with Learned Semantic Priors [100.99543516733341]
We introduce pretext tasks that are semantically meaningful for estimating the missing contents.
We propose a context-aware image inpainting model, which adaptively integrates global semantics and local features.
arXiv Detail & Related papers (2021-06-14T08:09:43Z)
- VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks [5.840117063192334]
We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input.
We encode the objects, attributes, and relations extracted from the scene graph, together with the corresponding geometric relation information, using Graph Convolutional Networks.
The text representation is aggregated with word-level and sentence-level embeddings to generate both visual contextual word and sentence representations.
arXiv Detail & Related papers (2020-10-07T05:25:30Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
- Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach [84.22327278486846]
We propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence.
Our model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description.
Experiments show that the proposed model achieves promising performance on two large-scale public datasets.
arXiv Detail & Related papers (2020-08-10T15:40:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.