Remember What You have drawn: Semantic Image Manipulation with Memory
- URL: http://arxiv.org/abs/2107.12579v1
- Date: Tue, 27 Jul 2021 03:41:59 GMT
- Title: Remember What You have drawn: Semantic Image Manipulation with Memory
- Authors: Xiangxi Shi, Zhonghua Wu, Guosheng Lin, Jianfei Cai and Shafiq Joty
- Abstract summary: We propose a memory-based Image Manipulation Network (MIM-Net) to generate realistic and text-conformed manipulated images.
To learn a robust memory, we propose a novel randomized memory training loss.
Experiments on four popular datasets show that our method outperforms existing approaches.
- Score: 84.74585786082388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image manipulation with natural language, which aims to manipulate images
with the guidance of language descriptions, has been a challenging problem in
computer vision and natural language processing (NLP). A number of efforts have
been made on this task, but their results are still far from realistic,
text-conformed manipulated images. In this paper, we therefore propose a
memory-based Image Manipulation Network (MIM-Net), in which a set of memories
learned from images is used to synthesize texture information under the
guidance of the textual description. We propose a two-stage network with an
additional reconstruction stage to learn the latent memories efficiently. To
avoid unnecessary background changes, we propose a Target Localization Unit
(TLU) that focuses manipulation on the region mentioned by the text. Moreover,
to learn a robust memory, we propose a novel randomized memory training loss.
Experiments on four popular datasets show that our method outperforms existing
approaches.
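Since the abstract names MIM-Net's components without giving their equations, the sketch below is only a rough, hypothetical PyTorch rendering: the memory read as an attention lookup over a learned bank of texture vectors, the TLU as a predicted soft mask that blends edited features into the source while leaving the background untouched, and the randomized memory loss as random slot dropping during training. Every module name, shape, and design choice here is an assumption, not the authors' implementation.

```python
# Hypothetical sketch, not the authors' code: memory size, feature
# dimensions, and the gating scheme are all assumptions.
import torch
import torch.nn as nn

class MemoryRead(nn.Module):
    """Attention lookup over a learned bank of texture memories."""
    def __init__(self, num_slots=64, dim=256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, text_feat):                         # (B, dim)
        q = self.query_proj(text_feat)                    # (B, dim)
        attn = torch.softmax(q @ self.memory.t(), dim=-1) # (B, num_slots)
        return attn @ self.memory                         # retrieved texture code

class TLU(nn.Module):
    """Target Localization Unit: predicts a soft mask of the region the
    text mentions, so edits are blended in only there and background
    features pass through unchanged."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_mask = nn.Conv2d(2 * dim, 1, kernel_size=1)

    def forward(self, img_feat, edited_feat, text_feat):
        # img_feat, edited_feat: (B, dim, H, W); text_feat: (B, dim)
        t = text_feat[:, :, None, None].expand(-1, -1, *img_feat.shape[2:])
        mask = torch.sigmoid(self.to_mask(torch.cat([img_feat, t], dim=1)))
        return mask * edited_feat + (1 - mask) * img_feat

def randomize_memory(memory, drop_p=0.3):
    """One plausible reading of the 'randomized memory training loss':
    randomly drop memory slots during training so reads stay robust to
    missing slots (an assumption; the paper defines the actual loss)."""
    keep = (torch.rand(memory.size(0), 1, device=memory.device) > drop_p).float()
    return memory * keep
```

The two-stage training the abstract describes would first optimize a reconstruction objective so the memory bank absorbs real image textures before the manipulation stage reuses them; that staging is from the abstract, everything else above is guesswork.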
Related papers
- TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training [6.34265125858783]
We propose a noise-robust framework for efficient vision-language pre-training that requires less pre-training data.
Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer.
We introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning.
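The abstract says only that a transformer bridges a frozen image encoder and a large language model; below is a minimal sketch of one common form of such a bridge (learnable queries cross-attending to image features, then projected into the LLM's embedding space). All names and dimensions are assumptions, and the noise-adaptive and concept-enhanced strategies are not modeled.

```python
# Minimal sketch of a bridging transformer between a frozen image
# encoder and an LLM (names and dims are assumptions).
import torch
import torch.nn as nn

class Bridge(nn.Module):
    def __init__(self, img_dim=1024, llm_dim=4096, n_queries=32, depth=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, img_dim))
        layer = nn.TransformerDecoderLayer(d_model=img_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.to_llm = nn.Linear(img_dim, llm_dim)  # project into LLM token space

    def forward(self, img_feats):                  # (B, N_patches, img_dim), frozen
        q = self.queries.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        fused = self.decoder(tgt=q, memory=img_feats)  # queries attend to image
        return self.to_llm(fused)                  # (B, n_queries, llm_dim) soft prompts
```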
arXiv Detail & Related papers (2024-09-15T01:54:17Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with richer semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of irrelevant retrieved examples and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT, and WebVision datasets.
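As a hypothetical illustration of an attention-based memory module over retrieved examples (the paper's exact formulation may differ), relevance can be computed as scaled dot-product attention so that irrelevant neighbors receive near-zero weight:

```python
# Hypothetical sketch: weigh K retrieved examples by attention to the
# query embedding; shapes and the residual fusion are assumptions.
import torch
import torch.nn.functional as F

def aggregate_retrieved(query, retrieved):
    """query: (B, D); retrieved: (B, K, D) embeddings of K retrieved
    examples. Irrelevant examples get low attention weight, so they
    barely influence the refined representation."""
    scores = torch.einsum('bd,bkd->bk', query, retrieved) / query.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)             # (B, K) importance per example
    refined = torch.einsum('bk,bkd->bd', weights, retrieved)
    return query + refined                          # residual fusion (assumption)
```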
arXiv Detail & Related papers (2023-04-11T12:12:05Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results on the COCO dataset demonstrate that employing an explicit external memory aids the generation process and increases caption quality.
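A rough sketch of what a kNN-augmented attention step could look like, assuming the retriever indexes corpus entries by visual embedding and the decoder attends to the retrieved token embeddings; function names, shapes, and the residual fusion are assumptions, not the paper's architecture:

```python
# Sketch: retrieve captions of the k visually closest corpus entries,
# then let decoder states attend to their token embeddings.
import torch
import torch.nn.functional as F

def knn_retrieve(img_emb, corpus_img_embs, corpus_token_embs, k=5):
    """img_emb: (B, D); corpus_img_embs: (N, D); corpus_token_embs: (N, L, D)."""
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(corpus_img_embs, dim=-1).t()
    idx = sims.topk(k, dim=-1).indices              # (B, k) nearest entries
    return corpus_token_embs[idx]                   # (B, k, L, D)

def knn_augmented_attention(dec_states, retrieved):
    """dec_states: (B, T, D); retrieved memory is flattened to (B, k*L, D)."""
    B, k, L, D = retrieved.shape
    mem = retrieved.reshape(B, k * L, D)
    attn = torch.softmax(dec_states @ mem.transpose(1, 2) / D ** 0.5, dim=-1)
    return dec_states + attn @ mem                  # residual read from kNN memory
```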
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Memory-Based Label-Text Tuning for Few-Shot Class-Incremental Learning [20.87638654650383]
We propose leveraging label-text information by adopting a memory prompt.
The memory prompt can learn new data sequentially while storing previous knowledge.
Experiments show that our proposed method outperforms all prior state-of-the-art approaches.
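The abstract leaves the memory prompt's form open; one common reading is a set of learnable tokens prepended to the input sequence, updated across sessions while the backbone stays frozen. The sketch below follows that reading and is an assumption, not the paper's definition:

```python
# Minimal sketch of a memory prompt as learnable prefix tokens
# (token count, dim, and zero init are assumptions).
import torch
import torch.nn as nn

class MemoryPrompt(nn.Module):
    def __init__(self, n_tokens=16, dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(n_tokens, dim))

    def forward(self, token_embs):                  # (B, T, dim)
        p = self.prompt.unsqueeze(0).expand(token_embs.size(0), -1, -1)
        return torch.cat([p, token_embs], dim=1)    # (B, n_tokens + T, dim)
```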
arXiv Detail & Related papers (2022-07-03T13:15:45Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
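The dual-encoder contrastive objective described here is the standard symmetric InfoNCE form; a minimal sketch (temperature and dimensions assumed):

```python
# Symmetric contrastive loss over a batch of aligned image-text pairs:
# matched pairs are pulled together, all other in-batch pairs act as
# negatives (temperature value is an assumption).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.05):
    """img_emb, txt_emb: (B, D) embeddings of aligned image-text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # average the image-to-text and text-to-image retrieval losses
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```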
arXiv Detail & Related papers (2021-02-11T10:08:12Z)