IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning
- URL: http://arxiv.org/abs/2204.00792v1
- Date: Sat, 2 Apr 2022 07:48:39 GMT
- Title: IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning
- Authors: Zhenhuan Liu, Jincan Deng, Liang Li, Shaofei Cai, Qianqian Xu, Shuhui Wang, Qingming Huang
- Abstract summary: The Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason about the consistency between the visual increment in images and the semantic increment in instructions.
First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment.
Second, we embed the representation of the semantic increment into that of the source image for generating the target image, where the source image serves as a referring auxiliary.
- Score: 110.7118381246156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conditional image generation is an active research topic that includes text2image and image translation.
Recently, image manipulation with linguistic instructions has brought new challenges to multimodal conditional generation.
However, traditional conditional image generation models mainly focus on generating high-quality, visually realistic images and fail to resolve the partial consistency between image and instruction.
To address this issue, we propose an Increment Reasoning Generative Adversarial Network (IR-GAN), which aims to reason about the consistency between the visual increment in images and the semantic increment in instructions.
First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment.
Second, we embed the representation of the semantic increment into that of the source image for generating the target image, where the source image serves as a referring auxiliary.
Finally, we propose a reasoning discriminator to measure the consistency between the visual increment and the semantic increment, which purifies the user's intention and guarantees the logical coherence of the generated target image.
Extensive experiments and visualizations conducted on two datasets show the effectiveness of IR-GAN.
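To make the three-step design concrete, below is a minimal PyTorch sketch of the increment-reasoning idea; the encoder architectures, dimensions, and the GRU instruction encoder are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    """Stands in for the paper's word-level/instruction-level encoders:
    a GRU over word embeddings whose final state is taken as the
    semantic increment of the current instruction (an assumption)."""
    def __init__(self, vocab=5000, emb=64, txt_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, txt_dim, batch_first=True)

    def forward(self, tokens):
        _, h = self.gru(self.emb(tokens))
        return h[-1]

class ReasoningDiscriminator(nn.Module):
    """Scores consistency between the visual increment (difference of
    target and source image features) and the semantic increment."""
    def __init__(self, img_dim=256, txt_dim=128):
        super().__init__()
        self.img_enc = nn.Sequential(  # shared image encoder (illustrative)
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, img_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(     # joint consistency head
            nn.Linear(img_dim + txt_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, source, target, sem_inc):
        vis_inc = self.img_enc(target) - self.img_enc(source)
        return self.head(torch.cat([vis_inc, sem_inc], dim=1))

# Usage with dummy tensors.
enc, disc = InstructionEncoder(), ReasoningDiscriminator()
src, tgt = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
instr = torch.randint(0, 5000, (2, 12))
print(disc(src, tgt, enc(instr)).shape)  # torch.Size([2, 1])
```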
Related papers
- Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation [90.71613903956451]
Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-24T13:39:51Z)
- Wavelet-based Unsupervised Label-to-Image Translation [9.339522647331334]
We propose a new unsupervised paradigm for semantic image synthesis (USIS) that makes use of a self-supervised segmentation loss and whole-image wavelet-based discrimination.
We test our methodology on 3 challenging datasets and demonstrate its ability to bridge the performance gap between paired and unpaired models.
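As a hedged illustration of wavelet-based discrimination (not the paper's exact discriminator), the sketch below decomposes an image into DWT subbands with PyWavelets and feeds them to a small critic:
```python
import numpy as np
import pywt
import torch
import torch.nn as nn

def wavelet_subbands(img: np.ndarray) -> np.ndarray:
    """Single-level 2D Haar DWT per channel; returns the four subbands
    (LL, LH, HL, HH) stacked as extra channels. Illustrative only."""
    bands = []
    for ch in img:  # img: (C, H, W)
        cA, (cH, cV, cD) = pywt.dwt2(ch, "haar")
        bands.extend([cA, cH, cV, cD])
    return np.stack(bands).astype(np.float32)  # (4*C, H/2, W/2)

# A critic that sees wavelet subbands instead of raw pixels (an
# assumption about how the wavelet discrimination is wired up).
critic = nn.Sequential(
    nn.Conv2d(12, 64, 3, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
)

img = np.random.rand(3, 64, 64).astype(np.float32)
x = torch.from_numpy(wavelet_subbands(img)).unsqueeze(0)  # (1, 12, 32, 32)
print(critic(x).shape)  # torch.Size([1, 1])
```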
arXiv Detail & Related papers (2023-05-16T17:48:44Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
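The paper injects internal diffusion features to preserve structure; as a far simpler stand-in for the general text-driven image-to-image setup it builds on, one can noise the source image and denoise it under the target prompt (an SDEdit-style baseline with the diffusers library, not the paper's method):
```python
# A hedged sketch: SDEdit-style img2img with a pre-trained diffusion model.
# This is NOT the paper's plug-and-play feature injection; it only shows
# the text-driven image-to-image setting the paper operates in.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("source.png").convert("RGB").resize((512, 512))
# strength controls how much of the source layout survives denoising.
result = pipe(
    prompt="a watercolor painting of the same scene",
    image=source,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
result.save("translated.png")
```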
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- LatteGAN: Visually Guided Language Attention for Multi-Turn Text-Conditioned Image Manipulation [0.0]
We present a novel architecture called a Visually Guided Language Attention GAN (LatteGAN).
LatteGAN extracts fine-grained text representations for the generator, and discriminates both the global and local representations of fake or real images.
Experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.
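A minimal sketch of a discriminator with both global (whole-image) and local (patch-level) heads, in the spirit of the description above; all layer sizes are illustrative:
```python
import torch
import torch.nn as nn

class GlobalLocalDiscriminator(nn.Module):
    """Discriminates a global image representation and local patch
    representations, as the summary above describes; details assumed."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.local_head = nn.Conv2d(128, 1, 1)   # per-patch logits
        self.global_head = nn.Sequential(        # whole-image logit
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1)
        )

    def forward(self, x):
        feats = self.backbone(x)
        return self.global_head(feats), self.local_head(feats)

disc = GlobalLocalDiscriminator()
g, l = disc(torch.randn(2, 3, 64, 64))
print(g.shape, l.shape)  # torch.Size([2, 1]) torch.Size([2, 1, 16, 16])
```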
arXiv Detail & Related papers (2021-12-28T03:50:03Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
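A hedged sketch of multi-granularity text encoding, producing sentence-, word-, and aspect-level features from one recurrent encoder; the attention-based aspect pooling here is an assumption, not the paper's design:
```python
import torch
import torch.nn as nn

class MultiGranularityTextEncoder(nn.Module):
    """Produces sentence-, word-, and aspect-level text features.
    Aspect features via learned attention pooling (illustrative)."""
    def __init__(self, vocab=5000, emb=64, dim=128, n_aspects=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, dim, batch_first=True, bidirectional=True)
        # One attention query per aspect (an assumption for this sketch).
        self.aspect_queries = nn.Parameter(torch.randn(n_aspects, 2 * dim))

    def forward(self, tokens):
        words, h = self.gru(self.emb(tokens))      # (B, T, 2*dim)
        sentence = torch.cat([h[-2], h[-1]], -1)   # (B, 2*dim)
        attn = torch.softmax(
            self.aspect_queries @ words.transpose(1, 2), dim=-1
        )                                          # (B, n_aspects, T)
        aspects = attn @ words                     # (B, n_aspects, 2*dim)
        return sentence, words, aspects

enc = MultiGranularityTextEncoder()
s, w, a = enc(torch.randint(0, 5000, (2, 12)))
print(s.shape, w.shape, a.shape)
# torch.Size([2, 256]) torch.Size([2, 12, 256]) torch.Size([2, 4, 256])
```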
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate information from local neighbors.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
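A minimal sketch of one gated graph-convolution step over detected-region features; the gating form is an assumption and may differ from the paper's Gated GCN:
```python
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    """One gated aggregation step over a graph of region features:
    each node sums gated, transformed neighbor messages. Illustrative."""
    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, h, adj):
        # h: (B, N, dim) region features; adj: (B, N, N) 0/1 adjacency.
        B, N, D = h.shape
        hi = h.unsqueeze(2).expand(B, N, N, D)       # receiver nodes
        hj = h.unsqueeze(1).expand(B, N, N, D)       # sender nodes
        g = self.gate(torch.cat([hi, hj], dim=-1))   # edge-wise gates
        msgs = g * self.msg(hj) * adj.unsqueeze(-1)  # mask non-edges
        return torch.relu(h + msgs.sum(dim=2))       # residual update

layer = GatedGraphConv()
h = torch.randn(2, 5, 256)                  # 5 detected regions
adj = (torch.rand(2, 5, 5) > 0.5).float()   # toy relationship graph
print(layer(h, adj).shape)                  # torch.Size([2, 5, 256])
```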
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) a natural-language instruction that describes the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)