Describe What to Change: A Text-guided Unsupervised Image-to-Image
Translation Approach
- URL: http://arxiv.org/abs/2008.04200v1
- Date: Mon, 10 Aug 2020 15:40:05 GMT
- Title: Describe What to Change: A Text-guided Unsupervised Image-to-Image
Translation Approach
- Authors: Yahui Liu, Marco De Nadai, Deng Cai, Huayang Li, Xavier
Alameda-Pineda, Nicu Sebe and Bruno Lepri
- Abstract summary: We propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence.
Our model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description.
Experiments show that the proposed model achieves promising performance on two large-scale public datasets.
- Score: 84.22327278486846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Manipulating visual attributes of images through human-written text is a very
challenging task. On the one hand, models have to learn the manipulation
without the ground truth of the desired output. On the other hand, models have
to deal with the inherent ambiguity of natural language. Previous research
usually requires either the user to describe all the characteristics of the
desired image or to use richly-annotated image captioning datasets. In this
work, we propose a novel unsupervised approach, based on image-to-image
translation, that alters the attributes of a given image through a command-like
sentence such as "change the hair color to black". Contrary to state-of-the-art approaches, our model requires neither a human-annotated dataset nor a textual description of all the attributes of the desired image, but only of those that have to be modified. Our proposed model disentangles the
image content from the visual attributes, and it learns to modify the latter
using the textual description, before generating a new image from the content
and the modified attribute representation. Because text might be inherently
ambiguous (blond hair may refer to different shades of blond, e.g. golden,
icy, sandy), our method generates multiple stochastic versions of the same
translation. Experiments show that the proposed model achieves promising performance on two large-scale public datasets: CelebA and CUB. We believe our
approach will pave the way to new avenues of research combining textual and
speech commands with visual attributes.
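To make the pipeline described in the abstract concrete, the sketch below wires together a content encoder, an attribute encoder, a text-conditioned attribute editor, and a decoder, with a noise input that produces multiple stochastic versions of the same translation. This is a minimal illustration only: all module names, layer sizes, the GRU text encoder, and the token handling are assumptions chosen to keep the example runnable, not the authors' architecture, and the training objectives are omitted entirely.

```python
# Minimal PyTorch sketch of the described pipeline: split an image into a
# content code and an attribute code, let the text command (plus noise)
# edit only the attribute code, then decode content + edited attributes.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, 2, 1), nn.ReLU(),
        )

    def forward(self, img):            # spatial content code
        return self.net(img)

class AttributeEncoder(nn.Module):
    def __init__(self, dim=64, attr_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, attr_dim),
        )

    def forward(self, img):            # global attribute vector
        return self.net(img)

class TextEditor(nn.Module):
    """Maps the command sentence + current attributes + noise to edited attributes."""
    def __init__(self, vocab_size=1000, emb=64, attr_dim=8, noise_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, emb, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(emb + attr_dim + noise_dim, 64), nn.ReLU(),
            nn.Linear(64, attr_dim),
        )

    def forward(self, tokens, attr, noise):
        _, h = self.rnn(self.embed(tokens))        # sentence summary
        return self.mlp(torch.cat([h[-1], attr, noise], dim=1))

class Decoder(nn.Module):
    def __init__(self, dim=64, attr_dim=8):
        super().__init__()
        self.fuse = nn.Conv2d(dim + attr_dim, dim, 1)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, content, attr):
        # broadcast the attribute vector over the spatial content code
        attr_map = attr[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.up(self.fuse(torch.cat([content, attr_map], dim=1)))

# One translation: different noise samples give different plausible edits
# (e.g. golden vs. sandy blond) for the same command.
img = torch.randn(1, 3, 128, 128)
tokens = torch.randint(0, 1000, (1, 6))            # e.g. "change the hair color to black"
E_c, E_a, T, G = ContentEncoder(), AttributeEncoder(), TextEditor(), Decoder()
content, attr = E_c(img), E_a(img)
for _ in range(3):
    z = torch.randn(1, 8)
    out = G(content, T(tokens, attr, z))
    print(out.shape)                               # torch.Size([1, 3, 128, 128])
```

Sampling the noise vector several times for a fixed image and command is what yields the "multiple stochastic versions of the same translation" mentioned above; only the attribute code changes, while the content code is reused unchanged.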
Related papers
- Composed Image Retrieval for Remote Sensing [24.107610091033997]
This work introduces composed image retrieval to remote sensing.
It allows querying a large image archive with image examples complemented by a textual description.
A novel method fusing image-to-image and text-to-image similarity is introduced.
arXiv Detail & Related papers (2024-05-24T14:18:31Z) - DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images [55.546024767130994]
We propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve.
It relies on word alignments between a description of the original source image and the instruction describing the needed updates, as well as on the input image.
It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream.
arXiv Detail & Related papers (2024-04-27T22:45:47Z) - ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts.
We show that, for some attributes, images can represent concepts more expressively than text.
We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z) - ProSpect: Prompt Spectrum for Attribute-Aware Personalization of
Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
arXiv Detail & Related papers (2023-05-25T16:32:01Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - CIGLI: Conditional Image Generation from Language & Image [5.159265382427163]
We propose a new task called CIGLI: Conditional Image Generation from Language and Image.
Instead of generating an image based on text as in text-image generation, this task requires the generation of an image from a textual description and an image prompt.
arXiv Detail & Related papers (2021-08-20T00:58:42Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z) - Text-Guided Neural Image Inpainting [20.551488941041256]
The inpainting task requires filling a corrupted image with content coherent with its context.
The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text.
We propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet).
arXiv Detail & Related papers (2020-04-07T09:04:43Z)