Embedding Arithmetic for Text-driven Image Transformation
- URL: http://arxiv.org/abs/2112.03162v1
- Date: Mon, 6 Dec 2021 16:51:50 GMT
- Title: Embedding Arithmetic for Text-driven Image Transformation
- Authors: Guillaume Couairon, Matthieu Cord, Matthijs Douze, Holger Schwenk
- Abstract summary: Text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man.
Recent works aiming at bridging this semantic gap embed images and text into a multimodal space.
We introduce the SIMAT dataset to evaluate the task of text-driven image transformation.
- Score: 48.7704684871689
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Latent text representations exhibit geometric regularities, such as the
famous analogy: queen is to king what woman is to man. Such structured semantic
relations were not demonstrated on image representations. Recent works aiming
at bridging this semantic gap embed images and text into a multimodal space,
enabling the transfer of text-defined transformations to the image modality.
We introduce the SIMAT dataset to evaluate the task of text-driven image
transformation. SIMAT contains 6k images and 18k "transformation queries" that
aim at either replacing scene elements or changing their pairwise
relationships. The goal is to retrieve an image consistent with the (source
image, transformation) query. We use an image/text matching oracle (OSCAR) to
assess whether the image transformation is successful. The SIMAT dataset will
be publicly available.
We use SIMAT to show that vanilla CLIP multimodal embeddings are not very
well suited for text-driven image transformation, but that a simple finetuning
on the COCO dataset can bring dramatic improvements. We also study whether it
is beneficial to leverage the geometric properties of pretrained universal
sentence encoders (FastText, LASER and LaBSE).
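
The abstract describes transferring a text-defined transformation to an image embedding and then retrieving a matching image. The abstract does not spell out the exact formula, so the following is only a minimal sketch of one plausible form of such embedding arithmetic: add the difference between a target and a source text embedding to the image embedding, then retrieve by cosine similarity. The function names, the `scale` parameter, and the toy 3-dimensional embeddings are all hypothetical and stand in for real multimodal embeddings such as those of CLIP.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def transform_query(image_emb, source_text_emb, target_text_emb, scale=1.0):
    """Shift the image embedding along the text delta (target - source).

    `scale` is a hypothetical knob controlling how strongly the text
    transformation is applied; it is an assumption of this sketch.
    """
    delta = [t - s for t, s in zip(target_text_emb, source_text_emb)]
    return normalize([i + scale * d for i, d in zip(image_emb, delta)])

def retrieve(query_emb, database):
    """Return the name of the database entry with the highest cosine similarity."""
    return max(database, key=lambda name: dot(query_emb, normalize(database[name])))

# Toy embeddings (assumed): axis 0 = "cat", axis 1 = "dog", axis 2 = "on grass".
database = {
    "cat_on_grass": [1.0, 0.0, 1.0],
    "dog_on_grass": [0.0, 1.0, 1.0],
    "dog_indoors": [0.0, 1.0, 0.0],
}
source_image = normalize(database["cat_on_grass"])
text_cat = normalize([1.0, 0.0, 0.0])
text_dog = normalize([0.0, 1.0, 0.0])

# Transformation query: replace "cat" with "dog" in the source image.
query = transform_query(source_image, text_cat, text_dog, scale=1.0)
best = retrieve(query, database)
print(best)
```

In this toy setup the retrieved image keeps the "on grass" component of the source image while swapping the animal, which is the kind of behavior a (source image, transformation) query is meant to test.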
Related papers
- Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis [37.32270579534541]
We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts.
Our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches in the Multi-Modal CelebA-HQ and the Cityscapes dataset.
arXiv Detail & Related papers (2023-08-16T05:59:33Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Bi-directional Training for Composed Image Retrieval via Text Prompt Learning [46.60334745348141]
Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text.
We propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures.
Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model.
arXiv Detail & Related papers (2023-03-29T11:37:41Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvement.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements: (1) to edit the entity consistently with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
The visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
The instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
- Image Captioning through Image Transformer [29.91581534937757]
We introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer.
Our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.
arXiv Detail & Related papers (2020-04-29T14:30:57Z)
- SwapText: Image Based Texts Transfer in Scenes [13.475726959175057]
We present SwapText, a framework to transfer texts across scene images.
A novel text swapping network is proposed to replace text labels only in the foreground image.
The generated foreground image and background image are used to generate the word image by the fusion network.
arXiv Detail & Related papers (2020-03-18T11:02:17Z)