Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text
Retrieval
- URL: http://arxiv.org/abs/2207.14428v1
- Date: Fri, 29 Jul 2022 01:21:54 GMT
- Title: Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text
Retrieval
- Authors: Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
- Abstract summary: We propose a novel framework for paired data augmentation by uncovering the hidden semantic information of the StyleGAN2 model.
We generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module.
We evaluate the efficacy of our augmented data approach on two public cross-modal retrieval datasets.
- Score: 142.047662926209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the open research problem of generating text-image
pairs to improve training for the fine-grained image-to-text cross-modal
retrieval task, and proposes a novel framework for paired data augmentation by
uncovering the hidden semantic information of the StyleGAN2 model. Specifically, we
first train a StyleGAN2 model on the given dataset. We then project the real
images back to the latent space of StyleGAN2 to obtain the latent codes. To
make the generated images manipulable, we further introduce a latent space
alignment module to learn the alignment between StyleGAN2 latent codes and the
corresponding textual caption features. For online paired data
augmentation, we first generate augmented text through random token
replacement, then pass the augmented text into the latent space alignment
module to output the latent codes, which are finally fed to StyleGAN2 to
generate the augmented images. We evaluate the efficacy of our augmented data
approach on two public cross-modal retrieval datasets. The promising
experimental results demonstrate that the augmented text-image pairs can be
used for training together with the original data to boost image-to-text
cross-modal retrieval performance.
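The online augmentation step described in the abstract (random token replacement, then text-to-latent alignment, then image generation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `replace_prob` parameter, the vocabulary, and the two stand-in callables `text_to_latent` and `generator` are all hypothetical, and the abstract does not say how replacement tokens are sampled.

```python
import random


def random_token_replacement(tokens, vocabulary, replace_prob=0.15, rng=None):
    """Return a copy of `tokens` where each token is swapped for a random
    vocabulary entry with probability `replace_prob` (assumed value)."""
    rng = rng or random.Random()
    augmented = []
    for token in tokens:
        if rng.random() < replace_prob:
            augmented.append(rng.choice(vocabulary))
        else:
            augmented.append(token)
    return augmented


def augment_pair(tokens, vocabulary, text_to_latent, generator,
                 replace_prob=0.15, rng=None):
    """Produce one augmented (text, image) pair.

    `text_to_latent` stands in for the latent space alignment module and
    `generator` for the trained StyleGAN2 generator; both are hypothetical
    callables supplied by the caller.
    """
    new_tokens = random_token_replacement(tokens, vocabulary, replace_prob, rng)
    latent_code = text_to_latent(new_tokens)   # text features -> latent code
    image = generator(latent_code)             # latent code -> augmented image
    return new_tokens, image


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any trained model.
    vocab = ["tomato", "basil", "cheese", "onion", "garlic"]
    caption = ["pasta", "with", "tomato", "and", "basil"]
    fake_alignment = lambda toks: [float(len(t)) for t in toks]
    fake_generator = lambda z: {"latent": z}
    text, image = augment_pair(caption, vocab, fake_alignment, fake_generator,
                               replace_prob=0.3, rng=random.Random(0))
    print(text, image)
```

Passing the alignment module and the generator as callables keeps the text augmentation independent of any particular model. In a real setup they would be the trained networks, and each augmented pair would be mixed with the original training data.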
Related papers
- Generating Intermediate Representations for Compositional Text-To-Image Generation [16.757550214291015]
We propose a compositional approach for text-to-image generation based on two stages.
In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations conditioned on text.
In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model.
arXiv Detail & Related papers (2024-10-13T10:24:55Z)
- Style Generation: Image Synthesis based on Coarsely Matched Texts [10.939482612568433]
We introduce a novel task called text-based style generation and propose a two-stage generative adversarial network.
The first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature.
The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization.
arXiv Detail & Related papers (2023-09-08T21:51:11Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork [38.55086153299993]
We develop an approach for text-to-image generation that embraces additional retrieval images, driven by a combination of implicit visual guidance loss and generative objectives.
We propose a novel hypernetwork modulated visual-text encoding scheme to predict the weight update of the encoding layer.
Experimental results show that our model guided with additional retrieval visual data outperforms existing GAN-based models.
arXiv Detail & Related papers (2022-08-17T19:25:00Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
The visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.