Image Generation from Image Captioning -- Invertible Approach
- URL: http://arxiv.org/abs/2410.20171v1
- Date: Sat, 26 Oct 2024 13:02:58 GMT
- Title: Image Generation from Image Captioning -- Invertible Approach
- Authors: Nandakishore S Menon, Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi,
- Abstract summary: We train an invertible model that learns a one-to-one mapping between the image and text embeddings.
Once the invertible model is efficiently trained on one task (image captioning), the same model can generate new images for a given text.
- Score: 0.0
- Abstract: Our work aims to build a model that performs the dual tasks of image captioning and image generation while being trained on only one task. The central idea is to train an invertible model that learns a one-to-one mapping between the image and text embeddings. Once the invertible model is efficiently trained on one task (image captioning), the same model can generate new images for a given text through the inversion process, with no additional training. This paper proposes a simple invertible neural network architecture for this problem and presents our current findings.
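The abstract does not spell out the architecture, so the snippet below is only a minimal sketch of one way such a bijective embedding-to-embedding mapping could look: NICE-style additive coupling layers in PyTorch, trained in the captioning direction (image embedding to text embedding) and inverted exactly for the generation direction. The coupling design, the 512-dimensional embeddings, the MSE objective, and all names are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: an additive-coupling invertible network between a frozen image
# embedding and a text embedding. Dimensions, objective, and layer design are
# assumptions for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Split the vector in half and shift one half by an MLP of the other.
    The transform is exactly invertible by subtracting the same shift."""
    def __init__(self, dim: int, hidden: int = 512, flip: bool = False):
        super().__init__()
        self.flip = flip
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, half))

    def forward(self, x):
        a, b = x.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        b = b + self.net(a)
        if self.flip:
            a, b = b, a
        return torch.cat([a, b], dim=-1)

    def inverse(self, y):
        a, b = y.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        b = b - self.net(a)
        if self.flip:
            a, b = b, a
        return torch.cat([a, b], dim=-1)

class InvertibleMapper(nn.Module):
    def __init__(self, dim: int = 512, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AdditiveCoupling(dim, flip=bool(i % 2)) for i in range(n_layers)]
        )

    def forward(self, img_emb):             # captioning direction
        h = img_emb
        for layer in self.layers:
            h = layer(h)
        return h                            # predicted text embedding

    def inverse(self, txt_emb):             # generation direction, no retraining
        h = txt_emb
        for layer in reversed(self.layers):
            h = layer.inverse(h)
        return h                            # recovered image embedding

# Training on captioning only: regress text embeddings from image embeddings.
model = InvertibleMapper(dim=512)
img_emb = torch.randn(8, 512)               # stand-ins for encoder outputs
txt_emb = torch.randn(8, 512)
loss = nn.functional.mse_loss(model(img_emb), txt_emb)
loss.backward()

# Inference for generation: invert the same weights to map text -> image space,
# then decode the recovered image embedding with a pretrained image decoder.
with torch.no_grad():
    recovered_img_emb = model.inverse(txt_emb)
```

Because each coupling layer has an exact closed-form inverse, the generation direction reuses the trained weights as-is, which is the "no additional training" property the abstract describes.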
Related papers
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [93.11954811297652]
We design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads.
We employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals.
Experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks.
arXiv Detail & Related papers (2021-12-14T00:20:55Z)
- EdiBERT, a generative model for image editing [12.605607949417033]
EdiBERT is a bi-directional transformer trained in the discrete latent space built by a vector-quantized auto-encoder.
We show that the resulting model matches state-of-the-art performances on a wide variety of tasks.
arXiv Detail & Related papers (2021-11-30T10:23:06Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods (a minimal contrastive-scoring sketch illustrating the underlying idea appears after this list).
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first method to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework in which a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)
- Efficient Neural Architecture for Text-to-Image Synthesis [6.166295570030645]
We show that an effective neural architecture can achieve state-of-the-art performance using single-stage training with a single generator and a single discriminator.
Our work points to a new direction for text-to-image research, which has not recently experimented with novel neural architectures.
arXiv Detail & Related papers (2020-04-23T19:33:40Z)
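The zero-shot image-to-text entry above repurposes a contrastive image-text model at inference time. As a rough illustration of the underlying idea only (not that paper's actual generation procedure), the sketch below uses CLIP via the Hugging Face transformers API to rank a hand-written list of candidate captions for an image; the checkpoint name, the file path, and the candidate texts are assumptions.

```python
# Sketch only: rank candidate captions for an image with CLIP similarity.
# The cited paper generates text guided by such a model rather than merely
# ranking a fixed list; this just illustrates the contrastive scoring step.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # placeholder path, any local image
candidates = [
    "a dog playing in the park",
    "a plate of food on a table",
    "a city street at night",
]

inputs = processor(text=candidates, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)[0]   # one score per caption
print(candidates[int(probs.argmax())], probs.tolist())
```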
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.