Learning Distinct and Representative Styles for Image Captioning
- URL: http://arxiv.org/abs/2209.08231v2
- Date: Tue, 15 Aug 2023 07:24:55 GMT
- Title: Learning Distinct and Representative Styles for Image Captioning
- Authors: Qi Chen, Chaorui Deng, Qi Wu
- Abstract summary: We propose a Discrete Mode Learning (DML) paradigm for image captioning.
Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings"
In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet.
- Score: 24.13549951795951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the years, state-of-the-art (SoTA) image captioning methods have
achieved promising results on some evaluation metrics (e.g., CIDEr). However,
recent findings show that the captions generated by these methods tend to be
biased toward the "average" caption that only captures the most general mode
(a.k.a. language pattern) in the training corpus, i.e., the so-called mode
collapse problem. Affected by it, the generated captions are limited in
diversity and usually less informative than natural image descriptions made by
humans. In this paper, we seek to avoid this problem by proposing a Discrete
Mode Learning (DML) paradigm for image captioning. Our innovative idea is to
explore the rich modes in the training caption corpus to learn a set of "mode
embeddings", and further use them to control the mode of the generated captions
for existing image captioning models. Specifically, the proposed DML optimizes
a dual architecture that consists of an image-conditioned discrete variational
autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC)
branch. The CdVAE branch maps each image caption to one of the mode embeddings
stored in a learned codebook, and is trained with a pure non-autoregressive
generation objective to make the modes distinct and representative. The MIC
branch can be simply modified from an existing image captioning model, where
the mode embedding is added to the original word embeddings as the control
signal. In the experiments, we apply the proposed DML to two widely used image
captioning models, Transformer and AoANet. The results show that the learned
mode embedding successfully facilitates these models to generate high-quality
image captions with different modes, further leading to better performance for
both diversity and quality on the MSCOCO dataset.
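To make the control signal concrete, below is a minimal PyTorch sketch of the two ingredients described above: a codebook of mode embeddings selected by nearest-neighbor lookup (VQ-style), and a decoder that adds the selected mode embedding to the word embeddings. The module names, sizes, and the straight-through gradient trick are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch, assuming a VQ-style codebook lookup and a small Transformer
# decoder; hyperparameters and module names are illustrative, not from the paper.
import torch
import torch.nn as nn


class ModeCodebook(nn.Module):
    """Learned codebook of K discrete mode embeddings (nearest-neighbor lookup)."""

    def __init__(self, num_modes: int = 64, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(num_modes, dim)

    def forward(self, caption_feat: torch.Tensor):
        # caption_feat: (B, D) pooled caption representation from the CdVAE branch.
        dist = torch.cdist(caption_feat, self.codebook.weight)   # (B, K)
        mode_idx = dist.argmin(dim=-1)                           # (B,)
        mode_emb = self.codebook(mode_idx)                       # (B, D)
        # Straight-through estimator so gradients can flow back to the encoder.
        mode_emb = caption_feat + (mode_emb - caption_feat).detach()
        return mode_emb, mode_idx


class ModeConditionedDecoder(nn.Module):
    """Captioning decoder where the mode embedding is added to the word embeddings."""

    def __init__(self, vocab_size: int = 10000, dim: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image_feats, mode_emb):
        # tokens: (B, T) caption token ids; image_feats: (B, R, D) visual features.
        x = self.word_emb(tokens) + mode_emb.unsqueeze(1)        # add the control signal
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(x, image_feats, tgt_mask=causal)
        return self.out(h)                                       # (B, T, vocab_size)


if __name__ == "__main__":
    codebook, decoder = ModeCodebook(), ModeConditionedDecoder()
    caption_feat = torch.randn(2, 512)      # stand-in for the caption encoding
    image_feats = torch.randn(2, 36, 512)   # stand-in for detector/grid features
    tokens = torch.randint(0, 10000, (2, 12))
    mode_emb, mode_idx = codebook(caption_feat)
    logits = decoder(tokens, image_feats, mode_emb)
    print(mode_idx.shape, logits.shape)     # torch.Size([2]) torch.Size([2, 12, 10000])
```
At inference time, one would pick a mode index directly from the codebook instead of encoding a caption, which is how a single captioning model can be steered toward different modes.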
Related papers
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, how synthetic captions interact with the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task, as measured by the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Cap2Aug: Caption guided Image to Image data Augmentation [41.53127698828463]
Cap2Aug is an image-to-image diffusion model-based data augmentation strategy using image captions as text prompts.
We generate captions from the limited training images and then use these captions to edit the training images with an image-to-image stable diffusion model.
This strategy produces augmented versions of the images that stay similar to the training images while adding semantic diversity across samples (see the sketch after this list).
arXiv Detail & Related papers (2022-12-11T04:37:43Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose a simple length-level embedding to endow existing captioning models with the ability to control caption length.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
- Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts [5.859294565508523]
A new captioning model is developed, including an image encoder to extract features, a mixture of recurrent networks to embed the extracted features into a set of words, and a sentence generator that combines the obtained words into a stylized sentence.
We show that the proposed captioning model can generate diverse and stylized image captions without the need for extra labeling.
arXiv Detail & Related papers (2020-07-07T11:00:27Z)
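As referenced in the Cap2Aug entry above, here is a minimal sketch of the caption-guided augmentation idea it summarizes, assuming the Hugging Face diffusers image-to-image pipeline and the runwayml/stable-diffusion-v1-5 checkpoint as stand-ins; the strength value and the caption source are illustrative choices, not the paper's exact recipe.
```python
# Hedged sketch: edit each training image with its caption as the prompt to get
# a semantically similar but visually varied augmented copy. Assumes a CUDA GPU
# and the diffusers library; checkpoint and strength are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")


def augment(image_path: str, caption: str, strength: float = 0.5) -> Image.Image:
    """Return one caption-guided augmented version of the image."""
    init = Image.open(image_path).convert("RGB").resize((512, 512))
    out = pipe(prompt=caption, image=init, strength=strength, guidance_scale=7.5)
    return out.images[0]


# Example usage with hypothetical (image, caption) pairs from a small training set:
# pairs = [("dog.jpg", "a brown dog playing with a ball on the grass"), ...]
# augmented = [augment(path, cap) for path, cap in pairs]
```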
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.