Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts
- URL: http://arxiv.org/abs/2007.03338v1
- Date: Tue, 7 Jul 2020 11:00:27 GMT
- Title: Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts
- Authors: Marzieh Heidari, Mehdi Ghatee, Ahmad Nickabadi, Arash Pourhasan Nezhad
- Abstract summary: A new captioning model is developed, comprising an image encoder that extracts the features, a mixture of recurrent networks that maps the set of extracted features to a set of words, and a sentence generator that combines the obtained words into a stylized sentence.
We show that the proposed captioning model can generate diverse and stylized image captions without the necessity of extra labeling.
- Score: 5.859294565508523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With great advances in vision and natural language processing, the generation
of image captions has become a pressing need. In a recent paper, Mathews, Xie and He [1]
presented a new model that generates styled captions by separating semantics and style.
In continuation of that work, a new captioning model is developed here, consisting of an
image encoder that extracts the features, a mixture of recurrent networks that maps the
set of extracted features to a set of words, and a sentence generator that combines the
obtained words into a stylized sentence. The resulting system, entitled Mixture of
Recurrent Experts (MoRE), uses a new training algorithm that applies singular value
decomposition (SVD) to the weight matrices of the Recurrent Neural Networks (RNNs) to
increase the diversity of the captions. Each decomposition step depends on a distinctive
factor based on the number of RNNs in MoRE. Since the sentence generator works from a
stylized language corpus without paired images, our captioning model can do the same.
Moreover, the styled and diverse captions are produced without training on a densely
labeled or styled dataset. To validate this captioning model, we use Microsoft COCO,
a standard factual image captioning dataset. We show that the proposed captioning model
can generate diverse and stylized image captions without the need for extra labeling.
The results also show better descriptions in terms of content accuracy.
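As a rough sketch of the SVD step described in this abstract (the exact factorization schedule and the per-expert factor are not specified here, so the rank rule below is an assumption), each recurrent expert's weight matrix can be reprojected through a truncated SVD with a different rank per expert:

```python
import numpy as np

def diversify_experts(recurrent_weights, keep_fractions):
    """Toy illustration: re-project each expert's recurrent weight matrix
    through a truncated SVD, keeping a different fraction of singular
    values per expert (the per-expert rule is an assumption, not the
    paper's exact algorithm)."""
    diversified = []
    for W, frac in zip(recurrent_weights, keep_fractions):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        k = max(1, int(round(frac * len(s))))    # distinct rank per expert
        W_k = (U[:, :k] * s[:k]) @ Vt[:k, :]     # rank-k reconstruction
        diversified.append(W_k)
    return diversified

# Example: 3 recurrent experts with hidden size 64.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((64, 64)) for _ in range(3)]
# One distinctive factor per expert, derived from the number of experts.
fractions = [(i + 1) / len(experts) for i in range(len(experts))]
new_weights = diversify_experts(experts, fractions)
print([w.shape for w in new_weights])
```

Keeping a different number of singular values per expert is one simple way to make the experts' dynamics, and hence their captions, diverge.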
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
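A minimal sketch of such a visual-similarity retriever, assuming precomputed image features and a flat in-memory index (both placeholders, not the paper's actual components):

```python
import numpy as np

def knn_retrieve(query_feat, memory_feats, memory_captions, k=5):
    """Return the captions attached to the k memory images most similar
    to the query, using cosine similarity over visual features."""
    q = query_feat / np.linalg.norm(query_feat)
    M = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = M @ q                      # cosine similarity to every memory entry
    top = np.argsort(-sims)[:k]       # indices of the k most similar images
    return [memory_captions[i] for i in top]

# Toy memory of 1000 image features (e.g. CNN or CLIP embeddings) plus captions.
rng = np.random.default_rng(1)
feats = rng.standard_normal((1000, 512))
caps = [f"caption {i}" for i in range(1000)]
print(knn_retrieve(rng.standard_normal(512), feats, caps, k=3))
```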
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
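One plausible shape of a rank-then-fuse step, with a placeholder scoring function and prompt (neither is the paper's actual component):

```python
def fuse_captions(candidates, score_fn):
    """Rank candidate captions from several models, then build a prompt
    asking an LLM to fuse the best ones into a single description.
    `score_fn` is a placeholder (e.g. an image-text similarity model)."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    top = ranked[:3]
    prompt = (
        "Combine the following captions into one detailed, accurate "
        "description:\n" + "\n".join(f"- {c}" for c in top)
    )
    return prompt  # would be sent to an LLM for the actual fusion step

candidates = ["a dog on grass", "a brown dog running in a park", "dog outside"]
print(fuse_captions(candidates, score_fn=len))  # toy score: longer is better
```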
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Learning Distinct and Representative Styles for Image Captioning [24.13549951795951]
We propose a Discrete Mode Learning (DML) paradigm for image captioning.
Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings".
In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet.
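A minimal PyTorch sketch of injecting a learned mode embedding into a caption decoder's input (the table size and injection point are assumptions):

```python
import torch
import torch.nn as nn

class ModeConditionedDecoderInput(nn.Module):
    """Adds one of several learned mode embeddings to the token embeddings,
    so the decoder can be steered toward a distinct captioning style."""
    def __init__(self, vocab_size=10000, d_model=512, num_modes=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.mode_emb = nn.Embedding(num_modes, d_model)  # "mode embeddings"

    def forward(self, token_ids, mode_id):
        # token_ids: (batch, seq_len), mode_id: (batch,)
        x = self.tok_emb(token_ids)
        return x + self.mode_emb(mode_id).unsqueeze(1)    # broadcast over time

layer = ModeConditionedDecoderInput()
tokens = torch.randint(0, 10000, (2, 12))
modes = torch.tensor([3, 17])      # pick a different mode per sample
print(layer(tokens, modes).shape)  # torch.Size([2, 12, 512])
```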
arXiv Detail & Related papers (2022-09-17T03:25:46Z)
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
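A minimal sketch of the OCR-as-extra-signal idea, assuming pytesseract and a simple prompt format (both are illustrative choices, not this paper's pipeline):

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR binary to be installed

def caption_prompt_with_ocr(image_path, base_prompt="Describe this image."):
    """Extract any text visible in the image and append it to the prompt,
    so a vision-language model can ground its caption in that text."""
    image = Image.open(image_path)
    ocr_text = pytesseract.image_to_string(image).strip()
    if ocr_text:
        return f'{base_prompt} The image contains the text: "{ocr_text}".'
    return base_prompt

# print(caption_prompt_with_ocr("street_sign.jpg"))
```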
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge set of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar and does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
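A sketch of using CLIP similarity as a scalar caption reward, here via the Hugging Face CLIP wrapper with an assumed checkpoint name (the paper's own reward setup may differ):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, captions):
    """Score each candidate caption by its CLIP similarity to the image;
    the scores can then serve as rewards during caption fine-tuning."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, num_captions)
    return logits.squeeze(0)

# rewards = clip_reward(Image.open("dog.jpg"),
#                       ["a dog", "a brown dog running on wet grass"])
```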
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
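A toy sketch of prepending a style token and retrieved keywords to the decoder input (the token format and the retrieval step are placeholders, not the paper's exact scheme):

```python
def build_decoder_input(style, keywords, caption_tokens):
    """Prepend a style control token and retrieved keywords to the target
    sequence, so the decoder sees style and semantics as separate signals."""
    style_token = f"<style:{style}>"
    keyword_tokens = [f"<kw:{k}>" for k in keywords]
    return [style_token, *keyword_tokens, "<bos>", *caption_tokens, "<eos>"]

# The keywords would come from a retrieval component over a text corpus.
print(build_decoder_input("humorous", ["dog", "frisbee"],
                          ["a", "dog", "leaps", "for", "a", "frisbee"]))
```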
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
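I2CE itself is a learned metric whose details are not given in this summary; as a stand-in for the general idea of scoring captions by semantic closeness rather than exact word overlap, the toy below uses off-the-shelf sentence embeddings (the model and scoring rule are assumptions, not the paper's method):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_caption_score(candidate, references):
    """Toy stand-in for a learned caption metric: embed the candidate and
    reference captions and return the best cosine similarity, so paraphrases
    score well even without word overlap."""
    cand_emb = encoder.encode(candidate, convert_to_tensor=True)
    ref_embs = encoder.encode(references, convert_to_tensor=True)
    return util.cos_sim(cand_emb, ref_embs).max().item()

print(semantic_caption_score("a puppy sprints across the lawn",
                             ["a dog runs on the grass"]))
```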
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose to use a simple length-level embedding to endow captioning models with the ability to control the length of generated captions.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions with a complexity independent of the caption length.
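A minimal PyTorch sketch of a length-level embedding added to the decoder's token embeddings (the number of levels and the injection point are assumptions):

```python
import torch
import torch.nn as nn

class LengthLevelEmbedding(nn.Module):
    """Adds a learned embedding for a coarse target-length level to the
    decoder's token embeddings, letting the model condition on length."""
    def __init__(self, d_model=512, num_levels=4):
        super().__init__()
        self.level_emb = nn.Embedding(num_levels, d_model)

    def forward(self, token_embeddings, length_level):
        # token_embeddings: (batch, seq_len, d_model), length_level: (batch,)
        return token_embeddings + self.level_emb(length_level).unsqueeze(1)

emb = LengthLevelEmbedding()
x = torch.randn(2, 10, 512)
levels = torch.tensor([0, 3])          # e.g. 0 = short caption, 3 = long
print(emb(x, levels).shape)            # torch.Size([2, 10, 512])
```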
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
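The summary does not define SF; one toy reading of an object-based fidelity score, stated purely as an assumption, compares detected object labels with the words of the generated caption:

```python
def object_semantic_fidelity(detected_objects, caption):
    """Toy object-based fidelity score: the fraction of detected object
    labels mentioned in the caption. This is an illustrative assumption,
    not the paper's exact SF definition."""
    caption_words = set(caption.lower().split())
    objects = [obj.lower() for obj in detected_objects]
    if not objects:
        return 0.0
    hits = sum(obj in caption_words for obj in objects)
    return hits / len(objects)

print(object_semantic_fidelity(["dog", "frisbee", "tree"],
                               "a dog catches a frisbee in the park"))
```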
arXiv Detail & Related papers (2020-03-26T04:43:30Z)