Diverse Image Captioning with Grounded Style
- URL: http://arxiv.org/abs/2205.01813v1
- Date: Tue, 3 May 2022 22:57:59 GMT
- Title: Diverse Image Captioning with Grounded Style
- Authors: Franz Klein, Shweta Mahajan, Stefan Roth
- Abstract summary: We propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations.
We encode the stylized information in the latent space of a Variational Autoencoder.
Our experiments on the Senticap and COCO datasets show the ability of our approach to generate accurate captions with diverse, image-grounded styles.
- Score: 19.434931809979282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stylized image captioning as presented in prior work aims to generate
captions that reflect characteristics beyond a factual description of the scene
composition, such as sentiments. Such prior work relies on given sentiment
identifiers, which are used to express a certain global style in the caption,
e.g., positive or negative, but without taking into account the stylistic
content of the visual scene. To address this shortcoming, we first analyze the
limitations of current stylized captioning datasets and propose COCO
attribute-based augmentations to obtain varied stylized captions from COCO
annotations. Furthermore, we encode the stylized information in the latent
space of a Variational Autoencoder; specifically, we leverage extracted image
attributes to explicitly structure its sequential latent space according to
different localized style characteristics. Our experiments on the Senticap and
COCO datasets show the ability of our approach to generate accurate captions
with diversity in styles that are grounded in the image.
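As a concrete illustration of the approach sketched in the abstract, below is a minimal PyTorch sketch of a sequential VAE whose per-step latent variables are conditioned on localized image-attribute embeddings. All module names, dimensions, and the attribute-conditioning interface are illustrative assumptions, not the authors' released implementation.
```python
# Minimal sketch (not the authors' code): a sequential VAE whose latent
# states are conditioned on per-step image-attribute embeddings, so each
# latent step can carry a different localized style characteristic.
import torch
import torch.nn as nn

class AttributeStructuredVAE(nn.Module):
    def __init__(self, vocab_size, attr_dim=300, emb_dim=256, hid_dim=512, z_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim + attr_dim, hid_dim, batch_first=True)
        self.to_mu = nn.Linear(hid_dim, z_dim)
        self.to_logvar = nn.Linear(hid_dim, z_dim)
        self.decoder = nn.LSTM(emb_dim + z_dim + attr_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, attrs):
        # tokens: (B, T) caption token ids; attrs: (B, T, attr_dim) per-step
        # attribute embeddings that localize style along the caption.
        x = torch.cat([self.embed(tokens), attrs], dim=-1)
        h, _ = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)         # (B, T, z_dim)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        y = torch.cat([self.embed(tokens), z, attrs], dim=-1)
        h_dec, _ = self.decoder(y)
        return self.out(h_dec), mu, logvar

def kl_term(mu, logvar):
    # KL divergence of each latent step to a standard normal prior.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
```
At inference time, sampling different latent sequences from the prior while varying the attribute inputs would produce stylistically diverse captions for the same image.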
Related papers
- Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics [3.9717825324709413]
Style has been primarily considered in terms of artistic elements such as colors, brushstrokes, and lighting.
In this study, we propose a zero-shot scheme for image variation with coordinated semantics.
arXiv Detail & Related papers (2024-10-24T08:34:57Z)
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [47.421888361871254]
Scene text images contain not only style information (font, background) but also content information (character, texture).
Previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance.
We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability.
arXiv Detail & Related papers (2024-05-07T15:00:11Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: being informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
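As a rough illustration of PoCa's local-plus-global idea summarized above, the sketch below captions a grid of crops and the full image, then merges the results. The grid size, captioner, and merger are assumed user-supplied components, not the paper's actual aggregation scheme.
```python
def pyramid_caption(image, captioner, merger, grid=2):
    # image: a PIL.Image; captioner and merger are assumed callables
    # (e.g., a pretrained captioning model and a language model).
    w, h = image.size
    local_captions = []
    for i in range(grid):
        for j in range(grid):
            box = (i * w // grid, j * h // grid,
                   (i + 1) * w // grid, (j + 1) * h // grid)
            local_captions.append(captioner(image.crop(box)))
    global_caption = captioner(image)
    # Merge localized detail into the global description.
    return merger(global_caption, local_captions)
```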
- ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora [37.53634609063878]
We propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap)
A conditional variational auto-encoder is then used to automatically capture diverse stylistic patterns in the latent space.
Experimental results on two widely used stylized image captioning datasets show that ADS-Cap achieves outstanding performance in terms of consistency with the image, style accuracy, and diversity.
arXiv Detail & Related papers (2023-08-02T13:33:20Z)
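The diversity mechanism summarized above can be shown schematically: at inference, a conditional VAE decoder produces different stylized captions by sampling different latent codes from the prior. The decoder interface below is an assumption for illustration, not the ADS-Cap implementation.
```python
import torch

def sample_diverse_captions(decoder, image_features, num_samples=5, z_dim=128):
    captions = []
    for _ in range(num_samples):
        z = torch.randn(1, z_dim)                    # style code from the prior
        captions.append(decoder(image_features, z))  # one stylized caption
    return captions
```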
- Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce the new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model's capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and assigns more flexible scores to candidate captions that use semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
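The entry above describes a learned metric that scores semantically similar phrasings more flexibly than n-gram overlap. As a generic illustration of that idea, and not the actual I2CE formulation, one can score a candidate against references in a sentence-embedding space; `embed` below is an assumed embedding function.
```python
import numpy as np

def soft_caption_score(candidate, references, embed):
    # embed: assumed sentence-embedding function, str -> 1-D numpy vector.
    c = embed(candidate)
    refs = np.stack([embed(r) for r in references])
    sims = refs @ c / (np.linalg.norm(refs, axis=1) * np.linalg.norm(c) + 1e-8)
    return float(sims.max())  # best cosine match over the references
```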
- Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts [5.859294565508523]
A new captioning model is developed, comprising an image encoder to extract features, a mixture of recurrent networks to map the extracted features to a set of words, and a sentence generator that combines the obtained words into a stylized sentence.
We show that the proposed captioning model can generate diverse and stylized image captions without the need for extra labeling.
arXiv Detail & Related papers (2020-07-07T11:00:27Z)
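The pipeline summarized above (image encoder, mixture of recurrent networks, sentence generator) can be sketched as follows; the expert count, gating scheme, and layer sizes are assumptions, and the paper's SVD-based component is omitted.
```python
# Schematic sketch of a mixture-of-recurrent-experts captioner, under the
# assumptions stated above; not the paper's implementation.
import torch
import torch.nn as nn

class MixtureOfRecurrentExperts(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.GRU(feat_dim, hid_dim, batch_first=True) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(feat_dim, n_experts)   # soft expert weighting
        self.generator = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats):
        # feats: (B, T, feat_dim) image-region features from a CNN encoder.
        weights = torch.softmax(self.gate(feats.mean(dim=1)), dim=-1)   # (B, E)
        outs = torch.stack([e(feats)[0] for e in self.experts], dim=1)  # (B, E, T, H)
        mixed = (weights[:, :, None, None] * outs).sum(dim=1)           # (B, T, H)
        return self.generator(mixed)  # per-step word logits
```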
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images without captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
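An object-based semantic-fidelity check as summarized above can be illustrated with a toy version that measures how many detector-found objects a caption actually mentions; the real SF definition in the Egoshots paper may match or weight objects differently.
```python
def object_fidelity(caption, detected_objects):
    # detected_objects: labels from any off-the-shelf object detector.
    words = set(caption.lower().split())
    mentioned = [obj for obj in detected_objects if obj.lower() in words]
    return len(mentioned) / max(len(detected_objects), 1)

# Example: object_fidelity("a dog chases a ball", ["dog", "ball", "tree"]) -> 0.66...
```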
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.