Controllable Image Captioning
- URL: http://arxiv.org/abs/2204.13324v1
- Date: Thu, 28 Apr 2022 07:47:49 GMT
- Title: Controllable Image Captioning
- Authors: Luka Maxwell
- Abstract summary: We introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics.
We propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequences.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art image captioners can generate accurate sentences
to describe images in a sequence-to-sequence manner, but without considering
controllability and interpretability. This falls far short of making image
captioning widely usable, as an image can be interpreted in infinitely many
ways depending on the target audience and the context at hand. Achieving
controllability is especially important when the image captioner is used by
different people, each with their own way of interpreting images. In this
paper, we introduce a novel framework for image captioning which can generate
diverse descriptions by capturing the co-dependence between Part-Of-Speech
tags and semantics. Our model decouples the direct dependence between
successive variables. In this way, it allows the decoder to exhaustively
search through the latent Part-Of-Speech choices while keeping decoding speed
proportional to the size of the POS vocabulary. Given a control signal in the
form of a sequence of Part-Of-Speech tags, we propose a method to generate
captions through a Transformer network, which predicts words based on the
input Part-Of-Speech tag sequence. Experiments on publicly available datasets
show that our model significantly outperforms state-of-the-art methods at
generating diverse image captions of high quality.
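A minimal sketch may help fix the core idea: condition each decoding step on a supplied POS tag by adding a tag embedding to the word embedding before a Transformer decoder attends over image region features. The class name, layer sizes, and embedding scheme below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class POSConditionedCaptioner(nn.Module):
    """Toy decoder: each step is conditioned on a given POS tag."""
    def __init__(self, vocab_size, pos_vocab_size, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.tag_emb = nn.Embedding(pos_vocab_size, d_model)   # POS control signal
        self.pos_enc = nn.Embedding(128, d_model)              # learned positions
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, words, pos_tags, image_feats):
        # words, pos_tags: (B, T) token ids; image_feats: (B, R, d_model) regions
        T = words.size(1)
        steps = torch.arange(T, device=words.device)
        # The tag embedding is added to the word embedding at every step, so
        # the predicted word is tied to the supplied POS tag.
        x = self.word_emb(words) + self.tag_emb(pos_tags) + self.pos_enc(steps)
        causal = torch.triu(torch.full((T, T), float('-inf'),
                                       device=words.device), diagonal=1)
        h = self.decoder(x, image_feats, tgt_mask=causal)
        return self.out(h)  # (B, T, vocab_size) next-word logits

# Toy usage: batch of 2, 6 caption steps, 36 region features per image.
model = POSConditionedCaptioner(vocab_size=10000, pos_vocab_size=20)
logits = model(torch.randint(0, 10000, (2, 6)),
               torch.randint(0, 20, (2, 6)),
               torch.randn(2, 36, 512))
print(logits.shape)  # torch.Size([2, 6, 10000])
```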
Related papers
- An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment [6.977241620071544]
Multimodal large language models have fueled progress in image captioning.
In this work, we show that this ability can be re-purposed for audio captioning.
We introduce a novel methodology for bridging the audiovisual modality gap.
arXiv Detail & Related papers (2024-10-08T12:52:48Z)
- Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z)
- Controllable Image Captioning via Prompting [9.935191668056463]
We show that a unified model can perform well across diverse domains and freely switch among multiple styles.
To be specific, we design a set of prompts to fine-tune the pre-trained image captioner.
In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts.
arXiv Detail & Related papers (2022-12-04T11:59:31Z)
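As a rough illustration of the prompting mechanism described above, the sketch below prepends style-specific prompt tokens to the decoding prefix and strips them from the output. The prompt ids, the `step_fn` interface, and the greedy loop are hypothetical stand-ins for the paper's fine-tuned captioner.

```python
import torch

# Hypothetical prompt token ids per style (in practice, from the tokenizer).
STYLE_PROMPTS = {"factual": [5, 17, 42], "romantic": [5, 17, 99], "humorous": [5, 17, 7]}

def decode_with_prompt(step_fn, image_feats, style, bos_id=1, eos_id=2, max_len=20):
    """Greedy decoding with a style prompt prepended to the token prefix.

    step_fn(tokens, image_feats) -> next-word logits; it stands in for any
    pre-trained captioner fine-tuned with these prompts.
    """
    prompt = STYLE_PROMPTS[style]
    tokens = prompt + [bos_id]
    for _ in range(max_len):
        logits = step_fn(torch.tensor(tokens), image_feats)
        nxt = int(logits.argmax())
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens[len(prompt):]  # strip the prompt, keep the caption tokens

# Toy usage with a random "model" so the sketch runs end to end.
dummy = lambda toks, feats: torch.randn(100)
print(decode_with_prompt(dummy, torch.randn(36, 512), "romantic"))
```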
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes the video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
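The gate-modulation idea lends itself to a small sketch: below, a vector derived from the exemplar sentence rescales an LSTM step's hidden and cell states. The sigmoid-gain form and all names are assumptions for illustration, not SMCG's exact formulation.

```python
import torch
import torch.nn as nn

class ModulatedLSTMCell(nn.Module):
    """LSTM step whose outputs are rescaled by a syntax-derived gain."""
    def __init__(self, input_size, hidden_size, syntax_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        # One multiplicative gain per hidden unit, predicted from the
        # exemplar-sentence (syntax) vector.
        self.gain = nn.Linear(syntax_size, hidden_size)

    def forward(self, x, state, syntax_vec):
        h, c = self.cell(x, state)
        g = torch.sigmoid(self.gain(syntax_vec))  # (B, hidden) in (0, 1)
        return h * g, c * g                       # modulate hidden and cell states

# Toy usage: batch of 4, 300-d word inputs, 128-d syntax vector.
cell = ModulatedLSTMCell(300, 512, 128)
h = c = torch.zeros(4, 512)
h, c = cell(torch.randn(4, 300), (h, c), torch.randn(4, 128))
print(h.shape, c.shape)  # torch.Size([4, 512]) torch.Size([4, 512])
```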
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model's capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions containing semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to relational information between objects in a visual scene.
This framework is advantageous in both the diversity and the amount of information, leading to comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
- More Grounded Image Captioning by Distilling Image-Text Matching Model [56.79895670335411]
We propose a Part-of-Speech (POS) enhanced image-text matching model, POS-SCAN, as effective knowledge distillation for more grounded image captioning.
The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module.
arXiv Detail & Related papers (2020-04-01T12:42:06Z)
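A hedged sketch of the second benefit above: the captioner's region attention is regularized toward the matcher's word-region alignment. The KL form and the weight `lam` below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def alignment_regularizer(captioner_attn, matcher_align, lam=0.1):
    # Both inputs: (B, T, R) distributions over R image regions per word.
    # KL(matcher || captioner), averaged over words, pulls the captioner's
    # attention toward the matcher's word-region alignment.
    kl = (matcher_align * (matcher_align.clamp_min(1e-8).log()
                           - captioner_attn.clamp_min(1e-8).log())).sum(-1)
    return lam * kl.mean()

# Toy usage: 2 captions, 7 words, 36 regions.
attn = F.softmax(torch.randn(2, 7, 36), dim=-1)   # captioner attention
align = F.softmax(torch.randn(2, 7, 36), dim=-1)  # matcher (POS-SCAN) alignment
print(alignment_regularizer(attn, align))          # added to the caption loss
```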
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
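Since Egoshots has no ground-truth captions, an object-based fidelity score can be sketched as the share of detected object labels that a caption mentions. This is a simplified reading for illustration only; the paper's exact SF formula may differ.

```python
def object_fidelity(caption, detected_objects):
    """Share of detected object labels mentioned in the caption."""
    if not detected_objects:
        return 0.0
    words = set(caption.lower().split())
    hits = sum(1 for obj in detected_objects if obj.lower() in words)
    return hits / len(detected_objects)

# Two of the three detected objects appear in the caption -> 0.67.
print(object_fidelity("a dog chases a ball in the park", ["dog", "ball", "tree"]))
```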
This list is automatically generated from the titles and abstracts of the papers on this site.