Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image
Captioning
- URL: http://arxiv.org/abs/2302.02124v2
- Date: Wed, 29 Nov 2023 12:31:42 GMT
- Title: Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image
Captioning
- Authors: Jingqiang Chen
- Abstract summary: Coherent entity-aware multi-image captioning aims to generate coherent captions for neighboring images in a news document.
This paper proposes a coherent entity-aware multi-image captioning model by making use of coherence relationships.
- Score: 0.65268245109828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Coherent entity-aware multi-image captioning aims to generate coherent
captions for neighboring images in a news document. There are coherence
relationships among neighboring images because they often describe the same
entities or events. These relationships are important for entity-aware
multi-image captioning, but are neglected in entity-aware single-image
captioning. Most existing work focuses on single-image captioning, while
multi-image captioning has not been explored before. Hence, this paper proposes
a coherent entity-aware multi-image captioning model by making use of coherence
relationships. The model consists of a Transformer-based caption generation
model and two types of contrastive learning-based coherence mechanisms. The
generation model generates the caption by paying attention to the image and the
accompanying text. The caption-caption coherence mechanism encourages entities
in the caption of an image to also appear in the captions of neighboring
images. The caption-image-text coherence mechanism encourages entities in the
caption of an image to also appear in the accompanying text. To evaluate
coherence between captions, two coherence evaluation metrics are proposed. A
new dataset, DM800K, is constructed, which has more images per document than
the two existing datasets GoodNews and NYT800K and is therefore more suitable
for multi-image captioning. Experiments on the three datasets show that the
proposed captioning model outperforms 7 baselines according to BLEU, ROUGE,
METEOR, and entity precision and recall scores. Experiments also show that the
generated captions are more coherent than those of the baselines according to
caption entity scores, caption ROUGE scores, the two proposed coherence
evaluation metrics, and human evaluations.
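
The abstract does not specify the exact form of the two contrastive coherence mechanisms, but the caption-caption mechanism can be read as an InfoNCE-style objective that pulls together representations of captions belonging to neighboring images in the same document and pushes away captions from other documents. The snippet below is a minimal PyTorch sketch of that reading; the pooling into per-caption embeddings, the temperature, and the in-batch negative sampling are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a caption-caption coherence loss (InfoNCE-style).
# The encoder pooling, temperature, and batch layout are assumptions for
# illustration, not the implementation described in the paper.
import torch
import torch.nn.functional as F

def caption_caption_coherence_loss(anchor_emb: torch.Tensor,
                                   neighbor_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """anchor_emb, neighbor_emb: (batch, dim) caption embeddings.
    Row i of neighbor_emb is a caption of a neighboring image from the same
    document as row i of anchor_emb; all other rows serve as negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    neighbor = F.normalize(neighbor_emb, dim=-1)
    logits = anchor @ neighbor.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)                 # diagonal pairs are positives

if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.randn(8, 256)   # e.g., pooled Transformer decoder states per caption
    b = torch.randn(8, 256)
    print(caption_caption_coherence_loss(a, b).item())
```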
Related papers
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn sentence-level representations.
Experimental results show that the proposed method aligns well with the scores generated by other contemporary metrics (a minimal sketch of this embedding-based scoring idea appears after this list).
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
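
Several entries above, such as the contrastive semantic similarity metric and I2CE, score a candidate caption by comparing learned sentence representations rather than n-gram overlap. The sketch below illustrates that general idea with an off-the-shelf sentence encoder; the model name, the max-over-references scoring rule, and the helper function are illustrative assumptions, not the metrics proposed in those papers.

```python
# Hypothetical embedding-based caption scoring sketch, in the spirit of learned
# caption-evaluation metrics such as I2CE. The encoder choice and scoring rule
# are illustrative assumptions, not the metrics proposed in the papers above.
from sentence_transformers import SentenceTransformer, util

def embedding_caption_score(candidate: str, references: list[str],
                            model: SentenceTransformer) -> float:
    """Score a candidate caption by its best cosine similarity to any reference."""
    cand_emb = model.encode(candidate, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    return util.cos_sim(cand_emb, ref_emb).max().item()

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder
    refs = ["A footballer celebrates after scoring for Manchester United."]
    print(embedding_caption_score("A Manchester United player celebrates a goal.",
                                  refs, model))
```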