Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia
- URL: http://arxiv.org/abs/2209.10474v1
- Date: Wed, 21 Sep 2022 16:14:15 GMT
- Title: Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia
- Authors: Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas
- Abstract summary: We propose the novel task of captioning Wikipedia images by integrating contextual knowledge.
Specifically, we produce models that jointly reason over Wikipedia articles, Wikimedia images and their associated descriptions.
- Score: 10.21762162291523
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Humans exploit prior knowledge to describe images, and are able to adapt
their explanation to specific contextual information, even to the extent of
inventing plausible explanations when contextual information and images do not
match. In this work, we propose the novel task of captioning Wikipedia images
by integrating contextual knowledge. Specifically, we produce models that
jointly reason over Wikipedia articles, Wikimedia images and their associated
descriptions to produce contextualized captions. In particular, the same
Wikimedia image can be used to illustrate different articles, and the produced
caption needs to be adapted to the specific context, which allows us to explore
the limits of a model's ability to adjust captions to different contextual
information. A particularly challenging task in this domain is dealing with
out-of-dictionary words and Named Entities. To address this, we propose a
pre-training objective, Masked Named Entity Modeling (MNEM), and show that this
pretext task yields an improvement compared to baseline models. Furthermore, we
verify that a model pre-trained with the MNEM objective on Wikipedia
generalizes well to a News Captioning dataset. Additionally, we define two
different test splits according to the difficulty of the captioning task. We
offer insights on the role and the importance of each modality and highlight
the limitations of our model. The code, models and data splits will be made
publicly available upon acceptance.
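To make the MNEM objective more concrete, below is a minimal sketch of a single MNEM pre-training step, assuming a BERT-style masked-language-model backbone from Hugging Face transformers and entity spans supplied by any off-the-shelf NER tagger; the paper's actual model also conditions on the image and article, and the caption and spans shown here are purely illustrative.

```python
# Minimal sketch of one Masked Named Entity Modeling (MNEM) step: mask the
# tokens belonging to named entities (rather than random tokens) and train
# the model to recover them. The entity spans are assumed to come from any
# off-the-shelf NER tagger and are hard-coded here for illustration.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

caption = "Aerial view of the Eiffel Tower in Paris"
entity_spans = [(19, 31), (35, 40)]  # hypothetical NER output: "Eiffel Tower", "Paris"

enc = tokenizer(caption, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]           # per-token character spans
input_ids = enc["input_ids"].clone()

# Labels are ignored (-100) everywhere except at entity tokens, which get masked.
labels = torch.full_like(input_ids, -100)
for i, (start, end) in enumerate(offsets.tolist()):
    if start == end:                             # special tokens such as [CLS]/[SEP]
        continue
    if any(start < e_end and end > e_start for e_start, e_end in entity_spans):
        labels[0, i] = input_ids[0, i]
        input_ids[0, i] = tokenizer.mask_token_id

out = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels)
out.loss.backward()                              # cross-entropy over masked entity tokens only
```

In full pre-training, the same entity-aware masking would be applied across Wikipedia captions while the model additionally attends to the image and article features described in the abstract.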
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: being informationally sufficient, minimally redundant, and readily comprehensible to humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
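As an illustration of that retrieval step, the sketch below ranks article sentences by CLIP similarity to the image; it assumes the Hugging Face openai/clip-vit-base-patch32 checkpoint, and the image path and sentences are hypothetical stand-ins rather than the cited paper's actual pipeline.

```python
# Minimal sketch of CLIP-based sentence retrieval: rank article sentences by
# cosine similarity to the image embedding and keep the closest ones as
# visual context for caption generation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("news_image.jpg")             # hypothetical news image
sentences = [                                    # hypothetical article sentences
    "The mayor opened the new bridge over the river on Tuesday.",
    "Crowds gathered along the banks to watch the ceremony.",
    "Stock markets closed slightly lower after the announcement.",
]

inputs = processor(text=sentences, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Normalise both sides and compute one cosine similarity per sentence.
image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
similarity = (text_emb @ image_emb.T).squeeze(-1)

top_k = similarity.topk(k=2).indices.tolist()
retrieved = [sentences[i] for i in top_k]        # sentences most aligned with the image
```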
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task in terms of the CIDEr metric.
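For readers unfamiliar with the metric, a minimal sketch of computing CIDEr with the pycocoevalcap package follows; this is not necessarily the evaluation code used in the cited work, and since CIDEr depends on corpus-level IDF statistics, scores on a toy corpus like this are illustrative only.

```python
# Minimal CIDEr computation with pycocoevalcap (pip install pycocoevalcap).
# gts maps each image id to its reference captions, res to the single
# generated caption; plain whitespace-tokenised strings are used here.
from pycocoevalcap.cider.cider import Cider

gts = {
    "img1": ["a man is riding a horse on the beach", "a person rides a horse by the sea"],
    "img2": ["two dogs play with a ball in the park", "dogs chasing a ball on the grass"],
}
res = {
    "img1": ["a man rides a horse on the beach"],
    "img2": ["two dogs play with a ball"],
}

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(gts, res)
print(f"CIDEr: {corpus_score:.3f}", per_image_scores)
```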
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Paraphrase Acquisition from Image Captions [36.94459555199183]
We propose to use captions from the Web as a previously underutilized resource for paraphrases.
We analyze captions in the English Wikipedia, where editors frequently relabel the same image for different articles.
We introduce characteristic maps along the two similarity dimensions to identify the style of paraphrases coming from different sources.
arXiv Detail & Related papers (2023-01-26T10:54:51Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Iconographic Image Captioning for Artworks [2.3859169601259342]
This work utilizes a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography.
The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task.
A transformer-based vision-language pre-trained model is fine-tuned using the artwork image dataset.
The quality of the generated captions and the model's capacity to generalize to new data are explored by employing the model on a new collection of paintings and analyzing the relation between commonly generated captions and the artistic genre.
arXiv Detail & Related papers (2021-02-07T23:11:33Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Understanding Guided Image Captioning Performance across Domains [22.283016988026926]
We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text.
Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets.
arXiv Detail & Related papers (2020-12-04T00:05:02Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)