Related papers: Iconographic Image Captioning for Artworks

URL: http://arxiv.org/abs/2102.03942v1
Date: Sun, 7 Feb 2021 23:11:33 GMT
Title: Iconographic Image Captioning for Artworks
Authors: Eva Cetinic
Abstract summary: This work utilizes a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography. The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task. A transformer-based vision-language pre-trained model is fine-tuned using the artwork image dataset. The quality of the generated captions and the model's capacity to generalize to new data are explored by employing the model on a new collection of paintings and performing an analysis of the relation between commonly generated captions and the artistic genre.
Score: 2.3859169601259342
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image captioning implies automatically generating textual descriptions of images based only on the visual input. Although this has been an extensively addressed research topic in recent years, not many contributions have been made in the domain of art historical data. In this particular context, the task of image captioning is confronted with various challenges such as the lack of large-scale datasets of image-text pairs, the complexity of meaning associated with describing artworks and the need for expert-level annotations. This work aims to address some of those challenges by utilizing a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography. The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task. Motivated by the state-of-the-art results achieved in generating captions for natural images, a transformer-based vision-language pre-trained model is fine-tuned using the artwork image dataset. Quantitative evaluation of the results is performed using standard image captioning metrics. The quality of the generated captions and the model's capacity to generalize to new data are explored by employing the model on a new collection of paintings and performing an analysis of the relation between commonly generated captions and the artistic genre. The overall results suggest that the model can generate meaningful captions that exhibit a stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.

Related papers

Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension. This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone. Our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task on the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts. Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model. We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models [63.545146807810305]
Text-to-image diffusion models can generate high-quality pictures from textual input prompts. These models have been trained using text data collected from content-based labelling protocols. We characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models.
arXiv Detail & Related papers (2022-10-19T14:20:05Z)
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision. We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
Partially-supervised novel object captioning leveraging context from paired data [11.215352918313577]
We create synthetic paired captioning data for novel objects by leveraging context from existing image-caption pairs. We further re-use these partially paired images with novel objects to create pseudo-label captions. Our approach achieves state-of-the-art results on the held-out MS COCO out-of-domain test split.
arXiv Detail & Related papers (2021-09-10T21:31:42Z)
Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework. To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions. In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.