Diffusion Based Augmentation for Captioning and Retrieval in Cultural
Heritage
- URL: http://arxiv.org/abs/2308.07151v1
- Date: Mon, 14 Aug 2023 13:59:04 GMT
- Title: Diffusion Based Augmentation for Captioning and Retrieval in Cultural
Heritage
- Authors: Dario Cioni, Lorenzo Berlincioni, Federico Becattini, Alberto del
Bimbo
- Abstract summary: This paper introduces a novel approach to address the challenges of limited annotated data and domain shifts in the cultural heritage domain.
By leveraging generative vision-language models, we augment art datasets by generating diverse variations of artworks conditioned on their captions.
- Score: 28.301944852273746
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Cultural heritage applications and advanced machine learning models are
creating a fruitful synergy to provide effective and accessible ways of
interacting with artworks. Smart audio-guides, personalized art-related content
and gamification approaches are just a few examples of how technology can be
exploited to provide additional value to artists or exhibitions. Nonetheless,
from a machine learning point of view, the amount of available artistic data is
often not enough to train effective models. Off-the-shelf computer vision
modules can still be exploited to some extent, yet a severe domain shift is
present between art images and standard natural image datasets used to train
such models, which can lead to degraded performance. This paper
introduces a novel approach to address the challenges of limited annotated data
and domain shifts in the cultural heritage domain. By leveraging generative
vision-language models, we augment art datasets by generating diverse
variations of artworks conditioned on their captions. This augmentation
strategy enhances dataset diversity, bridging the gap between natural images
and artworks, and improving the alignment of visual cues with knowledge from
general-purpose datasets. The generated variations assist in training vision and language models that develop a deeper understanding of artistic characteristics and are able to generate better captions with appropriate jargon.
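As a concrete illustration of the augmentation strategy described above, the sketch below generates caption-conditioned variations of an artwork with an off-the-shelf image-to-image diffusion pipeline. The library (Hugging Face diffusers), the checkpoint (runwayml/stable-diffusion-v1-5), and the hyperparameter values are illustrative assumptions, not necessarily the exact setup used in the paper:

    # Illustrative sketch: caption-conditioned augmentation of an artwork image.
    # Assumes Hugging Face diffusers and a Stable Diffusion v1.5 checkpoint;
    # the paper's actual generative model and settings may differ.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    artwork = Image.open("artwork.jpg").convert("RGB").resize((512, 512))
    caption = "A baroque oil painting of a stormy seascape"  # hypothetical caption

    variations = []
    for seed in range(4):  # a few variations per artwork
        generator = torch.Generator("cuda").manual_seed(seed)
        out = pipe(
            prompt=caption,
            image=artwork,
            strength=0.5,        # how far the variation departs from the original
            guidance_scale=7.5,  # how strongly the caption guides generation
            generator=generator,
        )
        variations.append(out.images[0])

    for i, img in enumerate(variations):
        img.save(f"artwork_variation_{i}.png")

Varying the strength parameter trades off fidelity to the original artwork against diversity of the generated variations, which is the main lever such an augmentation strategy would tune.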
Related papers
- KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph [24.586916324061168]
We present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations).
KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph.
Experimental results demonstrate that KALE achieves strong performance over existing state-of-the-art work across several artwork datasets.
arXiv Detail & Related papers (2024-09-17T06:39:18Z) - ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Visually-Situated Natural Language Understanding with Contrastive
Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Language Does More Than Describe: On The Lack Of Figurative Speech in
Text-To-Image Models [63.545146807810305]
Text-to-image diffusion models can generate high-quality pictures from textual input prompts.
These models have been trained using text data collected from content-based labelling protocols.
We characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models.
arXiv Detail & Related papers (2022-10-19T14:20:05Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
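A minimal sketch of this kind of lexical enrichment is given after this list.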
arXiv Detail & Related papers (2022-04-20T04:47:01Z) - Iconographic Image Captioning for Artworks [2.3859169601259342]
This work utilizes a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography.
The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task.
A transformer-based vision-language pre-trained model is fine-tuned using the artwork image dataset.
The quality of the generated captions and the model's capacity to generalize to new data are explored by applying the model to a new collection of paintings and analyzing the relation between commonly generated captions and the artistic genre.
arXiv Detail & Related papers (2021-02-07T23:11:33Z)
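Several of the entries above augment text with external lexical knowledge, most explicitly K-LITE. As a rough illustration of that idea, the snippet below appends a WordNet gloss to a class label before it is used as a text prompt; the function name, prompt template, and use of NLTK's WordNet interface are illustrative assumptions rather than any of the cited authors' implementations:

    # Illustrative sketch of K-LITE-style lexical enrichment with WordNet.
    # Requires: pip install nltk; python -m nltk.downloader wordnet
    from nltk.corpus import wordnet as wn

    def enrich_label(label: str) -> str:
        """Append a dictionary-style gloss to a class name, if one exists."""
        synsets = wn.synsets(label.replace(" ", "_"))
        if not synsets:
            return label  # no external knowledge found; keep the plain label
        return f"{label}, which is {synsets[0].definition()}"

    # e.g. enrich_label("triptych") returns the label followed by its WordNet gloss
    print(enrich_label("triptych"))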