Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions
- URL: http://arxiv.org/abs/2507.04377v2
- Date: Sun, 03 Aug 2025 20:05:38 GMT
- Title: Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions
- Authors: Xiao Zhang, Johan Bos
- Abstract summary: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. We introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content.
- Score: 7.8094805916085015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.
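The abstract outlines a pipeline: a VLM converts a tombstone photograph into a structured TMR, a RAG-style lookup grounds externally defined elements (toponyms, occupation codes, ontological concepts), and physical degradation is simulated by fusing the photograph with a damage texture. The paper does not publish an API, so the following is a minimal, hypothetical Python sketch of that flow; the `call_vlm` stub, the TMR field names, and the toy lookup tables are assumptions, and the image fusion follows a standard alpha-blend reading of "image fusion" rather than the authors' exact procedure.

```python
"""Hypothetical sketch of the tombstone-parsing pipeline described in the abstract."""
import base64
import json
from PIL import Image

# Toy stand-ins for the external resources used in the RAG-style enrichment
# (toponyms, occupation codes, ontological concepts); not the paper's resources.
GAZETTEER = {"Groningen": {"lat": 53.22, "lon": 6.57}}
OCCUPATION_CODES = {"carpenter": "95410"}  # illustrative code only

def call_vlm(prompt: str, image_b64: str) -> str:
    """Stub for a vision-language model call; wire up a real VLM backend here."""
    raise NotImplementedError

def parse_tombstone(image_path: str) -> dict:
    """Parse a tombstone photo into a structured, TMR-like dictionary."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    prompt = (
        "Read the tombstone in the image and return JSON with fields: "
        "name, birth_date, death_date, birth_place, occupation, epitaph."
    )
    record = json.loads(call_vlm(prompt, image_b64))
    # RAG-style enrichment: ground free-text fields in external resources.
    if record.get("birth_place") in GAZETTEER:
        record["birth_place_coords"] = GAZETTEER[record["birth_place"]]
    occupation = (record.get("occupation") or "").lower()
    if occupation in OCCUPATION_CODES:
        record["occupation_code"] = OCCUPATION_CODES[occupation]
    return record

def simulate_degradation(photo_path: str, damage_path: str, alpha: float = 0.4) -> Image.Image:
    """Approximate the degradation test by alpha-blending a damage texture onto the photo."""
    photo = Image.open(photo_path).convert("RGB")
    damage = Image.open(damage_path).convert("RGB").resize(photo.size)
    return Image.blend(photo, damage, alpha)  # alpha=0 keeps the photo, 1 keeps the texture
```

In practice the enrichment step would query a full gazetteer, an occupation classification, and an ontology rather than the toy dictionaries above.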
Related papers
- TokBench: Evaluating Your Visual Tokenizer before Visual Generation [75.38270351179018]
We analyze text and face reconstruction quality across various scales for different image tokenizers and VAEs. Our results show modern visual tokenizers still struggle to preserve fine-grained features, especially at smaller scales.
arXiv Detail & Related papers (2025-05-23T17:52:16Z)
- ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding [16.9945713458689]
ArtRAG is a novel framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. At inference time, a structured retriever selects semantically and topologically relevant subgraphs to guide generation. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines.
arXiv Detail & Related papers (2025-05-09T13:08:27Z)
- Restoring Ancient Ideograph: A Multimodal Multitask Neural Network Approach [11.263700269889654]
This paper proposes a novel Multimodal Multitask Restoring Model (MMRM) to restore ancient texts.
It combines context understanding with residual visual information from damaged ancient artefacts, enabling it to predict damaged characters and generate restored images simultaneously.
arXiv Detail & Related papers (2024-03-11T12:57:28Z)
- Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision [5.517240672957627]
We propose a novel knowledge-aware artifact image synthesis approach that brings lost historical objects accurately into their visual forms.
Compared to existing approaches, our proposed model produces higher-quality artifact images that align better with the implicit details and historical knowledge contained within written documents.
arXiv Detail & Related papers (2023-12-13T11:03:07Z)
- (Re)framing Built Heritage through the Machinic Gaze [3.683202928838613]
We argue that the proliferation of machine learning and vision technologies creates new scopic regimes for heritage.
We introduce the term 'machinic gaze' to conceptualise the reconfiguration of heritage representation via AI models.
arXiv Detail & Related papers (2023-10-06T23:48:01Z)
- Text-to-Image Generation for Abstract Concepts [76.32278151607763]
We propose a framework of Text-to-Image generation for Abstract Concepts (TIAC).
The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity.
The concept-dependent form is retrieved from an LLM-extracted form pattern set.
arXiv Detail & Related papers (2023-09-26T02:22:39Z)
- Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis [1.124958340749622]
We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA).
Our findings reveal that images with Gibsonian affordance show a higher frequency of captions containing 'holding-verbs' and 'container-nouns' compared to images displaying telic affordance.
arXiv Detail & Related papers (2023-05-24T01:30:50Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models [63.545146807810305]
Text-to-image diffusion models can generate high-quality pictures from textual input prompts.
These models have been trained using text data collected from content-based labelling protocols.
We characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models.
arXiv Detail & Related papers (2022-10-19T14:20:05Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet converged on a conclusive solution.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
- Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose several key components: (1) part-of-speech tagging to filter out non-semantic words in the given description, (2) an affine combination module to effectively fuse text and image features from different modalities (a minimal sketch of such a module appears after this list), and (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators.
arXiv Detail & Related papers (2020-02-12T21:09:15Z)
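The affine combination module mentioned in the last entry is described only at a high level above; the PyTorch sketch below shows one common FiLM-style reading of such a module, in which text features predict a per-channel scale and shift that modulate an image feature map. The class name, layer sizes, and fusion formula are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AffineCombination(nn.Module):
    """FiLM-style fusion: text features predict a per-channel scale and shift
    applied to an image feature map. A sketch, not the paper's exact module."""
    def __init__(self, text_dim: int, image_channels: int):
        super().__init__()
        self.scale = nn.Linear(text_dim, image_channels)
        self.shift = nn.Linear(text_dim, image_channels)

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, C, H, W); text_feat: (B, text_dim)
        gamma = self.scale(text_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.shift(text_feat).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return image_feat * (1 + gamma) + beta

# Usage: fuse a sentence embedding with a convolutional feature map.
fusion = AffineCombination(text_dim=256, image_channels=64)
fused = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
```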