Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application
- URL: http://arxiv.org/abs/2506.03180v1
- Date: Thu, 29 May 2025 14:49:24 GMT
- Title: Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application
- Authors: Jan Ignatowicz, Krzysztof Kutt, Grzegorz J. Nalepa
- Abstract summary: Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creating extensive digital collections. These collections are often enriched with metadata that describes the items but not their actual contents. We explore an integrated methodology of computer vision (CV), artificial intelligence (AI) and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.
- Score: 8.732274235941974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Digitizing cultural heritage collections has become crucial for the preservation of historical artifacts and for making them more available to the wider public. Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creating extensive digital collections. Those collections are often enriched with metadata that describes the items but not their actual contents. The Jagiellonian Digital Library, a good example of such an effort, offers datasets accessible through protocols like OAI-PMH. Despite these improvements, metadata completeness and standardization continue to pose substantial obstacles, limiting searchability and potential connections between collections. To deal with these challenges, we explore an integrated methodology of computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.
Related papers
- Metadata Enrichment of Long Text Documents using Large Language Models [3.536523762475449]
In this project, we semantically enriched the metadata of long text documents (English-language theses and dissertations published from 1920 to 2020) retrieved from the HathiTrust Digital Library. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science.
arXiv Detail & Related papers (2025-06-26T00:55:47Z)
- Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications [8.732274235941974]
We present the Metadata Enrichment Model (MEM), a conceptual framework designed to enrich metadata for digitized collections. MEM combines fine-tuned computer vision models, large language models and structured knowledge graphs. We release a dataset of digitized incunabula from the Jagiellonian Digital Library.
arXiv Detail & Related papers (2025-05-29T15:22:18Z)
- Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence [88.74800617923083]
We introduce Granite Vision, a lightweight large language model with vision capabilities. Our model is trained on a comprehensive instruction-following dataset. Granite Vision achieves strong results in standard benchmarks related to visual document understanding.
arXiv Detail & Related papers (2025-02-14T05:36:32Z)
- Making History Readable [0.0]
This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps.
We discuss the challenges with each collection and detail our approaches to address them.
Our proposed methods aim to enhance the user experience by making the contents in these collections easier to search and navigate.
arXiv Detail & Related papers (2024-11-26T17:06:58Z)
- A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain [3.9519587827662397]
We focus on relation extraction and text classification, using the showcase of eight biomedical benchmarks.
We consider trade-offs between accuracy and application costs, dive into training data generation through distant supervision and large language models such as ChatGPT, Llama, and Olmo, and discuss how to design final pipelines.
arXiv Detail & Related papers (2024-11-06T07:54:10Z)
- Unlocking Comics: The AI4VA Dataset for Visual Understanding [62.345344799258804]
This paper presents a novel dataset comprising Franco-Belgian comics from the 1950s annotated for tasks including depth estimation, semantic segmentation, saliency detection, and character identification.
It consists of two distinct and consistent styles and incorporates object concepts and labels taken from natural images.
By including such diverse information across styles, this dataset not only holds promise for computational creativity but also offers avenues for the digitization of art and storytelling innovation.
arXiv Detail & Related papers (2024-10-27T14:27:05Z)
- Microsoft Cloud-based Digitization Workflow with Rich Metadata Acquisition for Cultural Heritage Objects [7.450700594277742]
We have developed a new digitization workflow with the Jagiellonian Library (JL).
The solution is based on easy-to-access technological solutions -- Microsoft cloud with MS Excel files interfaces, Office Script for metadata acquisition, MS 365 for storage -- that allow metadata acquisition by domain experts.
The ultimate goal is to create a knowledge graph that describes the analyzed holdings, linked to general knowledge bases, as well as to other cultural heritage collections.
arXiv Detail & Related papers (2024-07-09T15:49:47Z)
- A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
- Open Set Classification of Untranscribed Handwritten Documents [56.0167902098419]
Huge amounts of digital page images of important manuscripts are preserved in archives worldwide.
The class or "typology" of a document is perhaps the most important tag to be included in the metadata.
The technical problem is one of automatic classification of documents, each consisting of a set of untranscribed handwritten text images.
arXiv Detail & Related papers (2022-06-20T20:43:50Z)
- Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- Object Retrieval and Localization in Large Art Collections using Deep Multi-Style Feature Fusion and Iterative Voting [10.807131260367298]
We introduce an algorithm that allows users to search for image regions containing specific motifs or objects.
Our region-based voting with GPU-accelerated approximate nearest-neighbour search allows us to find and localize even small motifs within an extensive dataset in a few seconds.
arXiv Detail & Related papers (2021-07-14T18:40:49Z)