Related papers: Microsoft Cloud-based Digitization Workflow with Rich Metadata Acquisition for Cultural Heritage Objects

URL: http://arxiv.org/abs/2407.06972v1
Date: Tue, 9 Jul 2024 15:49:47 GMT
Title: Microsoft Cloud-based Digitization Workflow with Rich Metadata Acquisition for Cultural Heritage Objects
Authors: Krzysztof Kutt, Jakub Gomułka, Luiz do Valle Miranda, Grzegorz J. Nalepa
Abstract summary: We have developed a new digitization workflow in collaboration with the Jagiellonian Library (JL). The solution is based on easy-to-access technological solutions -- Microsoft 365 cloud with MS Excel files as metadata acquisition interfaces, Office Script for validation, and MS Sharepoint for storage -- that allows metadata acquisition by domain experts. The ultimate goal is to create a knowledge graph that describes the analyzed holdings, linked to general knowledge bases, as well as to other cultural heritage collections.
Score: 7.450700594277742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In response to several cultural heritage initiatives at the Jagiellonian University, we have developed a new digitization workflow in collaboration with the Jagiellonian Library (JL). The solution is based on easy-to-access technological solutions -- Microsoft 365 cloud with MS Excel files as metadata acquisition interfaces, Office Script for validation, and MS Sharepoint for storage -- that allows metadata acquisition by domain experts (philologists, historians, philosophers, librarians, archivists, curators, etc.) regardless of their experience with information systems. The ultimate goal is to create a knowledge graph that describes the analyzed holdings, linked to general knowledge bases, as well as to other cultural heritage collections, so careful attention is paid to the high accuracy of metadata and proper links to external sources. The workflow has already been evaluated in two pilots in the DiHeLib project focused on digitizing the so-called "Berlin Collection" and in two workshops with international guests, which allowed for its refinement and confirmation of its correctness and usability for JL. As the proposed workflow does not interfere with existing systems or domain guidelines regarding digitization and basic metadata collection in a given institution (e.g., file type, image quality, use of Dublin Core/MARC-21), but extends them in order to enable rich metadata collection, not previously possible, we believe that it could be of interest to all GLAMs (galleries, libraries, archives, and museums).

Related papers

Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application [8.732274235941974]
Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creating extensive digital collections. These collections are often enriched with metadata describing items but not exactly their contents. We explore an integrated methodology of computer vision (CV), artificial intelligence (AI) and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.
arXiv Detail & Related papers (2025-05-29T14:49:24Z)
Chatting with Papers: A Hybrid Approach Using LLMs and Knowledge Graphs [3.68389405018277]
This demo paper reports on a new workflow, GhostWriter, that combines the use of Large Language Models and Knowledge Graphs to support navigation through collections. Based on the tool-suite EverythingData at the backend, GhostWriter provides an interface that enables querying and "chatting" with a collection.
arXiv Detail & Related papers (2025-05-16T18:51:51Z)
Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora [2.3251886193174114]
We present an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
arXiv Detail & Related papers (2025-02-19T13:03:06Z)
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks [57.589795399265945]
We introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We also introduce BigDocs-Bench, a benchmark suite with 10 novel tasks. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o.
arXiv Detail & Related papers (2024-12-05T21:41:20Z)
Web Archives Metadata Generation with GPT-4o: Challenges and Insights [2.45723043286596]
This paper explores the use of GPT-4o for metadata generation within the Web Archive Singapore. We processed 112 Web ARChive (WARC) files using data reduction techniques, achieving a notable 99.9% reduction in metadata generation costs. The study identifies key challenges including content inaccuracies, hallucinations, and translation issues, suggesting that Large Language Models (LLMs) should serve as complements rather than replacements for human cataloguers.
arXiv Detail & Related papers (2024-11-08T08:59:40Z)
A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain [3.9519587827662397]
We focus on relation extraction and text classification, using the showcase of eight biomedical benchmarks. We consider trade-offs between accuracy and application costs, dive into training data generation through distant supervision and large language models such as ChatGPT, LLama, and Olmo, and discuss how to design final pipelines.
arXiv Detail & Related papers (2024-11-06T07:54:10Z)
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models [66.91204604417912]
This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs. We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
arXiv Detail & Related papers (2024-10-04T00:53:32Z)
Advancing Manuscript Metadata: Work in Progress at the Jagiellonian University [7.993453987882035]
Three Jagiellonian University units are collaborating to digitize cultural heritage documents, describe them in detail, and then integrate these descriptions into a linked data cloud. We present a report on the current status of the work, in which we outline the most important requirements for the data model under development. We make a detailed comparison with the two standards that are the most relevant from the point of view of collections: Europeana Data Model used in Europeana and Encoded Archival Description used in Kalliope.
arXiv Detail & Related papers (2024-07-09T15:52:06Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development. We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM). We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models [11.959445364035734]
80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats. We introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous documents. Docs2KG generates a unified knowledge graph that represents the extracted key information.
arXiv Detail & Related papers (2024-06-05T05:35:59Z)
Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation. We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques. We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z)
Learning to Learn from APIs: Black-Box Data-Free Meta-Learning [95.41441357931397]
Data-free meta-learning (DFML) aims to enable efficient learning of new tasks by meta-learning from a collection of pre-trained models without access to the training data. Existing DFML work can only meta-learn from (i) white-box and (ii) small-scale pre-trained models. We propose a Bi-level Data-free Meta Knowledge Distillation (BiDf-MKD) framework to transfer more general meta knowledge from a collection of black-box APIs to one single model.
arXiv Detail & Related papers (2023-05-28T18:00:12Z)
LAVIS: A Library for Language-Vision Intelligence [98.88477610704938]
LAVIS is an open-source library for LAnguage-VISion research and applications. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets.
arXiv Detail & Related papers (2022-09-15T18:04:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.