Related papers: Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

URL: http://arxiv.org/abs/2511.10354v1
Date: Fri, 14 Nov 2025 01:46:52 GMT
Title: Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates
Authors: Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp,
Abstract summary: This paper introduces ATR4CH, a systematic five-step methodology for Large Language Model-based Knowledge Extraction.<n>We validate the methodology through a case study on authenticity assessment debates.<n>ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable Knowledge Graphs.
Score: 0.3033221007650832
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts...), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

Related papers

Cultural Analytics for Good: Building Inclusive Evaluation Frameworks for Historical IR [6.217528366941651]
This work bridges the fields of information retrieval and cultural analytics to support equitable access to historical knowledge.<n>Using the British Library BL19 digital collection, we construct a benchmark for studying changes in language, terminology and retrieval in the 19th-century fiction and non-fiction.
arXiv Detail & Related papers (2026-01-17T01:54:55Z)
Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation [3.141716989847573]
This study proposes a cognitive benchmarking framework to evaluate how large language models (LLMs) process and apply culturally specific knowledge.<n>The framework integrates Bloom's taxonomy with Retrieval-Augmented Generation (RAG) to assess model performance across six hierarchical cognitive domains.
arXiv Detail & Related papers (2025-11-03T15:04:23Z)
CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models.<n>Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification.<n> Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z)
CultureSynth: A Hierarchical Taxonomy-Guided and Retrieval-Augmented Framework for Cultural Question-Answer Synthesis [41.483432890962824]
We introduce Culture Synth, a novel framework for assessing large language models' cultural competence.<n>The Culture Synth-7 benchmark contains 19,360 entries and 4,149 manually verified entries across 7 languages.
arXiv Detail & Related papers (2025-09-13T16:33:56Z)
HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG)<n>System addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z)
Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering [51.7493726399073]
We present a discourse-aware hierarchical framework to enhance long document question answering.<n>The framework involves three key innovations: specialized discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval.
arXiv Detail & Related papers (2025-05-26T14:45:12Z)
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [97.31347312130119]
SciRIFF (Scientific Resource for Instruction-Following and Finetuning) is a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks.<n>These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification.<n> SciRIFF is unique in being entirely expert-written, high-quality instruction-following dataset for extracting and synthesizing information from research literature across diverse scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
Leveraging Large Language Models for Semantic Query Processing in a Scholarly Knowledge Graph [1.7418328181959968]
The proposed research aims to develop an innovative semantic query processing system. It enables users to obtain comprehensive information about research works produced by Computer Science (CS) researchers at the Australian National University.
arXiv Detail & Related papers (2024-05-24T09:19:45Z)
CRAFT: Extracting and Tuning Cultural Instructions from the Wild [38.255242754975654]
This paper introduces a novel pipeline for extracting high-quality, culturally-related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvement of up to 6%.
arXiv Detail & Related papers (2024-05-06T03:21:55Z)
Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition. Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages. Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks. Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments. Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
O-Dang! The Ontology of Dangerous Speech Messages [53.15616413153125]
We present O-Dang!: The Ontology of Dangerous Speech Messages, a systematic and interoperable Knowledge Graph (KG) O-Dang! is designed to gather and organize Italian datasets into a structured KG, according to the principles shared within the Linguistic Linked Open Data community. It provides a model for encoding both gold standard and single-annotator labels in the KG.
arXiv Detail & Related papers (2022-07-13T11:50:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.