LLM-based Triplet Extraction for Automated Ontology Generation in Software Engineering Standards
- URL: http://arxiv.org/abs/2509.00140v1
- Date: Fri, 29 Aug 2025 17:14:54 GMT
- Title: LLM-based Triplet Extraction for Automated Ontology Generation in Software Engineering Standards
- Authors: Songhui Yue,
- Abstract summary: Software engineering standards (SES) consist of long, unstructured text (with high noise) and paragraphs with domain-specific terms. This work proposes an open-source large language model (LLM)-assisted approach to RTE for SES.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ontologies have supported knowledge representation and whitebox reasoning for decades; thus, automated ontology generation (AOG) plays a crucial role in scaling their use. Software engineering standards (SES) consist of long, unstructured text (with high noise) and paragraphs with domain-specific terms. In this setting, relation triple extraction (RTE), together with term extraction, constitutes the first stage toward AOG. This work proposes an open-source large language model (LLM)-assisted approach to RTE for SES. Instead of relying solely on prompt-engineering-based methods, this study promotes the use of LLMs as an aid in constructing ontologies and explores an effective AOG workflow that includes document segmentation, candidate term mining, LLM-based relation inference, term normalization, and cross-section alignment. Gold-standard benchmarks at three granularities are constructed and used to evaluate the ontology generated in the study. The results show that it is comparable, and potentially superior, to the OpenIE method of triple extraction.
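The AOG workflow enumerated in the abstract (segmentation, candidate term mining, relation inference, normalization, alignment) can be illustrated with a minimal pipeline sketch. All function names here are hypothetical, and a trivial co-occurrence heuristic stands in for the LLM-based relation inference step:

```python
import re

def segment_document(text):
    """Split a standards document into paragraph-level segments."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def mine_candidate_terms(segment):
    """Toy candidate-term miner: capitalized multi-word phrases."""
    return set(re.findall(r"(?:[A-Z][a-z]+ ?){2,}", segment))

def normalize_term(term):
    """Lowercase and collapse whitespace so variants align across sections."""
    return " ".join(term.lower().split())

def extract_triples(segment, terms):
    """Stand-in for LLM-based relation inference: link co-occurring
    normalized terms within a segment with a generic relation."""
    norm = sorted({normalize_term(t) for t in terms})
    return [(a, "related_to", b)
            for i, a in enumerate(norm) for b in norm[i + 1:]]

doc = ("Software Configuration Management shall follow the "
       "Quality Assurance Plan.\n\nThe Quality Assurance Plan "
       "defines audit procedures.")
triples = []
for seg in segment_document(doc):
    triples.extend(extract_triples(seg, mine_candidate_terms(seg)))
```

In a real system, `extract_triples` would prompt the LLM with the segment and its mined terms, and a further cross-section alignment pass would merge normalized terms that recur across segments.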
Related papers
- LLM-Driven Ontology Construction for Enterprise Knowledge Graphs [0.0]
This paper introduces OntoEKG, a pipeline designed to accelerate the generation of domain-specific ontologies from unstructured enterprise data. Our approach decomposes the modelling task into two distinct phases: an extraction module that identifies core classes and properties, and an entailment module that logically organises these elements into a hierarchy before serialising them into standard RDF. Addressing the significant lack of comprehensive benchmarks for end-to-end construction, we adopt a new evaluation dataset derived from documents across the Data, Finance, and Logistics sectors.
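The two-phase shape described here (extract classes and properties, then serialise into RDF) can be sketched with stubs; the pattern-matching extractor and the example sentence are hypothetical stand-ins for the paper's LLM-driven modules:

```python
import re

def extract_schema(snippets):
    """Extraction phase (stub): any 'X has Y' pattern yields two candidate
    classes and a connecting property. A real system would call an LLM."""
    classes, props = set(), []
    for s in snippets:
        for a, b in re.findall(r"(\w+) has (\w+)", s):
            a, b = a.capitalize(), b.capitalize()
            classes.update([a, b])
            props.append((a, "has" + b, b))
    return classes, props

def serialize_rdf(classes, props):
    """Serialisation phase: emit the extracted schema as minimal Turtle."""
    lines = [
        "@prefix ex: <http://example.org/> .",
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    ]
    lines += [f"ex:{c} a rdfs:Class ." for c in sorted(classes)]
    lines += [f"ex:{p} a rdf:Property ; rdfs:domain ex:{d} ; rdfs:range ex:{r} ."
              for d, p, r in props]
    return "\n".join(lines)

ttl = serialize_rdf(*extract_schema(["Every invoice has supplier details."]))
```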
arXiv Detail & Related papers (2026-02-01T15:13:30Z) - From Prompt to Graph: Comparing LLM-Based Information Extraction Strategies in Domain-Specific Ontology Development [14.475791894420666]
Ontologies are essential for structuring domain knowledge, improving accessibility, sharing, and reuse. Traditional ontologies rely on manual annotation and conventional natural language processing (NLP) techniques. The rise of Large Language Models (LLMs) offers new possibilities for automating knowledge extraction. This study investigates three LLM-based approaches, a pre-trained LLM-driven method, an in-context learning (ICL) method, and a fine-tuning method, to extract terms and relations from domain-specific texts.
arXiv Detail & Related papers (2026-01-31T12:50:23Z) - Improving LLM-based Ontology Matching with fine-tuning on synthetic data [0.0]
Large Language Models (LLMs) are increasingly being integrated into various components of Ontology Matching pipelines. This paper investigates the capability of LLMs to perform ontology matching directly on ontology modules and generate the corresponding alignments. A dedicated fine-tuning strategy can enhance the model's matching performance in a zero-shot setting.
arXiv Detail & Related papers (2025-11-27T16:46:45Z) - Executable Knowledge Graphs for Replicating AI Research [65.41207324831583]
Executable Knowledge Graphs (xKG) is a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. Code will be released at https://github.com/zjunlp/xKG.
arXiv Detail & Related papers (2025-10-20T17:53:23Z) - Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering [59.54662810933882]
Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models, often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering.
arXiv Detail & Related papers (2025-09-23T15:12:58Z) - Arce: Augmented Roberta with Contextualized Elucidations for Ner in Automated Rule Checking [5.783497520591236]
ARCE (augmented RoBERTa with contextualized elucidations) is a novel approach that systematically explores and optimizes this generation process. ARCE establishes a new state of the art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task.
arXiv Detail & Related papers (2025-08-10T10:49:48Z) - Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction (AOE) Benchmark is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z) - Retrieval Augmented Generation for Topic Modeling in Organizational Research: An Introduction with Empirical Demonstration [0.0]
This paper introduces Agentic Retrieval-Augmented Generation (Agentic RAG) as a method for topic modeling with LLMs. It integrates three key components: (1) retrieval, enabling automated access to external data beyond an LLM's pre-trained knowledge; (2) generation, leveraging LLM capabilities for text synthesis; and (3) agent-driven learning, iteratively refining the retrieval and query formulation processes. Our findings demonstrate that the approach is more efficient and interpretable, while achieving higher reliability and validity than the standard machine learning approach.
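The three components listed in this summary can be illustrated with a toy loop. Keyword overlap stands in for embedding-based retrieval, and the generation and refinement steps are stubs; all names and the tiny corpus are hypothetical:

```python
def retrieve(query, corpus, k=2):
    # (1) retrieval: keyword overlap stands in for embedding search
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:k]

def generate(query, docs):
    # (2) generation: a real system would prompt an LLM with the documents
    return f"Topic for '{query}' grounded in {len(docs)} documents."

def refine(query, docs):
    # (3) agent-driven learning: expand the query with retrieved vocabulary
    extra = {w for d in docs for w in d.lower().split()} - set(query.lower().split())
    return query + (" " + sorted(extra)[0] if extra else "")

corpus = ["remote work and employee morale", "hybrid meetings and team cohesion"]
docs = retrieve("employee morale", corpus)
answer = generate(refine("employee morale", docs), docs)
```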
arXiv Detail & Related papers (2025-02-28T11:25:11Z) - Automating Intervention Discovery from Scientific Literature: A Progressive Ontology Prompting and Dual-LLM Framework [56.858564736806414]
This paper proposes a novel framework leveraging large language models (LLMs) to identify interventions in scientific literature. Our approach successfully identified 2,421 interventions from a corpus of 64,177 research articles in the speech-language pathology domain.
arXiv Detail & Related papers (2024-08-20T16:42:23Z) - Integrating Ontology Design with the CRISP-DM in the context of Cyber-Physical Systems Maintenance [41.85920785319125]
The proposed method is divided into three phases.
In phase one, ontology requirements are systematically specified, defining the relevant knowledge scope.
In phase two, CPS life cycle data is contextualized using domain-specific ontological artifacts.
This formalized domain knowledge is then utilized in the Cross-Industry Standard Process for Data Mining (CRISP-DM) to efficiently extract new insights from the data.
arXiv Detail & Related papers (2024-07-09T15:06:47Z) - Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction [12.455647753787442]
We propose a three-phase framework named Extract-Define-Canonicalize (EDC).
EDC is flexible in that it can be applied to settings where a pre-defined target schema is available and when it is not.
We demonstrate EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works.
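A minimal sketch of the three EDC phases, with `difflib` string similarity standing in for the LLM-driven canonicalization and stubs for the other two phases (the example triples and target schema are hypothetical):

```python
from difflib import get_close_matches

def extract(text):
    """Extract phase (stub): open-vocabulary triples from the input text."""
    return [("EDC", "was proposed in", "a recent paper")]

def define(triples):
    """Define phase (stub): a natural-language gloss for each relation,
    which the real framework would generate with an LLM."""
    return {rel: f"A relation expressing '{rel}'." for _, rel, _ in triples}

def canonicalize(triples, target_schema):
    """Canonicalize phase: map each open relation onto the closest
    relation in the (optional) pre-defined target schema."""
    out = []
    for s, rel, o in triples:
        match = get_close_matches(rel, target_schema, n=1, cutoff=0.0)
        out.append((s, match[0] if match else rel, o))
    return out

schema = ["proposed_in", "located_in", "author_of"]
glosses = define(extract(""))
canonical = canonicalize(extract(""), schema)
```

When no target schema is available, the canonicalize step instead clusters similar relation phrases against each other, which is what makes the framework schema-flexible.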
arXiv Detail & Related papers (2024-04-05T02:53:51Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
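The roofline model bounds attainable throughput by the lesser of peak compute and memory bandwidth times arithmetic intensity; a minimal sketch with hypothetical accelerator numbers:

```python
def attainable_tflops(peak_tflops, mem_bw_tbs, intensity_flops_per_byte):
    """Roofline model: a kernel is capped either by the compute roof
    or by the memory-bandwidth slope, whichever is lower."""
    return min(peak_tflops, mem_bw_tbs * intensity_flops_per_byte)

# Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
# The ridge point is 100 / 2 = 50 FLOPs/byte: kernels below it are
# memory-bound (typical of LLM autoregressive decoding), kernels above
# it are compute-bound (typical of prompt prefill).
decode = attainable_tflops(100.0, 2.0, 10.0)   # memory-bound: 20 TFLOP/s
prefill = attainable_tflops(100.0, 2.0, 80.0)  # compute-bound: 100 TFLOP/s
```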
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Knowledge Plugins: Enhancing Large Language Models for Domain-Specific
Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
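The three steps of the knowledge extractor described above can be sketched as follows; the helper names and the toy knowledge pool are hypothetical, standing in for whatever domain source the paradigm draws on:

```python
def prepare_knowledge(domain_corpus):
    """Step 1: build a task-relevant knowledge pool (topic -> fact)."""
    return {item["topic"]: item["fact"] for item in domain_corpus}

def select_knowledge(pool, sample):
    """Step 2: pick only the facts relevant to this specific sample."""
    return [fact for topic, fact in pool.items() if topic in sample.lower()]

def express_for_llm(sample, facts):
    """Step 3: render the selected knowledge as plain text in the prompt."""
    context = "\n".join(f"- {f}" for f in facts)
    return f"Known facts:\n{context}\n\nQuestion: {sample}"

corpus = [
    {"topic": "ontology", "fact": "Ontologies define classes and relations."},
    {"topic": "roofline", "fact": "Rooflines bound attainable throughput."},
]
pool = prepare_knowledge(corpus)
query = "What is an ontology?"
prompt = express_for_llm(query, select_knowledge(pool, query))
```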
arXiv Detail & Related papers (2023-11-16T07:09:38Z) - Pre-trained Language Models for Keyphrase Generation: A Thorough
Empirical Study [76.52997424694767]
We present an in-depth empirical study of keyphrase extraction and keyphrase generation using pre-trained language models.
We show that PLMs have competitive high-resource performance and state-of-the-art low-resource performance.
Further results show that in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models.
arXiv Detail & Related papers (2022-12-20T13:20:21Z) - Modeling Multi-Granularity Hierarchical Features for Relation Extraction [26.852869800344813]
We propose a novel method to extract multi-granularity features based solely on the original input sentences.
We show that effective structured features can be attained even without external knowledge.
arXiv Detail & Related papers (2022-04-09T09:44:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.