Large Language Models Assisting Ontology Evaluation
- URL: http://arxiv.org/abs/2507.14552v1
- Date: Sat, 19 Jul 2025 09:13:51 GMT
- Title: Large Language Models Assisting Ontology Evaluation
- Authors: Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese
- Abstract summary: Ontology evaluation through functional requirements is costly, labour-intensive, and error-prone. We introduce OE-Assist, a novel framework designed to assist ontology evaluation through automated and semi-automated verification.
- Score: 1.099532646524593
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Ontology evaluation through functional requirements, such as testing via competency question (CQ) verification, is a well-established yet costly, labour-intensive, and error-prone endeavour, even for ontology engineering experts. In this work, we introduce OE-Assist, a novel framework designed to assist ontology evaluation through automated and semi-automated CQ verification. By presenting and leveraging a dataset of 1,393 CQs paired with corresponding ontologies and ontology stories, we present, to our knowledge, the first systematic investigation into large language model (LLM)-assisted ontology evaluation. Our contributions include: (i) evaluating the effectiveness of an LLM-based approach for automatically performing CQ verification against a manually created gold standard, and (ii) developing and assessing an LLM-powered framework that assists CQ verification in Protégé by providing suggestions. We found that automated LLM-based evaluation with o1-preview and o3-mini performs at a level similar to that of the average user.
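To make contribution (i) concrete, the following is a minimal sketch of how automated CQ verification might be scripted with the OpenAI Python client. The prompt wording, the YES/NO output format, and the answer parsing are illustrative assumptions, not the actual OE-Assist prompts or evaluation criteria; in the paper's setting, each resulting judgement would then be compared against the manually created gold standard.

```python
# Hypothetical sketch of LLM-based CQ verification (not the OE-Assist implementation).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are evaluating an OWL ontology against a competency question (CQ).

Ontology (Turtle):
{ontology}

Competency question:
{cq}

Can the ontology, as modelled, answer this competency question?
Answer with exactly one word: YES or NO."""


def verify_cq(ontology_ttl: str, cq: str, model: str = "o3-mini") -> bool:
    """Return True if the model judges the CQ to be answerable by the ontology."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(ontology=ontology_ttl, cq=cq)}],
    )
    # Parse the model's one-word verdict; anything other than YES counts as a failure.
    return response.choices[0].message.content.strip().upper().startswith("YES")


if __name__ == "__main__":
    toy_ontology = """@prefix ex: <http://example.org/> .
ex:Paper a ex:Document ; ex:hasAuthor ex:Author ."""
    print(verify_cq(toy_ontology, "Who is the author of a given paper?"))
```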
Related papers
- Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models [6.0158981171030685]
Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose underlying assumptions.
Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards.
This paper presents a comprehensive approach to support the development and benchmarking of systems for this task.
arXiv Detail & Related papers (2025-05-16T15:08:04Z)
- The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations.
This approach was originally developed for the TREC Question Answering (QA) Track in 2003.
We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
- Ontology Generation using Large Language Models [1.0037949839020768]
We present and evaluate two new prompting techniques for automated ontology development: Memoryless CQbyCQ and Ontogenia.
Trials show that OpenAI o1-preview with Ontogenia produces ontologies of sufficient quality to meet the requirements of engineers.
arXiv Detail & Related papers (2025-03-07T13:03:28Z)
- PanguIR Technical Report for NTCIR-18 AEOLLM Task [12.061652026366591]
Large language models (LLMs) are increasingly critical and challenging to evaluate.
Manual evaluation, while comprehensive, is often costly and resource-intensive.
Automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria.
arXiv Detail & Related papers (2025-03-04T07:40:02Z)
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition [4.125007507808684]
This study proposes an effective AIGC quality assessment (QA) framework.
First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder.
Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features.
arXiv Detail & Related papers (2024-04-20T07:05:45Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automatize the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)