Bench4KE: Benchmarking Automated Competency Question Generation
- URL: http://arxiv.org/abs/2505.24554v2
- Date: Wed, 04 Jun 2025 09:08:05 GMT
- Title: Bench4KE: Benchmarking Automated Competency Question Generation
- Authors: Anna Sofia Lippolis, Minh Davide Ragagni, Paolo Ciancarini, Andrea Giovanni Nuzzolese, Valentina Presutti,
- Abstract summary: Bench4KE is an API-based benchmarking system for Knowledge Engineering automation.<n>It provides a curated gold standard consisting of CQ datasets from four real-world ontology projects.<n>It uses a suite of similarity metrics to assess the quality of the CQs generated.
- Score: 1.2512982702508668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation, a trend already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs). However, the evaluation of these tools lacks standardisation. This undermines the methodological rigour and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. Its first release focuses on evaluating tools that generate CQs automatically. CQs are natural language questions used by ontology engineers to define the functional requirements of an ontology. Bench4KE provides a curated gold standard consisting of CQ datasets from four real-world ontology projects. It uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of four recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.
Related papers
- Collaborative LLM Agents for C4 Software Architecture Design Automation [0.0]
This study contributes to automated software architecture design and its evaluation methods.<n>We introduce an LLM-based multi-agent system that automates the production of a C4 software architecture model.<n>Tested on five canonical system briefs, the workflow demonstrates fast C4 model creation, sustains high compilation success, and delivers semantic fidelity.
arXiv Detail & Related papers (2025-10-26T18:43:59Z) - AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment [69.06977852423564]
Image quality assessment (IQA) reflects both the quantification and interpretation of perceptual quality rooted in the human visual system.<n>AgenticIQA decomposes IQA into four subtasks -- distortion detection, distortion analysis, tool selection, and tool execution.<n>To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents.
arXiv Detail & Related papers (2025-09-30T09:37:01Z) - SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection [81.78173888579941]
Large Language Models (LLMs) are considered a well-suited method to increase the quality of the question-answering functionality.<n>LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data.<n>This paper introduces a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question.
arXiv Detail & Related papers (2025-07-18T12:28:08Z) - The benefits of query-based KGQA systems for complex and temporal questions in LLM era [55.20230501807337]
Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions.<n> Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers.<n>We explore multi-stage query-based framework for WikiData QA, proposing multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks.
arXiv Detail & Related papers (2025-07-16T06:41:03Z) - AMAQA: A Metadata-based QA Dataset for RAG Systems [7.882922366782987]
We present AMAQA, a new open-access QA dataset designed to evaluate tasks combining text and metadata.<n>AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups.<n>We show that leveraging metadata boosts accuracy from 0.12 to 0.61, highlighting the value of structured context.
arXiv Detail & Related papers (2025-05-19T08:59:08Z) - Architecture for a Trustworthy Quantum Chatbot [0.0]
This article describes the latest version (2.0) of C4Q, which delivers several enhancements.<n>C4Q 2.0's classification LLM reaches near-perfect accuracy.<n>The evaluation consists in a comparative study with three existing chatbots highlighting C4Q 2.0's maintainability and correctness.
arXiv Detail & Related papers (2025-03-06T16:43:23Z) - iTRI-QA: a Toolset for Customized Question-Answer Dataset Generation Using Language Models for Enhanced Scientific Research [1.2411445143550854]
We present a tool for the development of a customized question-answer (QA) dataset, called Interactive Trained Research Innovator (iTRI) - QA.<n>Our approach integrates curated QA datasets with a specialized research paper dataset to enhance responses' contextual relevance and accuracy using fine-tuned LM.<n>This pipeline provides a dynamic and domain-specific QA system that will be applied for future research LM deployment.
arXiv Detail & Related papers (2025-01-27T23:38:39Z) - Discerning and Characterising Types of Competency Questions for Ontologies [0.4757470449749875]
Competency Questions (CQs) are widely used in ontology development by guiding, among others, the scoping and validation stages.<n>Very limited guidance exists for formulating CQs and assessing whether they are good CQs, leading to issues such as ambiguity and unusable formulations.<n>This paper contributes to such theoretical foundations by analysing questions, their uses, and the myriad of development tasks.
arXiv Detail & Related papers (2024-12-18T10:26:29Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications.<n>We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.<n>We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - Improving Text Matching in E-Commerce Search with A Rationalizable,
Intervenable and Fast Entity-Based Relevance Model [78.80174696043021]
We propose a novel model called the Entity-Based Relevance Model (EBRM)
The decomposition allows us to use a Cross-encoder QE relevance module for high accuracy.
We also show that pretraining the QE module with auto-generated QE data from user logs can further improve the overall performance.
arXiv Detail & Related papers (2023-07-01T15:44:53Z) - PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question
Answering Research and Development [24.022050096797606]
PRIMEQA is a one-stop QA repository with an aim to democratize QA re-search and facilitate easy replication of state-of-the-art (SOTA) QA methods.
It supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation.
It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on pub-lic benchmarks, and expanding pre-existing methods.
arXiv Detail & Related papers (2023-01-23T20:43:26Z) - Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents.
This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models.
We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
arXiv Detail & Related papers (2022-12-16T18:23:43Z) - Generative Language Models for Paragraph-Level Question Generation [79.31199020420827]
Powerful generative models have led to recent progress in question generation (QG)
It is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches.
We introduce QG-Bench, a benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting.
arXiv Detail & Related papers (2022-10-08T10:24:39Z) - Retrieving and Reading: A Comprehensive Survey on Open-domain Question
Answering [62.88322725956294]
We review the latest research trends in OpenQA, with particular attention to systems that incorporate neural MRC techniques.
We introduce modern OpenQA architecture named Retriever-Reader'' and analyze the various systems that follow this architecture.
We then discuss key challenges to developing OpenQA systems and offer an analysis of benchmarks that are commonly used.
arXiv Detail & Related papers (2021-01-04T04:47:46Z) - Generating Diverse and Consistent QA pairs from Contexts with
Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.