Related papers: OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning

URL: http://arxiv.org/abs/2505.11031v3
Date: Thu, 02 Oct 2025 11:25:50 GMT
Title: OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning
Authors: Xiao Zhang, Huiyuan Lai, Qianru Meng, Johan Bos
Abstract summary: Large language models have demonstrated remarkable capabilities across a wide range of tasks, yet their ability to process structured symbolic knowledge remains underexplored. We introduce OntoURL, the first comprehensive benchmark designed to evaluate LLMs' capabilities in handling formal and symbolic representations of domain knowledge. Based on the proposed taxonomy, OntoURL systematically assesses three dimensions: understanding, reasoning, and learning.
Score: 12.649177588353382
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models have demonstrated remarkable capabilities across a wide range of tasks, yet their ability to process structured symbolic knowledge remains underexplored. To address this gap, we propose a taxonomy of ontological capabilities and introduce OntoURL, the first comprehensive benchmark designed to systematically evaluate LLMs' capabilities in handling ontologies -- formal and symbolic representations of domain knowledge. Based on the proposed taxonomy, OntoURL systematically assesses three dimensions: understanding, reasoning, and learning through 15 distinct tasks comprising 57,303 questions derived from 40 ontologies across 8 domains. Experiments with 20 open-source LLMs reveal significant performance differences across models, tasks, and domains, with current LLMs showing capabilities in understanding ontological knowledge but weaknesses in reasoning and learning tasks. Further experiments with few-shot and chain-of-thought prompting illustrate how different prompting strategies affect model performance. Additionally, a human evaluation reveals that LLMs outperform humans in understanding and reasoning tasks but fall short in most learning tasks. These findings highlight both the potential and limitations of LLMs in processing symbolic knowledge and establish OntoURL as a critical benchmark for advancing the integration of LLMs with formal knowledge representations.

Related papers

Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z)
KnowLogic: A Benchmark for Commonsense Reasoning via Knowledge-Driven Data Synthesis [33.72114830484246]
We introduce KnowLogic, a benchmark generated through a knowledge-driven synthetic data strategy. KnowLogic integrates diverse commonsense knowledge, plausible scenarios, and various types of logical reasoning. Our benchmark consists of 3,000 bilingual (Chinese and English) questions across various domains.
arXiv Detail & Related papers (2025-03-08T13:40:10Z)
Reasoning Factual Knowledge in Structured Data with Large Language Models [26.00548862629018]
Large language models (LLMs) have made remarkable progress in various natural language processing tasks. Structured data possesses unique characteristics that differ from the unstructured texts used for pretraining. We propose a benchmark named StructFact to evaluate the structural reasoning capabilities of LLMs in inferring factual knowledge.
arXiv Detail & Related papers (2024-08-22T08:05:09Z)
CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge [44.59258397967782]
Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing tasks. We present a systematic evaluation of state-of-the-art LLMs' complex logical reasoning abilities. We find that LLMs excel at reasoning over general world knowledge but face significant challenges with specialized domain-specific knowledge.
arXiv Detail & Related papers (2024-07-30T05:40:32Z)
Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever [48.5585921817745]
Large Language Models (LLMs) are used to automate the knowledge tagging task. We show strong zero- and few-shot performance on knowledge tagging tasks for math questions. By proposing a reinforcement learning-based demonstration retriever, we successfully exploit the great potential of different-sized LLMs.
arXiv Detail & Related papers (2024-06-19T23:30:01Z)
FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
Do LLMs Dream of Ontologies? [13.776194387957617]
Large Language Models (LLMs) have demonstrated remarkable memorization across diverse natural language processing tasks. This paper investigates the extent to which general-purpose LLMs correctly reproduce concept identifier (ID)-label associations from publicly available resources.
arXiv Detail & Related papers (2024-01-26T15:10:23Z)
From Understanding to Utilization: A Survey on Explainability for Large Language Models [27.295767173801426]
This survey underscores the imperative for increased explainability in Large Language Models (LLMs). Our focus is primarily on pre-trained Transformer-based LLMs, which pose distinctive interpretability challenges due to their scale and complexity. When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, control generation, and model enhancement.
arXiv Detail & Related papers (2024-01-23T16:09:53Z)
Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE. This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
Exploring the Cognitive Knowledge Structure of Large Language Models: An Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence. Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains. We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as general task solvers, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
Metacognitive Prompting Improves Understanding in Large Language Models [12.112914393948415]
We introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes. We conduct experiments on four prevalent Large Language Models (LLMs) across ten natural language understanding (NLU) datasets. MP consistently outperforms existing prompting methods in both general and domain-specific NLU tasks.
arXiv Detail & Related papers (2023-08-10T05:10:17Z)
LLMs4OL: Large Language Models for Ontology Learning [0.0]
We propose the LLMs4OL approach, which utilizes Large Language Models (LLMs) for Ontology Learning (OL). LLMs have shown significant advancements in natural language processing, demonstrating their ability to capture complex language patterns in different knowledge domains. The evaluations encompass diverse genres of ontological knowledge, including lexicosemantic knowledge in WordNet, geographical knowledge in GeoNames, and medical knowledge in UMLS.
arXiv Detail & Related papers (2023-07-31T13:27:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.