SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
- URL: http://arxiv.org/abs/2507.17178v1
- Date: Wed, 23 Jul 2025 03:52:24 GMT
- Title: SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
- Authors: Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
- Abstract summary: We introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We use a three-stage pipeline to construct SKA-Bench instances, each of which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection.
- Score: 29.88977150203991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) such as KGs and Tables, existing evaluations of SK understanding are non-rigorous (i.e., they lack evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We use a three-stage pipeline to construct SKA-Bench instances, each of which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and that their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination. Our dataset and code are available at https://github.com/Lza12a/SKA-Bench.
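As a rough illustration of the benchmark design described in the abstract, the sketch below shows how a single instance (question, answer, positive knowledge units, noisy knowledge units) could be expanded into the four ability testbeds. All field and function names here are assumptions chosen for exposition; they are not taken from the released dataset or code.

```python
# Illustrative sketch only: the schema and expansion rules below are assumptions
# made for exposition, not the actual SKA-Bench format or pipeline code.
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SKAInstance:
    question: str
    answer: str
    positive_units: List[str]  # knowledge units needed to answer the question
    noisy_units: List[str]     # distractor units drawn from the same source


def noise_robustness(inst: SKAInstance, noise_ratio: float, seed: int = 0) -> List[str]:
    """Mix a controlled amount of noise into the context."""
    rng = random.Random(seed)
    k = min(len(inst.noisy_units), int(len(inst.positive_units) * noise_ratio))
    return inst.positive_units + rng.sample(inst.noisy_units, k)


def order_insensitivity(inst: SKAInstance, seed: int) -> List[str]:
    """Shuffle the same units; the answer should not depend on their order."""
    units = inst.positive_units + inst.noisy_units
    random.Random(seed).shuffle(units)
    return units


def information_integration(inst: SKAInstance) -> List[str]:
    """Keep all positive units; answering requires combining several of them."""
    return list(inst.positive_units)


def negative_rejection(inst: SKAInstance) -> List[str]:
    """Drop the positive units; the model should abstain rather than hallucinate."""
    return list(inst.noisy_units)
```

Under this reading, the reported sensitivity to the amount of noise and the order of knowledge units corresponds to varying the noise ratio and the shuffling seed above.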
Related papers
- OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases [38.58409057214189]
OneEval is a benchmark to assess the knowledge-intensive reasoning capabilities of Large Language Models (LLMs). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval-Hard, consisting of 1,285 particularly difficult cases. We release the OneEval datasets, evaluation scripts, and baseline results publicly, accompanied by a leaderboard to facilitate ongoing advancements in structured knowledge reasoning.
arXiv Detail & Related papers (2025-06-14T17:16:05Z) - Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking [44.66045367454493]
This paper aims to evaluate and rethink the generalization capability of the SKP paradigm from four perspectives: Granularity, Transferability, Scalability, and Universality. We introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty.
arXiv Detail & Related papers (2024-12-31T03:20:22Z) - Reasoning Factual Knowledge in Structured Data with Large Language Models [26.00548862629018]
Large language models (LLMs) have made remarkable progress in various natural language processing tasks.
Structured data possesses unique characteristics that differ from the unstructured texts used for pretraining.
We propose a benchmark named StructFact to evaluate the structural reasoning capabilities of LLMs in inferring factual knowledge.
arXiv Detail & Related papers (2024-08-22T08:05:09Z) - How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency [60.25969380388974]
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs). Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
arXiv Detail & Related papers (2024-07-18T15:20:18Z) - Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs [55.317267269115845]
Chain-of-Knowledge (CoK) is a comprehensive framework for knowledge reasoning.
CoK includes methodologies for both dataset construction and model learning.
We conduct extensive experiments with KnowReason.
arXiv Detail & Related papers (2024-06-30T10:49:32Z) - Large Language Models are Limited in Out-of-Context Knowledge Reasoning [65.72847298578071]
Large Language Models (LLMs) possess extensive knowledge and strong capabilities in performing in-context reasoning.
This paper focuses on a significant aspect of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which requires combining multiple pieces of knowledge to infer new knowledge.
arXiv Detail & Related papers (2024-06-11T15:58:59Z) - A Knowledge-Injected Curriculum Pretraining Framework for Question Answering [70.13026036388794]
We propose a general Knowledge-Injected Curriculum Pretraining framework (KICP) to achieve comprehensive KG learning and exploitation for Knowledge-based question answering tasks.
The KI module first injects knowledge into the LM by generating a KG-centered pretraining corpus, and generalizes the process into three key steps.
The KA module learns knowledge from the generated corpus with the LM equipped with an adapter, while keeping its original natural language understanding ability.
The CR module follows human reasoning patterns to construct three corpora with increasing difficulties of reasoning, and further trains the LM from easy to hard in a curriculum manner.
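A minimal sketch of the KI/KA/CR flow described above is given below; the function names and the generic training callback are assumptions for illustration, not the KICP implementation.

```python
# Minimal sketch of the KI -> KA -> CR flow; names and the training interface
# are assumptions made for exposition, not the paper's actual code.
from typing import Callable, List, Sequence, Tuple


def knowledge_injection(kg_triples: Sequence[Tuple[str, str, str]]) -> List[str]:
    """KI: verbalize KG triples into a KG-centered pretraining corpus."""
    return [f"{head} {relation} {tail}." for head, relation, tail in kg_triples]


def curriculum_pretrain(train_on: Callable[[List[str]], None],
                        corpora_easy_to_hard: List[List[str]]) -> None:
    """CR: train on corpora ordered from easy to hard reasoning difficulty.

    KA: in the paper, updates are routed through an adapter so that the base
    LM's natural language understanding ability is preserved.
    """
    for corpus in corpora_easy_to_hard:
        train_on(corpus)
```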
arXiv Detail & Related papers (2024-03-11T03:42:03Z) - Can Language Models Act as Knowledge Bases at Scale? [24.99538360485476]
Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating responses to complex queries.
Our research investigates whether LLMs can effectively store, recall, and reason with knowledge at a large scale comparable to the latest knowledge bases (KBs) such as Wikidata.
arXiv Detail & Related papers (2024-02-22T04:20:14Z) - Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA).
First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios.
Second, we propose a new "Conscious Incompetence" setting considering the incomplete knowledge repository.
Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z) - Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models [49.23348672822087]
We propose Knowledge Crosswords, a benchmark consisting of incomplete knowledge networks bounded by structured factual constraints.
The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA.
We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords.
arXiv Detail & Related papers (2023-10-02T15:43:53Z) - KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)