Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements
- URL: http://arxiv.org/abs/2512.24867v2
- Date: Tue, 06 Jan 2026 09:20:46 GMT
- Title: Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements
- Authors: Yiming Liang, Yizhi Li, Yantao Du, Ge Zhang, Jiayi Zhou, Yuchen Wu, Yinzhu Piao, Denghui Cao, Tong Sun, Ziniu Li, Li Du, Bo Lei, Jiaheng Liu, Chenghua Lin, Zhaoxiang Zhang, Wenhao Huang, Jiajun Zhang
- Abstract summary: Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
- Score: 78.87065404966002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
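To make the composition mechanism concrete, below is a minimal Python sketch of test-time question assembly as the abstract describes it: statements are randomly sampled and combined into one multi-statement question. The pool contents, the falsification step, and the true/false judgment format are illustrative assumptions; the paper's actual question format and sampling procedure are not specified in this listing.

```python
import random

# Hypothetical statement pool; Encyclo-K extracts such statements from
# authoritative textbooks. Only two examples are shown, so a real call
# would need a much larger pool (each question uses 8-10 statements).
STATEMENT_POOL = [
    "Water boils at 100 degrees Celsius at standard atmospheric pressure.",
    "Mitochondria are the primary site of oxidative phosphorylation.",
]

def falsify(statement: str) -> str:
    """Toy perturbation standing in for whatever scheme turns a true
    statement into a false distractor (assumed, not from the paper)."""
    return statement.replace("100", "50")

def compose_question(pool, k=9, n_false=3, seed=None):
    """Randomly sample k statements at test time, falsify n_false of
    them, and return a composed prompt plus the boolean answer key."""
    rng = random.Random(seed)
    sampled = rng.sample(pool, k)
    false_idx = set(rng.sample(range(k), n_false))
    items, key = [], []
    for i, s in enumerate(sampled):
        items.append(falsify(s) if i in false_idx else s)
        key.append(i not in false_idx)
    prompt = "Judge each statement as true or false:\n" + "\n".join(
        f"{i + 1}. {s}" for i, s in enumerate(items))
    return prompt, key
```

Because each item is drawn from a combinatorial space of statement subsets, memorizing previously seen questions buys little, which is the contamination-resistance argument the abstract makes.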
Related papers
- RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
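As a rough illustration of the guided-refinement mode summarized above, the loop below revises a response for up to five turns using grader feedback. `generate` and `grade_against_checklist` are hypothetical stand-ins for an LLM call and RefineBench's checklist grading, neither of which is specified in this listing.

```python
def refine_with_feedback(question, generate, grade_against_checklist,
                         max_turns=5):
    """Guided refinement: keep revising until the checklist is fully
    satisfied or the turn budget (five, per the summary) is exhausted."""
    response = generate(question, prior=None, feedback=None)
    for _ in range(max_turns):
        score, feedback = grade_against_checklist(question, response)
        if score == 1.0:  # all checklist items satisfied
            break
        # Guided mode: targeted feedback is fed back to the model;
        # self-refinement would instead pass feedback=None here.
        response = generate(question, prior=response, feedback=feedback)
    return response
```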
arXiv Detail & Related papers (2025-11-27T07:20:52Z)
- Questionnaire meets LLM: A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses [3.8293581919117123]
Large language models (LLMs) excel at few-shot reasoning over open-ended text. Current retrieval and survey analysis tools are typically designed for humans in the workflow. We introduce QASU, a benchmark that probes six structural skills, including answer lookup, respondent count, and multi-hop inference. Experiments show that choosing an effective format and prompt combination can improve accuracy by up to 8.8 percentage points compared to suboptimal formats.
arXiv Detail & Related papers (2025-10-30T08:18:37Z)
- Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z)
- The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference [13.59675117792588]
Large language models are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification probe, allowing us to dissociate failures of factual access from failures of inference.
arXiv Detail & Related papers (2025-08-14T16:01:10Z)
- MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams [50.293164501645975]
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving. Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge. We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z)
- LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
- Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking [63.84117489519164]
Knowledge Graph Question Answering systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. Despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues. We introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
arXiv Detail & Related papers (2025-05-29T14:44:52Z)
- SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models [36.10798324093408]
SAS-Bench is a benchmark for LLM-based Short Answer Scoring tasks. It provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts.
arXiv Detail & Related papers (2025-05-12T05:43:21Z)
- Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts [20.901983944214532]
We introduce Sequential-NIAH, a benchmark designed to evaluate the capability of large language models to extract sequential information from long contexts. The benchmark includes three needle generation pipelines: synthetic-temporal, real-temporal, and real-logical orders, with context lengths ranging from 8K to 128K. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.50% on the test set of this benchmark.
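For intuition, here is a minimal sketch of how a sequential needle-in-a-haystack item could be assembled: ordered "needles" are scattered through long filler text while their relative order is preserved. The filler, the needle phrasing, and the token budget are assumptions; the benchmark's three generation pipelines (synthetic-temporal, real-temporal, real-logical) are not reproduced here.

```python
import random

def build_haystack(needles, filler_sentences, rng=None):
    """Insert the needles into the filler at random positions while
    preserving their relative (e.g., temporal) order."""
    rng = rng or random.Random(0)
    positions = sorted(
        rng.sample(range(len(filler_sentences) + 1), len(needles)))
    out, n = [], 0
    for i, sentence in enumerate(filler_sentences):
        while n < len(needles) and positions[n] == i:
            out.append(needles[n])
            n += 1
        out.append(sentence)
    out.extend(needles[n:])  # needles assigned to the final slot
    return " ".join(out)

# Toy usage: three ordered events hidden among filler sentences.
needles = [
    "First, the seed was planted.",
    "Later, the sprout emerged.",
    "Finally, the flower bloomed.",
]
filler = ["Filler sentence number %d." % i for i in range(200)]
context = build_haystack(needles, filler)
```

The evaluation question then asks the model to list the needles in their original order, which is exactly the sequential-extraction skill the benchmark probes.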
arXiv Detail & Related papers (2025-04-07T03:50:12Z)