Related papers: Explainable Benchmarking through the Lense of Concept Learning

URL: http://arxiv.org/abs/2510.20439v1
Date: Thu, 23 Oct 2025 11:20:20 GMT
Title: Explainable Benchmarking through the Lense of Concept Learning
Authors: Quannian Zhang, Michael Röder, Nikit Srivastava, N'Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo
Abstract summary: This paper argues for a new type of benchmarking, which is dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We compute explanations by using a novel concept learning approach developed for large knowledge graphs called PruneCEL.
Score: 5.957919622462012
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics. The analysis of the evaluation details and the derivation of insights for further development or use remains a tedious manual task with often biased results. Thus, this paper argues for a new type of benchmarking, which is dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using a novel concept learning approach developed for large knowledge graphs called PruneCEL. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points F1 measure. A task-driven user study with 41 participants shows that in 80% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at https://github.com/dice-group/PruneCEL/tree/K-cap2025

Related papers

Easy Data Unlearning Bench [53.1304932656586]
We introduce a unified and benchmarking suite that simplifies the evaluation of unlearning algorithms. By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods.
arXiv Detail & Related papers (2026-02-18T12:20:32Z)
Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study [7.0773305889955616]
Large Language Models (LLMs) have shown impressive performance in code generation. LLMs must understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may yield incomplete.
arXiv Detail & Related papers (2026-01-07T10:23:33Z)
Uncovering Competency Gaps in Large Language Models and Their Benchmarks [11.572508874955659]
We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. We found that models consistently underperformed on concepts that stand in contrast to sycophantic behaviors. Our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores.
arXiv Detail & Related papers (2025-12-06T17:39:47Z)
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks [34.09939383415074]
Benchmark Profiling decomposes benchmark performance into ten cognitively grounded abilities. It explains why performance gains do not always translate into user-perceived competence.
arXiv Detail & Related papers (2025-09-23T15:32:47Z)
Improving LLM Leaderboards with Psychometrical Methodology [0.0]
The rapid development of large language models (LLMs) has necessitated the creation of benchmarks to evaluate their performance. These benchmarks resemble human tests and surveys, as they consist of questions designed to measure emergent properties in the cognitive behavior of these systems. However, unlike the well-defined traits and abilities studied in social sciences, the properties measured by these benchmarks are often vaguer and less rigorously defined.
arXiv Detail & Related papers (2025-01-27T21:21:46Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use [49.574651930395305]
VisIT-Bench is a benchmark for evaluation of instruction-following vision-language models. Our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. We quantify quality gaps between models and references using both human and automatic evaluations.
arXiv Detail & Related papers (2023-08-12T15:27:51Z)
Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields.
arXiv Detail & Related papers (2022-10-11T20:19:11Z)
COLO: A Contrastive Learning based Re-ranking Framework for One-Stage Summarization [84.70895015194188]
We propose a Contrastive Learning based re-ranking framework for one-stage summarization called COLO. COLO boosts the extractive and abstractive results of one-stage systems on CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1 score.
arXiv Detail & Related papers (2022-09-29T06:11:21Z)
Benchmarking Node Outlier Detection on Graphs [90.29966986023403]
Graph outlier detection is an emerging but crucial machine learning task with numerous applications. We present the first comprehensive unsupervised node outlier detection benchmark for graphs called UNOD.
arXiv Detail & Related papers (2022-06-21T01:46:38Z)
The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process. We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies. We describe our initial MCS Benchmark Ontology, a common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.