Related papers: Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models

Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models

URL: http://arxiv.org/abs/2601.12951v1
Date: Mon, 19 Jan 2026 10:58:24 GMT
Title: Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models
Authors: Felix Mächtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth,
Abstract summary: This paper investigates whether Large Language Models' code-comprehension performance aligns with traditional human-centric software metrics.<n>We introduce a diagnostic framework that reframes code understanding as a binary input-output consistency task.
Score: 4.841487377596519
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper investigates whether LLMs' code-comprehension performance aligns with traditional human-centric software metrics or instead reflects distinct, non-human regularities. We introduce a diagnostic framework that reframes code understanding as a binary input-output consistency task, enabling the evaluation of classification and generative models. Using a large-scale dataset, we correlate model performance with traditional, human-centric complexity metrics, such as lexical size, control-flow complexity, and abstract syntax tree structure. Our analyses reveal minimal correlation between human-defined metrics and LLM success (AUROC 0.63), while shadow models achieve substantially higher predictive performance (AUROC 0.86), capturing complex, partially predictable patterns beyond traditional software measures. These findings suggest that LLM comprehension reflects model-specific regularities only partially accessible through either human-designed or learned features, emphasizing the need for benchmark methodologies that move beyond aggregate accuracy and toward instance-level diagnostics, while acknowledging fundamental limits in predicting correct outcomes.

Related papers

LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs)<n>It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code.<n> LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z)
Uncovering the Computational Ingredients of Human-Like Representations in LLMs [8.00888290370075]
It remains unclear which of these ingredients are most crucial for building models that develop human-like representations.<n>Most current benchmarks are not suited to measuring representational alignment between humans and models.
arXiv Detail & Related papers (2025-10-01T15:37:19Z)
Large Language Models as Universal Predictors? An Empirical Study on Small Tabular Datasets [0.0]
Large Language Models (LLMs) can perform predictive tasks over structured inputs without explicit fine-tuning on downstream tasks.<n>We investigate the empirical function approximation capability of LLMs on small-scale structured datasets for classification, regression and clustering tasks.<n>Our findings suggest that LLMs can serve as general-purpose predictive engines for structured data, with clear strengths in classification and significant limitations in regression and clustering.
arXiv Detail & Related papers (2025-08-24T15:00:51Z)
Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning [12.054910727620154]
Eye-tracking data reveals valuable insights into users' cognitive states but is difficult to analyze due to its structured, non-linguistic nature.<n>This paper presents a multimodal human-AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals.
arXiv Detail & Related papers (2025-07-24T09:49:53Z)
SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
textbfSCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models.<n>SCAN incorporates four key components:.<n>TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy;.<n>RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag;.<n>A PC$2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach achieves significantly higher accuracy compared to classic LLM-as-a-Judge method
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox [46.39670209441478]
Large language models (LLMs) have exhibited exciting progress in multiple scenarios. As an effective means to reduce memory footprint and inference cost, quantization also faces challenges in performance degradation at low bit-widths. This work provides a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox.
arXiv Detail & Related papers (2024-06-15T12:02:14Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match that achieves these goals. Model-to-Match uses variable importance measurements to construct a distance metric. We operationalize the Model-to-Match framework with LASSO.
arXiv Detail & Related papers (2023-02-23T00:43:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.