Related papers: STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples

URL: http://arxiv.org/abs/2508.12096v2
Date: Wed, 20 Aug 2025 09:52:00 GMT
Title: STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples
Authors: Haiquan Hu, Jiazhi Jiang, Shiyou Xu, Ruhan Zeng, Tian Wang,
Abstract summary: Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. We propose the Structured Transition Evaluation Method (STEM) as a lightweight and interpretable evaluation framework.
Score: 3.41981716024098
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. While recent models often achieve higher scores on standard benchmarks, these improvements do not consistently reflect enhanced real-world reasoning capabilities. Moreover, widespread overfitting to public benchmarks and the high computational cost of full evaluations have made it both expensive and less effective to distinguish meaningful differences between models. To address these challenges, we propose the Structured Transition Evaluation Method (STEM), a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs. STEM identifies significant transition samples (STS) by analyzing consistent performance transitions among LLMs of the same architecture but varying parameter scales. These samples enable STEM to effectively estimate the capability position of an unknown model. The Qwen3 model family is used to construct the STS pool on six diverse and representative benchmarks. To assess generalizability. Experimental results indicate that STEM reliably captures performance trends and aligns with ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs.
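The abstract describes the idea only at a high level, so the following is a minimal sketch of one way it could be read, not the authors' implementation. It assumes that a "transition sample" is one that every reference model below some scale gets wrong and every model at or above that scale gets right, and that an unknown model is placed at the reference level whose expected answers best match its own. The function names (`transition_index`, `build_sts_pool`, `estimate_position`) and the toy data are hypothetical.

```python
from typing import Dict, List, Optional


def transition_index(outcomes: List[bool]) -> Optional[int]:
    """Return the index of the first reference model that solves the sample,
    provided the outcomes form a clean step (all False, then all True).
    Return None when no such step exists. `outcomes` is ordered by
    ascending parameter scale (an assumption of this sketch)."""
    if True not in outcomes:
        return None  # no model solves it, so no transition is observed
    k = outcomes.index(True)
    if k == 0:
        return None  # already solved by the smallest model
    if all(outcomes[k:]):
        return k
    return None  # inconsistent pattern, e.g. False, True, False, True


def build_sts_pool(results: Dict[str, List[bool]]) -> Dict[str, int]:
    """Map each sample id with a clean transition to its transition index."""
    pool: Dict[str, int] = {}
    for sample_id, outcomes in results.items():
        k = transition_index(outcomes)
        if k is not None:
            pool[sample_id] = k
    return pool


def estimate_position(pool: Dict[str, int],
                      answers: Dict[str, bool],
                      n_models: int) -> int:
    """Return the reference level (0 .. n_models - 1) whose expected
    correctness pattern agrees with the unknown model on the most samples.
    A model at level p is expected to solve samples with index <= p.
    Ties are broken toward the lower level."""
    best_level, best_score = 0, -1
    for level in range(n_models):
        score = sum(
            1 for sample_id, k in pool.items()
            if answers.get(sample_id, False) == (k <= level)
        )
        if score > best_score:
            best_level, best_score = level, score
    return best_level


# Toy data: four hypothetical reference models, smallest first.
reference_results = {
    "s1": [False, True, True, True],
    "s2": [False, False, True, True],
    "s3": [False, False, False, True],
    "s4": [True, False, True, True],    # solved by the smallest model
    "s5": [False, True, False, True],   # inconsistent, excluded
    "s6": [False, False, False, False], # never solved, excluded
}
sts_pool = build_sts_pool(reference_results)
unknown_answers = {"s1": True, "s2": True, "s3": False}
level = estimate_position(sts_pool, unknown_answers, n_models=4)
print(sts_pool, level)
```

Only the samples in the pool need to be run on the unknown model, which is where the cost saving in this reading would come from.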

Related papers

Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities [2.9203730377983654]
Existing benchmarks fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC, a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints. We show that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position.
arXiv Detail & Related papers (2026-01-26T15:02:15Z)
ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling [18.8099769877788]
ORGEval is a graph-theoretic evaluation framework for assessing Large Language Models' capabilities in formulating linear and mixed-integer linear programs. We show that ORGEval can successfully detect model equivalence and produce 100% consistent results across random parameter configurations. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting.
arXiv Detail & Related papers (2025-10-31T16:35:52Z)
Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling [17.092510377905814]
Evaluating multimodal large language models (MLLMs) remains a fundamental challenge due to a lack of structured, interpretable, and theoretically grounded benchmark designs. We propose a novel framework for aligning MLLM benchmark based on Structural Equation Modeling (SEM) to analyze and quantify the internal validity, dimensional separability, and contribution of benchmark components. Experimental results demonstrate that the proposed benchmark exhibits stronger interpretability, reduced indicator redundancy, and clearer cognitive consistency compared to existing approaches.
arXiv Detail & Related papers (2025-06-13T08:04:56Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced fact-checking studies. Existing automated fact-checking evaluation methods rely on static datasets and classification metrics. We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z)
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
We introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts. We employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency.
arXiv Detail & Related papers (2024-12-17T18:12:47Z)
Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment. To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, and architectures profoundly impact the performance of LLMs. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
An Examination of the Compositionality of Large Generative Vision-Language Models [7.639748270719836]
Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs.
arXiv Detail & Related papers (2023-08-21T06:50:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.