ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs
- URL: http://arxiv.org/abs/2601.17399v1
- Date: Sat, 24 Jan 2026 09:57:59 GMT
- Title: ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs
- Authors: Rui Fang, Jian Li, Wei Chen, Bin Hu, Ying-Cong Chen, Xin Tang, Liang Diao,
- Abstract summary: We present ReLE, a scalable system designed to diagnose Capability Anisotropy. We evaluate 304 models across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples.
- Score: 37.23311145049677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenged by benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) a Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) a Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70\% compared to full-pass evaluations while maintaining a ranking correlation of $\rho = 0.96$. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus $\sim$5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
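The variance-aware scheduling idea can be illustrated with a minimal sketch of classical Neyman allocation. All function names, domain sizes, and standard deviations below are illustrative assumptions, not taken from the ReLE implementation, and the paper's noise correction is omitted:

```python
# Minimal sketch of Neyman allocation: given per-domain (stratum) pool sizes
# and per-domain score standard deviations, spend an evaluation budget
# proportionally to N_h * sigma_h, so noisy domains get sampled more densely.
# Names and numbers are illustrative, not from the ReLE paper.
def neyman_allocation(budget, pool_sizes, score_stds):
    weights = [n * s for n, s in zip(pool_sizes, score_stds)]
    total = sum(weights)
    return [round(budget * w / total) for w in weights]

# Example: three domains; the mid-sized but high-variance domain
# receives the largest share of the budget.
alloc = neyman_allocation(1000, pool_sizes=[500, 300, 200], score_stds=[0.2, 0.4, 0.1])
```

Under this scheme a low-variance domain can be covered with few samples at little cost to ranking fidelity, which is where the compute savings come from.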
Related papers
- Beyond Accuracy: A Unified Random Matrix Theory Diagnostic Framework for Crash Classification Models [6.908972852063454]
We introduce a diagnostic framework grounded in Random Matrix Theory (RMT) and Heavy-Tailed Self-Regularization (HTSR). We evaluate nine model families on two Iowa DOT crash classification tasks (173,512 and 371,062 records respectively). We find that the power-law exponent $\alpha$ provides a structural quality signal: well-regularized models consistently yield $\alpha$ within $[2, 4]$ (mean $2.87 \pm 0.34$). We propose an $\alpha$-based early stopping criterion and a spectral model selection protocol, and validate both against cross-validated F
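As a hedged illustration of the $\alpha$ criterion (not the paper's code): the exponent can be estimated from the tail of an eigenvalue spectrum with a maximum-likelihood (Hill-type) fit. The eigenvalues below are synthetic, and `xmin` is assumed given rather than fitted:

```python
import math

# Illustrative sketch: MLE (Hill-type) estimate of a power-law exponent
# alpha from the tail of an eigenvalue spectrum, plus the [2, 4]
# "well-regularized" band described above. In practice xmin must be
# fitted too; here it is assumed known.
def powerlaw_alpha(eigenvalues, xmin):
    tail = [e for e in eigenvalues if e >= xmin]
    return 1.0 + len(tail) / sum(math.log(e / xmin) for e in tail)

def well_regularized(alpha, lo=2.0, hi=4.0):
    return lo <= alpha <= hi

# Synthetic tail drawn from a true alpha = 3 power law via inverse transform.
xs = [(1.0 - (i + 0.5) / 1000) ** (-0.5) for i in range(1000)]
alpha_hat = powerlaw_alpha(xs, xmin=1.0)
```

On this synthetic spectrum the estimate lands near 3, inside the $[2, 4]$ band.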
arXiv Detail & Related papers (2026-02-23T05:42:54Z)
- From Global to Granular: Revealing IQA Model Performance via Correlation Surface [83.65597122328133]
We present Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models.
arXiv Detail & Related papers (2026-01-29T13:55:26Z)
- LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics [23.99262273166077]
The proliferation of Large Language Models (LLMs) and diverse specialized benchmarks requires a shift from fragmented, task-specific metrics to a holistic, competitive ranking system. We introduce the novel Competitive Swiss-System Dynamics (CSD) framework, which simulates a sequential contest. CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models.
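A single round of the Swiss-system idea can be sketched as follows. This is illustrative only; the `beats` predicate and all model names are assumptions, and the actual CSD framework's pairing and scoring rules are more elaborate:

```python
# Illustrative sketch of one Swiss-system round: models with similar running
# scores are paired, and each pairwise winner gains a point. `beats` stands in
# for any head-to-head comparison on a shared benchmark task.
def swiss_round(scores, beats):
    ranked = sorted(scores, key=scores.get, reverse=True)  # stable sort
    for a, b in zip(ranked[::2], ranked[1::2]):
        scores[a if beats(a, b) else b] += 1
    return scores

# Toy head-to-head: lower model id wins (purely for demonstration).
standings = swiss_round({"m1": 0, "m2": 0, "m3": 0, "m4": 0},
                        beats=lambda a, b: a < b)
```

Repeating such rounds concentrates comparisons among closely matched models, which is what makes the resulting ranking context-aware rather than a flat average.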
arXiv Detail & Related papers (2025-12-24T07:14:31Z)
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z)
- Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model using coordination metrics, validated with cross-validated $R^2$, enabling prediction on unseen task domains. We identify three effects, including: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
- Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency [5.771786260272727]
We present a framework, BRACE, to benchmark Code Language Models on a unified scale of energy efficiency and functional correctness. We propose two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). Our analysis reveals that models generally perform better on code summarization tasks, as they are not forced to generate grammar-conformant, syntactically correct output.
arXiv Detail & Related papers (2025-11-10T23:44:48Z)
- An Empirical Study of SOTA RCA Models: From Oversimplified Benchmarks to Realistic Failures [16.06503310632004]
We show that simple rule-based methods can match or even outperform state-of-the-art (SOTA) models on four widely used benchmarks. Our analysis highlights three common failure patterns: scalability issues, observability blind spots, and modeling bottlenecks.
arXiv Detail & Related papers (2025-10-06T11:30:03Z)
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts [78.79936076607373]
We introduce CNS-Bench, a Continuous Nuisance Shift Benchmark that quantifies the robustness of image classifiers under continuous and realistic nuisance shifts. We propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models.
arXiv Detail & Related papers (2025-07-23T16:15:48Z)
- RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
- KAIROS: Scalable Model-Agnostic Data Valuation [8.766103946679435]
KAIROS is a scalable, model-agnostic valuation framework that assigns each example a distributional influence score. KAIROS consistently outperforms state-of-the-art model-, Shapley-, and Wasserstein-based baselines in both accuracy and runtime.
arXiv Detail & Related papers (2025-06-30T12:44:28Z)
- Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
- Benign Overfitting in Out-of-Distribution Generalization of Linear Models [19.203753135860016]
We take an initial step towards understanding benign overfitting in the Out-of-Distribution (OOD) regime. We provide non-asymptotic guarantees proving that benign overfitting occurs in standard ridge regression. We also present theoretical results for a more general family of target covariance matrices.
arXiv Detail & Related papers (2024-12-19T02:47:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.