Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores
- URL: http://arxiv.org/abs/2510.14966v1
- Date: Thu, 16 Oct 2025 17:59:25 GMT
- Title: Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores
- Authors: Zachary Robertson,
- Abstract summary: We show that averaging TVD-MI's binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. We derive this clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation.
- Score: 3.959606869996232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pairwise comparisons of large language models using total variation distance mutual information (TVD-MI) produce binary critic decisions per pair. We show that averaging TVD-MI's binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. Maximum-likelihood approaches to IRT use logistic links, but we find empirically that these transformations introduce curvature that breaks additivity: across three domains, the identity link yields median curl on raw data of 0.080-0.150 (P95 = [0.474, 0.580]), whereas probit/logit introduce substantially higher violations (median [0.245, 0.588], P95 [0.825, 2.252]). We derive a clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation. At 33% coverage, we achieve holdout RMSE $0.117 \pm 0.008$ while preserving agent rankings (Spearman $\rho = 0.972 \pm 0.015$), using three times fewer evaluations than full dense evaluation. Judge robustness analysis (GPT-4o-mini vs. Llama3-70b) shows strong agreement in agent rankings ($\rho = 0.872$) and a consistent identity-link advantage. Identity mapping best preserves TVD-MI's geometry for efficient LLM evaluation, and the approach applies to other bounded-response domains.
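As a concrete illustration of the fitting step the abstract describes, below is a minimal sketch (on synthetic data, not the authors' implementation) of an identity-link additive fit to centered TVD-MI scores via box-constrained least squares. Bounding the parameters themselves in [-0.5, 0.5] and mean-centering the abilities for identifiability are assumptions of this sketch, not details taken from the paper.

```python
# Minimal sketch: identity-link additive fit score[a, q] ~ theta[a] + beta[q]
# to centered TVD-MI scores in [-0.5, 0.5], via box-constrained least squares.
import numpy as np
from scipy.optimize import lsq_linear
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_agents, n_items = 8, 20

# Stand-in for averaged TVD-MI binary trials: centered probabilities in [-0.5, 0.5].
theta_true = rng.uniform(-0.3, 0.3, n_agents)   # agent "abilities"
beta_true = rng.uniform(-0.2, 0.2, n_items)     # item "difficulties"
scores = np.clip(theta_true[:, None] + beta_true[None, :]
                 + 0.05 * rng.standard_normal((n_agents, n_items)), -0.5, 0.5)

# Design matrix for the additive model: one indicator column per agent and per item.
A = np.zeros((n_agents * n_items, n_agents + n_items))
y = np.zeros(n_agents * n_items)
for a in range(n_agents):
    for q in range(n_items):
        r = a * n_items + q
        A[r, a] = 1.0
        A[r, n_agents + q] = 1.0
        y[r] = scores[a, q]

# Box constraints keep parameters inside the centered-probability range; bounding
# the parameters (rather than the fitted values) is an assumption of this sketch.
fit = lsq_linear(A, y, bounds=(-0.5, 0.5))
theta_hat = fit.x[:n_agents] - fit.x[:n_agents].mean()  # center for identifiability

rho, _ = spearmanr(theta_hat, theta_true)
print(f"recovered-ranking Spearman rho: {rho:.3f}")
```

Because the identity link keeps the fit linear in the parameters, pairwise score differences stay path-independent up to noise, which is consistent with the abstract's observation that probit/logit links introduce curvature and larger "curl" violations.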
Related papers
- Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
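A toy simulation (not the paper's martingale framework) shows what the quoted behavior looks like: a constant per-step bias gives linear growth of the cumulative distortion, while zero-mean bounded noise yields deviations around the mean on the order of $\sqrt{T}$. The bias and noise scale below are hypothetical.

```python
# Toy illustration of linear drift with sqrt(T)-scale deviations.
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 1000, 200
bias, noise_scale = 0.05, 1.0                 # hypothetical per-step distortion terms

steps = bias + noise_scale * rng.uniform(-1, 1, size=(n_runs, T))
distortion = steps.cumsum(axis=1)             # cumulative distortion per run

mean_path = distortion.mean(axis=0)           # grows roughly like bias * t (linear)
spread = distortion.std(axis=0)               # grows roughly like noise_scale * sqrt(t/3)

print(mean_path[-1] / (bias * T))                      # ~1: linear growth
print(spread[-1] / (noise_scale * np.sqrt(T / 3)))     # ~1: sqrt(T)-scale deviations
```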
arXiv Detail & Related papers (2026-02-10T21:08:53Z)
- Almost Asymptotically Optimal Active Clustering Through Pairwise Observations [59.20614082241528]
We propose a new analysis framework for clustering $M$ items into an unknown number $K$ of distinct groups using noisy and actively collected responses. We establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the accuracy of the clustering. We develop a computationally feasible variant of the Generalized Likelihood Ratio statistic and show that its performance gap to the lower bound can be accurately estimated empirically.
arXiv Detail & Related papers (2026-02-05T14:16:47Z)
- Spectral Sentinel: Scalable Byzantine-Robust Decentralized Federated Learning via Sketched Random Matrix Theory on Blockchain [0.0]
Byzantine clients poison gradients under heterogeneous (Non-IID) data. We propose Spectral Sentinel, a Byzantine detection and aggregation framework. We implement the full system with blockchain integration on Polygon networks.
arXiv Detail & Related papers (2025-12-14T09:43:03Z)
- Temporal Zoom Networks: Distance Regression and Continuous Depth for Efficient Action Localization [6.908972852063454]
Temporal action localization requires both precise boundary detection and computational efficiency. We address this through two complementary innovations: Boundary Distance Regression (BDR) and Adaptive Temporal Refinement (ATR). On THUMOS14, our method achieves 56.5% mAP@0.7 with 151G FLOPs, using 36% fewer FLOPs than ActionFormer++ (55.7% mAP@0.7 at 235G).
arXiv Detail & Related papers (2025-11-06T00:41:54Z)
- Real-time nonlinear inversion of magnetic resonance elastography with operator learning [0.06797079068199119]
The oNLI framework enables real-time MRE inversion (a 30,000x speedup), producing elastograms with spatial accuracy comparable to NLI. A structural prior mechanism, analogous to Soft Prior Regularization in the MRE literature, was incorporated to improve spatial accuracy.
arXiv Detail & Related papers (2025-10-03T08:55:40Z)
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
- Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems [3.215065407261898]
Multi-agent systems that combine large language models with external tools are rapidly transitioning from research laboratories into high-stakes domains. This "Advanced" sequel fills that gap by providing an algorithmic instantiation and empirical evidence. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9%.
arXiv Detail & Related papers (2025-08-28T15:52:49Z)
- ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction [25.85736569130897]
Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks. We show that non-transitive preferences stem largely from low-quality data containing inherently ambiguous preference pairs. We propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs.
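For intuition, here is a minimal sketch of the kind of check a tournament-graph view enables on hypothetical preference data: counting non-transitive 3-cycles among pairwise judgments. This illustrates non-transitivity detection only, not the ELSPR reconstruction algorithm.

```python
# Count non-transitive triples in a pairwise-preference tournament (toy data).
from itertools import permutations

models = ["m1", "m2", "m3", "m4"]
# beats[(a, b)] = True means model a was preferred over model b (hypothetical).
beats = {("m1", "m2"): True, ("m2", "m3"): True, ("m3", "m1"): True,  # a 3-cycle
         ("m1", "m4"): True, ("m2", "m4"): True, ("m3", "m4"): True}

def wins(a, b):
    return beats[(a, b)] if (a, b) in beats else not beats[(b, a)]

cycles = [
    (a, b, c)
    for a, b, c in permutations(models, 3)
    if wins(a, b) and wins(b, c) and wins(c, a)
]
# Each non-transitive triple appears in 3 rotations; deduplicate by sorting.
unique = {tuple(sorted(t)) for t in cycles}
print(f"non-transitive triples: {len(unique)}")  # -> 1, namely (m1, m2, m3)
```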
arXiv Detail & Related papers (2025-05-23T10:00:03Z)
- Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning [15.776175440446414]
We introduce Dr.SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation. Dr.SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We preference-tune Llama-3-8B-Instruct using data annotated by Dr.SoW.
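The reward itself is stated plainly enough to sketch: score a response by the log-density ratio of a stronger over a weaker causal LM. In the sketch below, the model names are placeholders and reusing one tokenizer for both models is an assumption, not something the summary specifies.

```python
# Hedged sketch of a log-density-ratio reward: log p_strong(y|x) - log p_weak(y|x).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tok, prompt, response):
    """Sum of token log-probabilities of `response` given `prompt` (seam is approximate)."""
    enc = tok(prompt + response, return_tensors="pt")
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**enc).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = enc.input_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.numel()), targets]
    return token_lp[prompt_len - 1:].sum().item()  # only the response tokens

# Placeholder model names; the method pairs a better-aligned ("strong") model
# with a less-aligned ("weak") one.
strong = AutoModelForCausalLM.from_pretrained("strong-model-name")
weak = AutoModelForCausalLM.from_pretrained("weak-model-name")
tok = AutoTokenizer.from_pretrained("strong-model-name")

def reward(prompt, response):
    return (sequence_logprob(strong, tok, prompt, response)
            - sequence_logprob(weak, tok, prompt, response))
```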
arXiv Detail & Related papers (2024-11-04T18:54:39Z)
- Foundation Models for Structural Health Monitoring [14.36493796970864]
We propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for Structural Health Monitoring. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training. We showcase the effectiveness of our foundation models using data from three operational viaducts.
arXiv Detail & Related papers (2024-04-03T13:32:44Z)
- Federated Learning Resilient to Byzantine Attacks and Data Heterogeneity [59.17297282373628]
This paper addresses federated learning (FL) in the presence of Byzantine attacks and data heterogeneity. We introduce a novel Robust Average Gradient Algorithm (RAGA), which uses median-based aggregation, with convergence analysis covering both strongly convex and non-convex loss functions.
arXiv Detail & Related papers (2024-03-20T08:15:08Z)
- DFedADMM: Dual Constraints Controlled Model Inconsistency for Decentralized Federated Learning [52.83811558753284]
Decentralized federated learning (DFL) discards the central server and establishes a decentralized communication network.
Existing DFL methods still suffer from two major challenges: local inconsistency and local overfitting.
arXiv Detail & Related papers (2023-08-16T11:22:36Z)
- Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z)
- ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the Hessian.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin compared to other adaptive optimization methods.
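ADAHESSIAN's curvature signal is typically described as a Hutchinson-style randomized estimate of the Hessian diagonal, computed from Hessian-vector products with Rademacher probe vectors. The following is a minimal PyTorch sketch of that estimator, as an illustration rather than the reference implementation; the spatial averaging and moving-average machinery ADAHESSIAN adds on top are omitted here.

```python
# Hutchinson-style diagonal-Hessian estimate: diag(H) ~ E[z * (H z)] with
# Rademacher probes z, where H z comes from a second differentiation pass.
import torch

def hessian_diag_estimate(loss, params, n_samples=1):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probes: entries are +1 or -1 with equal probability.
        zs = [(torch.rand_like(p) < 0.5).to(p.dtype) * 2 - 1 for p in params]
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for e, z, hv in zip(est, zs, hvps):
            e += z * hv / n_samples
    return est  # per-parameter curvature proxies, used like Adam's squared gradients

# Usage sketch:
# loss = criterion(model(x), y)
# diag_h = hessian_diag_estimate(loss, list(model.parameters()))
```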
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its contents (including all information) and is not responsible for any consequences.