Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores
- URL: http://arxiv.org/abs/2510.14966v1
- Date: Thu, 16 Oct 2025 17:59:25 GMT
- Title: Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores
- Authors: Zachary Robertson,
- Abstract summary: We show that averaging TVD-MI's binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. We derive this clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation.
- Score: 3.959606869996232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pairwise comparisons of large language models using total variation distance mutual information (TVD-MI) produce binary critic decisions per pair. We show that averaging TVD-MI's binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. Maximum-likelihood approaches to IRT use logistic links, but we find empirically that these transformations introduce curvature that breaks additivity: across three domains, the identity link yields median curl on raw data of 0.080-0.150 (P95 = [0.474, 0.580]), whereas probit/logit introduce substantially higher violations (median [0.245, 0.588], P95 [0.825, 2.252]). We derive a clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation. At 33% coverage, we achieve holdout RMSE $0.117 \pm 0.008$ while preserving agent rankings (Spearman $\rho = 0.972 \pm 0.015$), using three times fewer evaluations than full dense evaluation. Judge robustness analysis (GPT-4o-mini vs. Llama3-70b) shows strong agreement in agent rankings ($\rho = 0.872$) and a consistent identity-link advantage. Identity mapping best preserves TVD-MI's geometry for efficient LLM evaluation, and the approach applies to other bounded-response domains.
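As a concrete illustration of the fitting step the abstract describes, below is a minimal sketch (on synthetic data, not the authors' implementation) of an identity-link additive fit to centered TVD-MI scores via box-constrained least squares. Bounding the parameters themselves in [-0.5, 0.5] and mean-centering the abilities for identifiability are assumptions of this sketch, not details taken from the paper.

```python
# Minimal sketch: identity-link additive fit score[a, q] ~ theta[a] + beta[q]
# to centered TVD-MI scores in [-0.5, 0.5], via box-constrained least squares.
import numpy as np
from scipy.optimize import lsq_linear
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_agents, n_items = 8, 20

# Stand-in for averaged TVD-MI binary trials: centered probabilities in [-0.5, 0.5].
theta_true = rng.uniform(-0.3, 0.3, n_agents)   # agent "abilities"
beta_true = rng.uniform(-0.2, 0.2, n_items)     # item "difficulties"
scores = np.clip(theta_true[:, None] + beta_true[None, :]
                 + 0.05 * rng.standard_normal((n_agents, n_items)), -0.5, 0.5)

# Design matrix for the additive model: one indicator column per agent and per item.
A = np.zeros((n_agents * n_items, n_agents + n_items))
y = np.zeros(n_agents * n_items)
for a in range(n_agents):
    for q in range(n_items):
        r = a * n_items + q
        A[r, a] = 1.0
        A[r, n_agents + q] = 1.0
        y[r] = scores[a, q]

# Box constraints keep parameters inside the centered-probability range; bounding
# the parameters (rather than the fitted values) is an assumption of this sketch.
fit = lsq_linear(A, y, bounds=(-0.5, 0.5))
theta_hat = fit.x[:n_agents] - fit.x[:n_agents].mean()  # center for identifiability

rho, _ = spearmanr(theta_hat, theta_true)
print(f"recovered-ranking Spearman rho: {rho:.3f}")
```

Because the identity link keeps the fit linear in the parameters, pairwise score differences stay path-independent up to noise, which is consistent with the abstract's observation that probit/logit links introduce curvature and larger "curl" violations.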
Related papers
- Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
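A toy simulation (not the paper's martingale framework) shows what the quoted behavior looks like: a constant per-step bias gives linear growth of the cumulative distortion, while zero-mean bounded noise yields deviations around the mean on the order of $\sqrt{T}$. The bias and noise scale below are hypothetical.

```python
# Toy illustration of linear drift with sqrt(T)-scale deviations.
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 1000, 200
bias, noise_scale = 0.05, 1.0                 # hypothetical per-step distortion terms

steps = bias + noise_scale * rng.uniform(-1, 1, size=(n_runs, T))
distortion = steps.cumsum(axis=1)             # cumulative distortion per run

mean_path = distortion.mean(axis=0)           # grows roughly like bias * t (linear)
spread = distortion.std(axis=0)               # grows roughly like noise_scale * sqrt(t/3)

print(mean_path[-1] / (bias * T))                      # ~1: linear growth
print(spread[-1] / (noise_scale * np.sqrt(T / 3)))     # ~1: sqrt(T)-scale deviations
```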
arXiv Detail & Related papers (2026-02-10T21:08:53Z)
- Almost Asymptotically Optimal Active Clustering Through Pairwise Observations [59.20614082241528]
We propose a new analysis framework for clustering $M$ items into an unknown number $K$ of distinct groups using noisy and actively collected responses. We establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the accuracy of the clustering. We develop a computationally feasible variant of the Generalized Likelihood Ratio statistic and show that its performance gap to the lower bound can be accurately estimated empirically.
arXiv Detail & Related papers (2026-02-05T14:16:47Z)
- Spectral Sentinel: Scalable Byzantine-Robust Decentralized Federated Learning via Sketched Random Matrix Theory on Blockchain [0.0]
Byzantine clients poison gradients under heterogeneous (Non-IID) data. We propose Spectral Sentinel, a Byzantine detection and aggregation framework. We implement the full system with blockchain integration on Polygon networks.
arXiv Detail & Related papers (2025-12-14T09:43:03Z)
- Temporal Zoom Networks: Distance Regression and Continuous Depth for Efficient Action Localization [6.908972852063454]
Temporal action localization requires both precise boundary detection and computational efficiency. We address this through two complementary innovations: Boundary Distance Regression (BDR) and Adaptive Temporal Refinement (ATR). On THUMOS14, our method achieves 56.5% mAP@0.7 with 151G FLOPs, using 36% fewer FLOPs than ActionFormer++ (55.7% mAP@0.7 at 235G).
arXiv Detail & Related papers (2025-11-06T00:41:54Z)
- Real-time nonlinear inversion of magnetic resonance elastography with operator learning [0.06797079068199119]
The oNLI framework enables real-time MRE inversion (a 30,000x speedup), producing elastograms with spatial accuracy comparable to NLI. A structural prior mechanism, analogous to Soft Prior Regularization in the MRE literature, was incorporated to improve spatial accuracy.
arXiv Detail & Related papers (2025-10-03T08:55:40Z)
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
- Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems [3.215065407261898]
Multi-agent systems that combine large language models with external tools are rapidly transitioning from research laboratories into high-stakes domains. This "Advanced" sequel fills that gap by providing an algorithmic instantiation and empirical evidence. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9%.
arXiv Detail & Related papers (2025-08-28T15:52:49Z)
- ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction [25.85736569130897]
Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks. We show that non-transitive preferences stem largely from low-quality data containing inherently ambiguous preference pairs. We propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs.
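For intuition, here is a minimal sketch of the kind of check a tournament-graph view enables on hypothetical preference data: counting non-transitive 3-cycles among pairwise judgments. This illustrates non-transitivity detection only, not the ELSPR reconstruction algorithm.

```python
# Count non-transitive triples in a pairwise-preference tournament (toy data).
from itertools import permutations

models = ["m1", "m2", "m3", "m4"]
# beats[(a, b)] = True means model a was preferred over model b (hypothetical).
beats = {("m1", "m2"): True, ("m2", "m3"): True, ("m3", "m1"): True,  # a 3-cycle
         ("m1", "m4"): True, ("m2", "m4"): True, ("m3", "m4"): True}

def wins(a, b):
    return beats[(a, b)] if (a, b) in beats else not beats[(b, a)]

cycles = [
    (a, b, c)
    for a, b, c in permutations(models, 3)
    if wins(a, b) and wins(b, c) and wins(c, a)
]
# Each non-transitive triple appears in 3 rotations; deduplicate by sorting.
unique = {tuple(sorted(t)) for t in cycles}
print(f"non-transitive triples: {len(unique)}")  # -> 1, namely (m1, m2, m3)
```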
arXiv Detail & Related papers (2025-05-23T10:00:03Z)
- Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning [15.776175440446414]
We introduce Dr.SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation. Dr.SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We preference-tune Llama-3-8B-Instruct using data annotated by Dr.SoW.
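The reward itself is stated plainly enough to sketch: score a response by the log-density ratio of a stronger over a weaker causal LM. In the sketch below, the model names are placeholders and reusing one tokenizer for both models is an assumption, not something the summary specifies.

```python
# Hedged sketch of a log-density-ratio reward: log p_strong(y|x) - log p_weak(y|x).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tok, prompt, response):
    """Sum of token log-probabilities of `response` given `prompt` (seam is approximate)."""
    enc = tok(prompt + response, return_tensors="pt")
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**enc).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = enc.input_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.numel()), targets]
    return token_lp[prompt_len - 1:].sum().item()  # only the response tokens

# Placeholder model names; the method pairs a better-aligned ("strong") model
# with a less-aligned ("weak") one.
strong = AutoModelForCausalLM.from_pretrained("strong-model-name")
weak = AutoModelForCausalLM.from_pretrained("weak-model-name")
tok = AutoTokenizer.from_pretrained("strong-model-name")

def reward(prompt, response):
    return (sequence_logprob(strong, tok, prompt, response)
            - sequence_logprob(weak, tok, prompt, response))
```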
arXiv Detail & Related papers (2024-11-04T18:54:39Z)
- Foundation Models for Structural Health Monitoring [14.36493796970864]
We propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for Structural Health Monitoring. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training. We showcase the effectiveness of our foundation models using data from three operational viaducts.
arXiv Detail & Related papers (2024-04-03T13:32:44Z)
- Federated Learning Resilient to Byzantine Attacks and Data Heterogeneity [59.17297282373628]
This paper addresses federated learning (FL) in the presence of Byzantine attacks and data heterogeneity. We introduce a novel Robust Average Gradient Algorithm (RAGA), which uses median-based aggregation, with convergence analysis covering both strongly convex and non-convex loss functions.
arXiv Detail & Related papers (2024-03-20T08:15:08Z)
- DFedADMM: Dual Constraints Controlled Model Inconsistency for Decentralized Federated Learning [52.83811558753284]
Decentralized federated learning (DFL) discards the central server and establishes a decentralized communication network.
Existing DFL methods still suffer from two major challenges: local inconsistency and local overfitting.
arXiv Detail & Related papers (2023-08-16T11:22:36Z)
- Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z)
- ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the Hessian.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin compared to other adaptive optimization methods.
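ADAHESSIAN's curvature signal is typically described as a Hutchinson-style randomized estimate of the Hessian diagonal, computed from Hessian-vector products with Rademacher probe vectors. The following is a minimal PyTorch sketch of that estimator, as an illustration rather than the reference implementation; the spatial averaging and moving-average machinery ADAHESSIAN adds on top are omitted here.

```python
# Hutchinson-style diagonal-Hessian estimate: diag(H) ~ E[z * (H z)] with
# Rademacher probes z, where H z comes from a second differentiation pass.
import torch

def hessian_diag_estimate(loss, params, n_samples=1):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probes: entries are +1 or -1 with equal probability.
        zs = [(torch.rand_like(p) < 0.5).to(p.dtype) * 2 - 1 for p in params]
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for e, z, hv in zip(est, zs, hvps):
            e += z * hv / n_samples
    return est  # per-parameter curvature proxies, used like Adam's squared gradients

# Usage sketch:
# loss = criterion(model(x), y)
# diag_h = hessian_diag_estimate(loss, list(model.parameters()))
```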
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its contents (including all information) and is not responsible for any consequences.