Related papers: Estimating the Error of Large Language Models at Pairwise Text Comparison

URL: http://arxiv.org/abs/2510.22219v1
Date: Sat, 25 Oct 2025 08:39:52 GMT
Title: Estimating the Error of Large Language Models at Pairwise Text Comparison
Authors: Tianyi Li
Abstract summary: Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts.
Score: 3.2650736290032865
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We measure LLMs' output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts. The Copeland counting constructs a ranking over the compared texts from pairwise preferences; the ranking reveals the poor scalability of LLM-based pairwise comparison and helps yield the estimates for LLMs' error rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five types of text input and obtain consistent estimates of LLMs' error. In general, the measured two positional bias terms are similar, close to the uniform error. Considering both the error rates and the robustness to the variation of prompts, Claude obtained the most desirable performance in this experiment. Our model outperforms the biased Bradley-Terry model and the commutativity score in indicating LLMs' error at this task.
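The abstract mentions Copeland counting and two comparisons per text pair with either text placed first. The following is a minimal sketch of those two ideas, not the paper's implementation: `prefers` is a hypothetical judge callback standing in for an LLM call, and `order_disagreement_rate` is a simple consistency check added for illustration, not the error estimator described in the paper.

```python
from collections import defaultdict
from itertools import combinations


def copeland_ranking(items, prefers):
    """Rank items by Copeland score (wins minus losses over all pairs).

    `prefers(a, b)` is a hypothetical judge: it returns True when it
    prefers `a` over `b`, with `a` presented first.
    """
    score = defaultdict(int)
    for a, b in combinations(items, 2):
        # Query both presentation orders for each pair.
        a_wins_when_first = prefers(a, b)
        a_wins_when_second = not prefers(b, a)
        for a_wins in (a_wins_when_first, a_wins_when_second):
            winner, loser = (a, b) if a_wins else (b, a)
            score[winner] += 1
            score[loser] -= 1
    # sorted() is stable, so tied items keep their input order.
    ranking = sorted(items, key=lambda x: -score[x])
    return ranking, dict(score)


def order_disagreement_rate(items, prefers):
    """Fraction of pairs whose verdict flips when the order is swapped."""
    pairs = list(combinations(items, 2))
    # Equal return values mean the two orders picked different winners.
    flips = sum(prefers(a, b) == prefers(b, a) for a, b in pairs)
    return flips / len(pairs)
```

With a consistent judge the disagreement rate is zero and the Copeland ranking recovers the judge's underlying order. A judge that always picks the first text disagrees with itself on every pair and leaves all Copeland scores tied at zero.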

Related papers

LLM-as-Judge on a Budget [35.393598355979385]
We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^{K} \sigma_i^2}{B}}\right)$. Experiments on Summarize-From-Feedback and HelpSteer2 demonstrate that our method significantly outperforms uniform allocation.
arXiv Detail & Related papers (2026-02-17T10:35:41Z)
Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation [89.52571224447111]
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization. We provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization.
arXiv Detail & Related papers (2026-02-07T19:39:28Z)
Measuring Scalar Constructs in Social Science with LLMs [48.92998035333579]
We evaluate approaches to measuring scalar constructs in large language models. We find that pairwise comparisons produce better measurements than simply prompting the LLM to directly output the scores. Finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.
arXiv Detail & Related papers (2025-09-03T08:19:13Z)
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap [39.90675202514829]
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations.
arXiv Detail & Related papers (2025-08-18T10:14:20Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
Bayesian Calibration of Win Rate Estimation with LLM Evaluators [20.588104799661014]
We propose two methods to improve the accuracy of win rate estimation using large language models (LLMs) as evaluators. We empirically validate our methods on six datasets covering story generation, summarization, and instruction following tasks.
arXiv Detail & Related papers (2024-11-07T04:32:40Z)
LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking [17.96316956366718]
Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems. We show that sorting-based methods require consistent comparisons to correctly sort the passages, a requirement that LLMs often violate. We propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list.
arXiv Detail & Related papers (2024-05-31T23:29:42Z)
Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) recently exhibited remarkable reasoning capabilities on solving math problems. This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z)
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion [33.73671362609599]
Our framework consists of two modules: PairRanker and GenFuser. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. GenFuser aims to merge the top-ranked candidates, generating an improved output.
arXiv Detail & Related papers (2023-06-05T03:32:26Z)
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model.
arXiv Detail & Related papers (2022-10-18T22:19:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.