Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
- URL: http://arxiv.org/abs/2510.02306v1
- Date: Thu, 02 Oct 2025 17:59:41 GMT
- Title: Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
- Authors: Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu
- Abstract summary: We examine whether a draw genuinely means that the two models are equal. We conjecture that draws are more indicative of query difficulty. We recommend that future rating systems reconsider existing draw semantics.
- Score: 17.451562591754698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
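The abstract's core intervention can be sketched concretely: under a standard Elo update, a draw pulls the two ratings toward each other, whereas the paper's variant leaves both ratings untouched on a draw. The sketch below is illustrative only; the function names, the `K`-factor of 32, and the `skip_draws` flag are assumptions for exposition, not details taken from the paper.

```python
# Minimal sketch of the standard Elo update, plus a variant that skips
# rating updates on draws (the behavior the paper evaluates).

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-model probability that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float,
               k: float = 32.0, skip_draws: bool = False) -> tuple:
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw."""
    if skip_draws and outcome == 0.5:
        # Proposed semantics: a draw says nothing about relative skill,
        # so leave both ratings unchanged.
        return r_a, r_b
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome - e_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return r_a_new, r_b_new
```

With standard semantics, a draw between a 1200-rated and a 1000-rated model lowers the higher rating and raises the lower one; with `skip_draws=True`, both ratings stay put, which is the modification the paper reports as improving outcome prediction accuracy.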
Related papers
- Think Twice: Branch-and-Rethink Reasoning Reward Model [32.70732791642558]
We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors.
arXiv Detail & Related papers (2025-10-27T17:58:07Z)
- Evaluating Language Models' Evaluations of Games [65.49017696754825]
We advocate for a new paradigm that assesses AI systems' evaluation of games. We leverage a large-scale dataset of over $100$ novel board games and over 450 human judgments. Our results show that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models.
arXiv Detail & Related papers (2025-10-13T02:45:37Z)
- What-If Analysis of Large Language Models: Explore the Game World Using Proactive Thinking [50.72154186522052]
Large language models (LLMs) excel at processing information reactively but lack the ability to systematically explore hypothetical futures. We propose WiA-LLM, a new paradigm that equips LLMs with proactive thinking capabilities. We validate WiA-LLM in Honor of Kings, a complex multiplayer game environment.
arXiv Detail & Related papers (2025-09-05T04:05:27Z)
- Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z)
- Re-evaluating Open-ended Evaluation of Large Language Models [50.23008729038318]
We show that current Elo-based rating systems can be susceptible to, and even reinforce, biases in data, intentional or accidental. We propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy.
arXiv Detail & Related papers (2025-02-27T15:07:47Z)
- Language Model Preference Evaluation with Multiple Weak Evaluators [89.90733463933431]
We introduce PGED, a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
- Chess Rating Estimation from Moves and Clock Times Using a CNN-LSTM [11.340099493701029]
We propose a method to estimate player ratings directly from game moves and clock times.
Our model architecture comprises a CNN to learn positional features, which are integrated with clock-time data into a Bidirectional LSTM.
This model is the first to use no hand-crafted features to estimate chess ratings and also the first to output a rating prediction after each move.
arXiv Detail & Related papers (2024-09-17T19:19:16Z) - Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z) - Action Quality Assessment using Siamese Network-Based Deep Metric
Learning [7.945673227394573]
The proposed scoring model has been tested for Olympics Diving and Gymnastic vaults.
The model outperforms the existing state-of-the-art scoring models.
arXiv Detail & Related papers (2020-02-27T14:00:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.