LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
- URL: http://arxiv.org/abs/2512.21010v1
- Date: Wed, 24 Dec 2025 07:14:31 GMT
- Title: LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
- Authors: Jiashuo Liu, Jiayun Wu, Chunjie Wu, Jingkai Liu, Zaiyuan Wang, Huan Zhou, Wenhao Huang, Hongseok Namkoong
- Abstract summary: Large Language Models (LLMs) and diverse specialized benchmarks require a shift from fragmented, task-specific metrics to a holistic, competitive ranking system. We introduce the novel Competitive Swiss-System Dynamics (CSD) framework, which simulates a sequential contest. CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid proliferation of Large Language Models (LLMs) and diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that effectively aggregates performance across multiple ability dimensions. Current evaluation methods, which rely primarily on static scoring, are fundamentally limited: they struggle to determine the proper mix ratio across diverse benchmarks and, critically, fail to capture a model's dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest in which models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss record, and Monte Carlo simulation ($N=100,000$ iterations) is used to approximate the statistically robust Expected Win Score ($E[S_m]$), eliminating the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity ($T_k$), which allows us to profile models by their risk appetite, distinguishing between robust generalists and aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.
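The contest described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the pairing rule (sort by accumulated score, pair adjacent models), the per-benchmark win weights in `strengths`, and the function name `simulate_swiss` are all assumptions introduced here for clarity.

```python
import random
from collections import defaultdict

def simulate_swiss(strengths, num_rounds, num_iters=10_000, seed=0):
    """Monte Carlo estimate of each model's expected win score E[S_m]
    under a Swiss-style contest.

    `strengths` maps model name -> list of per-round (per-benchmark)
    win weights.  Each round, models are sorted by accumulated score
    (ties broken randomly) and adjacent models are paired; the winner
    of a pairing is drawn with probability proportional to the two
    models' weights on that round's benchmark.
    """
    rng = random.Random(seed)
    models = list(strengths)
    totals = defaultdict(float)
    for _ in range(num_iters):
        scores = {m: 0 for m in models}
        for rnd in range(num_rounds):
            # Swiss pairing: rank by current score, pair neighbours.
            order = sorted(models, key=lambda m: (-scores[m], rng.random()))
            for a, b in zip(order[::2], order[1::2]):
                wa, wb = strengths[a][rnd], strengths[b][rnd]
                winner = a if rng.random() < wa / (wa + wb) else b
                scores[winner] += 1
        for m in models:
            totals[m] += scores[m]
    # Average over iterations approximates E[S_m].
    return {m: totals[m] / num_iters for m in models}
```

Averaging over many simulated tournaments is what removes the pairing and early-round luck that a single bracket would bake into the ranking; the per-round elimination parameter $T_k$ from the paper is omitted here for brevity.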
Related papers
- Discovering Process-Outcome Credit in Multi-Step LLM Reasoning [3.584086358722852]
Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs). We propose a novel framework designed to provide continuous reward signals. Our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.
arXiv Detail & Related papers (2026-02-01T05:44:09Z) - ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs [37.23311145049677]
We present ReLE, a scalable system designed to diagnose Capability Anisotropy. We evaluate 304 models across a Domain $\times$ Capability symbolic matrix comprising 207,843 samples.
arXiv Detail & Related papers (2026-01-24T09:57:59Z) - Stable and Efficient Single-Rollout RL for Multimodal Reasoning [66.53652874617217]
MSSR (Multimodal Stabilized Single-Rollout) is a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps.
arXiv Detail & Related papers (2025-12-20T05:07:53Z) - SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z) - Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models [57.33350664910483]
We introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios.
arXiv Detail & Related papers (2025-11-12T06:06:29Z) - Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models [72.52332895840279]
GenCluster is a test-time compute framework that attains IOI gold-level performance using open-weight models. We show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model.
arXiv Detail & Related papers (2025-10-16T02:19:25Z) - Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment [1.8552770604791606]
We propose a hybrid reward modeling framework that integrates complementary reward paradigms. We show consistent improvements across different multimodal benchmarks when applying hybrid and multi-aspect reward modeling. Our best performing model in the 3B family achieves an overall average improvement of 9.5% across general and math reasoning tasks.
arXiv Detail & Related papers (2025-10-06T18:53:23Z) - Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis [17.98292973608615]
We propose a novel Uncertainty-Aware Collaborative System (U-ACS) that orchestrates a powerful MLLM and a lightweight baseline model for multimodal sentiment analysis. Our proposed method achieves state-of-the-art performance, while requiring only a fraction of the computational resources compared to using a standalone MLLM.
arXiv Detail & Related papers (2025-08-27T16:01:58Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP). On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $\sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics. We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z) - TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models [5.6525926183880255]
We introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task. In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains.
arXiv Detail & Related papers (2025-06-02T05:47:50Z) - Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework [19.558523263211942]
In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we revisit mixture models for generating multimodal agent behaviors, which can cover the mainstream methods. We introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts.
arXiv Detail & Related papers (2025-01-28T15:26:25Z) - Ensemble Methods for Sequence Classification with Hidden Markov Models [8.241486511994202]
We present a lightweight approach to sequence classification using Ensemble Methods for Hidden Markov Models (HMMs)
HMMs offer significant advantages in scenarios with imbalanced or smaller datasets due to their simplicity, interpretability, and efficiency.
Our ensemble-based scoring method enables the comparison of sequences of any length and improves performance on imbalanced datasets.
arXiv Detail & Related papers (2024-09-11T20:59:32Z) - Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.