Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection
- URL: http://arxiv.org/abs/2602.08003v1
- Date: Sun, 08 Feb 2026 15:05:22 GMT
- Title: Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection
- Authors: Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu
- Abstract summary: Large language models (LLMs) are often ensembled together to improve overall reliability and robustness, but in practice models are strongly correlated. We formulate budgeted ensemble selection as maximizing the mutual information between the true label and predictions of the selected models. Motivated by these, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data.
- Score: 8.266188814122605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are often ensembled together to improve overall reliability and robustness, but in practice models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.
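As a concrete illustration of the selection objective, here is a minimal sketch of greedy mutual-information selection over discrete predictions. This is not the authors' released code: it assumes a labeled calibration set, estimates I(Y; predictions of the selected set) from empirical joint frequencies, and greedily adds the model with the largest marginal information gain until the query budget is reached. All function and variable names (e.g. `greedy_mi_selection`, `budget`) are illustrative.

```python
import numpy as np

def empirical_mi(labels, pred_matrix):
    """Estimate I(Y; P_1, ..., P_k) from empirical joint frequencies.

    labels: (n,) array of true labels.
    pred_matrix: (n, k) array; column j holds model j's discrete predictions.
    """
    labels = np.asarray(labels)
    pred_matrix = np.asarray(pred_matrix)
    # Treat each row of predictions as one joint symbol.
    joint_keys = [tuple(row) for row in pred_matrix]
    mi = 0.0
    for y in np.unique(labels):
        p_y = np.mean(labels == y)
        for key in set(joint_keys):
            mask = np.array([k == key for k in joint_keys])
            p_key = mask.mean()
            p_joint = np.mean(mask & (labels == y))
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (p_y * p_key))
    return mi  # in nats

def greedy_mi_selection(labels, preds, budget):
    """Greedily pick up to `budget` models maximizing estimated I(Y; selected predictions).

    preds: dict model_name -> (n,) array of that model's discrete predictions
           on a labeled calibration set.
    """
    selected = []
    remaining = list(preds)
    while remaining and len(selected) < budget:
        def gain(m):
            cols = np.column_stack([preds[s] for s in selected + [m]])
            return empirical_mi(labels, cols)
        best = max(remaining, key=gain)  # largest marginal information gain
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under a budget of three, for instance, this criterion can prefer a slightly weaker but weakly correlated model over a near-duplicate of one already selected, which is the behavior the abstract argues for.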
Related papers
- Model Class Selection [2.377712112950261]
We introduce the idea of model class selection (MCS). In MCS, multiple model collections are evaluated, with the goal of identifying all collections that contain at least one optimal model. As a direct consequence, for particular datasets we are able to investigate whether classes of simpler and more interpretable statistical models are able to perform on par with more complex black-box machine learning models.
arXiv Detail & Related papers (2025-11-14T14:43:26Z)
- Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design [20.31388126105889]
We propose M-DESIGN, a curated model knowledge base (MKB) pipeline for mastering neural network refinement. First, we propose a knowledge weaving engine that reframes model refinement as an adaptive query problem over task metadata. Given a user's task query, M-DESIGN quickly matches and iteratively refines candidate models by leveraging a graph-relational knowledge schema.
arXiv Detail & Related papers (2025-07-21T07:49:19Z)
- Causal LLM Routing: End-to-End Regret Minimization from Observational Data [3.3580884064577616]
LLM routing aims to select the most appropriate model for each query. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data.
arXiv Detail & Related papers (2025-05-21T21:34:18Z)
- Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
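As a rough sketch of this general idea of label-free, uncertainty-based ranking (not the paper's exact procedure; the scoring choice and names below are our own assumptions), each model can be scored by the average negative entropy of its softmax outputs on a shared unlabeled set, and models ranked by that score:

```python
import numpy as np

def softmax_uncertainty_score(probs):
    """Average negative entropy of softmax outputs; higher means more confident.

    probs: (n, c) array of per-example class probabilities on unlabeled data.
    """
    eps = 1e-12  # avoid log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return -entropy.mean()

def rank_models(model_probs):
    """Rank models (most confident first) on the same unlabeled set.

    model_probs: dict model_name -> (n, c) probability array.
    """
    scores = {name: softmax_uncertainty_score(p) for name, p in model_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```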
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
- Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild [84.57103623507082]
This paper introduces Model-GLUE, a holistic Large Language Models scaling guideline. We benchmark existing scaling techniques, especially selective merging, and variants of mixture. We then formulate an optimal strategy for the selection and aggregation of a heterogeneous model zoo. Our methodology involves the clustering of mergeable models and optimal merging strategy selection, and the integration of clusters.
arXiv Detail & Related papers (2024-10-07T15:55:55Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Large Language Model Routing with Benchmark Datasets [40.42044096089315]
No single model typically achieves the best accuracy in all tasks and use cases.
We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this selection.
We show that this problem can be reduced to a collection of binary classification tasks.
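A minimal sketch of that reduction, under our own assumptions rather than the paper's implementation: fit one binary classifier per candidate LLM on benchmark queries, predicting whether that model answers correctly, then route a new query to the model whose classifier gives the highest success probability. The feature representation and the scikit-learn classifier below are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_router(features, correct_by_model):
    """Fit one binary classifier per candidate model.

    features: (n, d) query features from benchmark data (e.g., embeddings).
    correct_by_model: dict model_name -> (n,) 0/1 array, 1 if that model answered correctly.
    """
    return {name: LogisticRegression(max_iter=1000).fit(features, y)
            for name, y in correct_by_model.items()}

def route(classifiers, query_features):
    """Send each query to the model with the highest predicted success probability."""
    # Column 1 of predict_proba is P(correct); assumes both outcomes appear in training.
    probs = {name: clf.predict_proba(query_features)[:, 1]
             for name, clf in classifiers.items()}
    names = list(probs)
    stacked = np.column_stack([probs[n] for n in names])
    return [names[i] for i in stacked.argmax(axis=1)]
```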
arXiv Detail & Related papers (2023-09-27T17:08:40Z)
- MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$-$10\times$ faster and tune hyperparameters $20\times$-$75\times$ faster than full-dataset training or tuning without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [47.432215933099016]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)