DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
- URL: http://arxiv.org/abs/2510.07959v1
- Date: Thu, 09 Oct 2025 08:53:59 GMT
- Title: DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
- Authors: Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh
- Abstract summary: Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. We argue that promoting diversity among samples is not essential; what matters is to select samples that maximise diversity in model responses. Our method, $\textbf{Diversifying Sample Condensation (DISCO)}$, selects the top-k samples with the greatest model disagreements.
- Score: 59.01400190971061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that $\textit{maximise diversity in model responses}$. Our method, $\textbf{Diversifying Sample Condensation (DISCO)}$, selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. $\textbf{DISCO}$ shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
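The selection rule described in the abstract (score each sample on its own, then keep the top-k with the greatest disagreement among models) can be illustrated with a minimal sketch. The disagreement measure used here, the entropy of the label votes cast by a pool of models, is an assumption chosen for illustration and is not necessarily the statistic used by the authors; the function names and the toy data are likewise hypothetical.

```python
import math
from collections import Counter


def vote_entropy(predictions):
    """Shannon entropy (in nats) of the label votes cast on one sample."""
    counts = Counter(predictions)
    total = len(predictions)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_top_k_by_disagreement(model_predictions, k):
    """Return the indices of the k samples with the highest vote entropy.

    model_predictions[m][i] is the label predicted by model m for sample i.
    Each sample is scored independently (no clustering), then ranked.
    """
    n_samples = len(model_predictions[0])
    scores = [
        vote_entropy([preds[i] for preds in model_predictions])
        for i in range(n_samples)
    ]
    # Sort by descending score; ties go to the lower index for determinism.
    ranked = sorted(range(n_samples), key=lambda i: (-scores[i], i))
    return ranked[:k]


# Hypothetical toy data: 4 models, 5 samples, labels in {0, 1, 2, 3}.
toy_predictions = [
    [0, 1, 2, 3, 0],
    [0, 1, 1, 2, 0],
    [0, 2, 1, 1, 0],
    [0, 1, 2, 0, 1],
]
selected = select_top_k_by_disagreement(toy_predictions, k=2)
print(selected)
```

In this toy pool, the sample on which all four models give different answers ranks first, and the sample on which all models agree carries no disagreement signal and is never selected ahead of the others.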
Related papers
- Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation [19.158395403281734]
Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference?
This problem has broad applications in prompt tuning and chain-of-thought reasoning.
We show that the gradient estimation procedure yields approximations of full inference with less than $\mathbf{1\%}$ error across six datasets.
arXiv Detail & Related papers (2025-08-27T15:59:47Z)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
- Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks.
This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
- Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i \rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of $1.3\%$ on 38 evaluation sets for noisy dataset DataComp and $2.5\%$ on VTAB for high-quality dataset CC12M.
arXiv Detail & Related papers (2024-02-03T06:29:04Z)
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves underrepresented groups' accuracy of sampling methods by 3% over four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z)
- PL-$k$NN: A Parameterless Nearest Neighbors Classifier [0.24499092754102875]
The $k$-Nearest Neighbors is one of the most effective and straightforward models employed in numerous problems.
This paper proposes a $k$-Nearest Neighbors classifier that bypasses the need to define the value of $k$.
arXiv Detail & Related papers (2022-09-26T12:52:45Z)
- Common Failure Modes of Subcluster-based Sampling in Dirichlet Process Gaussian Mixture Models -- and a Deep-learning Solution [5.822529963339041]
Dirichlet Process Gaussian Mixture Model (DPGMM) is often used to cluster data when the number of clusters is unknown.
One main DPGMM inference paradigm relies on sampling.
Here we consider a known state-of-the-art sampler, analyze its failure modes, and show how to improve it.
arXiv Detail & Related papers (2022-03-25T14:12:33Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- A Provably Efficient Sample Collection Strategy for Reinforcement Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
We propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., sparse simulator of the environment); 2) An "objective-agnostic" sample collection responsible for generating the prescribed samples as fast as possible.
arXiv Detail & Related papers (2020-07-13T15:17:35Z)