SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable
- URL: http://arxiv.org/abs/2503.10881v1
- Date: Thu, 13 Mar 2025 20:54:28 GMT
- Title: SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable
- Authors: Jiaxin Zhang, Zhuohang Li, Wendi Cui, Kamalika Das, Bradley Malin, Sricharan Kumar
- Abstract summary: Large language models (LLMs) have demonstrated remarkable performance, yet their diverse strengths and weaknesses prevent any single LLM from achieving dominance across all tasks. This work introduces Scalable Consistency Ensemble (SCE), an efficient framework for ensembling LLMs by prompting consistent outputs.
- Score: 4.953092503184905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable performance, yet their diverse strengths and weaknesses prevent any single LLM from achieving dominance across all tasks. Ensembling multiple LLMs is a promising approach to generating reliable responses, but conventional ensembling frameworks suffer from high computational overhead. This work introduces Scalable Consistency Ensemble (SCE), an efficient framework for ensembling LLMs by prompting consistent outputs. The SCE framework systematically evaluates and integrates outputs to produce a cohesive result through two core components: SCE-CHECK, a mechanism that gauges the consistency between response pairs via semantic equivalence; and SCE-FUSION, which adeptly merges the highest-ranked consistent responses from SCE-CHECK to optimize collective strengths and mitigate potential weaknesses. To improve scalability with multiple inference queries, we further propose "You Only Prompt Once" (YOPO), a novel technique that reduces the inference complexity of pairwise comparison from quadratic to constant time. We perform extensive empirical evaluations on diverse benchmark datasets to demonstrate SCE's effectiveness. Notably, the SCE-CHECK component outperforms conventional baselines with enhanced performance and a significant reduction in computational overhead.
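A minimal sketch of how the two components could compose, assuming hypothetical `semantically_equivalent` and `fuse` helpers (stand-ins for SCE-CHECK and SCE-FUSION, not the paper's API):

```python
from itertools import combinations

def sce(responses, semantically_equivalent, fuse):
    """responses: list of candidate answers from different LLMs."""
    # SCE-CHECK stand-in: score each response by how many peers agree with it.
    # Checking every pair naively costs O(n^2) comparisons; per the abstract,
    # YOPO reduces this to constant time.
    scores = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if semantically_equivalent(responses[i], responses[j]):
            scores[i] += 1
            scores[j] += 1
    # Keep the most mutually consistent responses...
    best = max(scores)
    top = [r for s, r in zip(scores, responses) if s == best]
    # ...and merge them into one cohesive answer (SCE-FUSION stand-in).
    return fuse(top)
```

The interface is deliberately generic: any semantic-equivalence checker (e.g., an NLI-style scorer) and any fusion strategy (e.g., an LLM prompted to merge the consistent answers) can be plugged in.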
Related papers
- Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing [35.686125031177234]
Multi-Document Summarization (MDS) is a challenging task that focuses on extracting and synthesizing useful information from multiple lengthy documents.
We propose a novel framework that leverages inference-time scaling for this task.
We also introduce two new evaluation metrics: Consistency-Aware Preference (CAP) score and LLM Atom-Content-Unit (ACU) score.
arXiv Detail & Related papers (2025-02-27T23:34:47Z)
- Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the small language model (SLM) produces low-confidence responses. We conduct a comprehensive investigation into the benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs across 1500+ settings.
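A toy sketch of this routing rule; all callables here are placeholders, not the paper's API:

```python
def route(query, slm, llm, confidence, threshold=0.8):
    """Answer with the on-device SLM unless its confidence is too low."""
    draft = slm(query)
    if confidence(query, draft) >= threshold:
        return draft      # confident: keep the cheap on-device answer
    return llm(query)     # uncertain: offload to the stronger LLM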
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
- LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient [19.673388630963807]
We propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria.
Under this framework, we analyze the advantages and weaknesses of directly prompting large language models (LLMs) as generic benchmark generators.
We then introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker.
Experiments confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics.
arXiv Detail & Related papers (2025-02-02T06:36:01Z)
- Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets.
This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z)
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
It has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
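A highly simplified sketch of the pairwise-guided refinement loop (the full method wraps this in MCTS, omitted here); `llm_refine` and `prefer_first` are hypothetical helpers:

```python
def self_refine(problem, draft, llm_refine, prefer_first, steps=4):
    best = draft
    for _ in range(steps):
        candidate = llm_refine(problem, best)  # propose a refined solution
        if not prefer_first(best, candidate):  # pairwise preference judgment
            best = candidate
    return best
```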
arXiv Detail & Related papers (2024-10-03T18:12:29Z)
- Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling [23.447466392929712]
Large language models (LLMs) exhibit varying strengths and weaknesses across different tasks.
Existing LLM ensembling methods often overlook model compatibility and struggle with inefficient alignment of probabilities.
We introduce Union Top-$k$ Ensembling (UniTE), a novel approach that efficiently combines models by focusing on the union of the top-$k$ tokens from each model.
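A toy sketch in the spirit of top-$k$ union ensembling: merge two models' next-token distributions over the union of their individual top-$k$ tokens. A shared vocabulary is assumed; the real method also addresses token alignment across models:

```python
def unite_step(probs_a, probs_b, k=10):
    """probs_a, probs_b: dicts mapping token -> probability from each model."""
    topk = lambda p: set(sorted(p, key=p.get, reverse=True)[:k])
    union = topk(probs_a) | topk(probs_b)  # union of the two top-k sets
    merged = {t: probs_a.get(t, 0.0) + probs_b.get(t, 0.0) for t in union}
    return max(merged, key=merged.get)     # greedy next-token choice
```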
arXiv Detail & Related papers (2024-10-03T08:42:38Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while obtaining over a 30% increase in inference speed.
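A conceptual sketch only (not the paper's algorithm): slice a dense FFN's hidden units into disjoint "experts" and compute just the routed slice; FactorLLM learns the router and derives the experts from the trained FFN:

```python
import numpy as np

def factorized_ffn(x, W1, W2, n_experts=4):
    """x: (d,); W1: (d, h); W2: (h, d); h divisible by n_experts."""
    width = W1.shape[1] // n_experts
    e = hash(x.tobytes()) % n_experts            # stand-in router
    lo, hi = e * width, (e + 1) * width
    hidden = np.maximum(x @ W1[:, lo:hi], 0.0)   # compute only the routed slice
    return hidden @ W2[lo:hi]                    # (d,) expert output
```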
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables. Despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in their evaluation protocols. This study investigates the performance, efficiency, and cross-task transferability of various LERSs via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z)
- SSLCL: An Efficient Model-Agnostic Supervised Contrastive Learning Framework for Emotion Recognition in Conversations [20.856739541819056]
Emotion recognition in conversations (ERC) is a rapidly evolving task within the natural language processing community.
We propose an efficient and model-agnostic SCL framework named Supervised Sample-Label Contrastive Learning with Soft-HGR Maximal Correlation (SSLCL).
We introduce a novel perspective on utilizing label representations by projecting discrete labels into dense embeddings through a shallow multilayer perceptron.
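A minimal sketch of the label-projection idea: a discrete label is mapped to a dense embedding by a shallow MLP, so utterance and label representations can be compared contrastively. Sizes and weights here are illustrative (learned in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
N_LABELS, HIDDEN, DIM = 6, 128, 64            # e.g. six emotion classes
W1 = rng.standard_normal((N_LABELS, HIDDEN))  # learned in practice; random here
W2 = rng.standard_normal((HIDDEN, DIM))

def label_embedding(label_id):
    """Project a discrete label into a dense embedding via a shallow MLP."""
    onehot = np.eye(N_LABELS)[label_id]       # discrete label as a one-hot vector
    return np.tanh(onehot @ W1) @ W2          # (DIM,) dense label representation
```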
arXiv Detail & Related papers (2023-10-25T14:41:14Z)
- Modeling the Q-Diversity in a Min-max Play Game for Robust Optimization [61.39201891894024]
Group distributionally robust optimization (group DRO) can minimize the worst-case loss over pre-defined groups.
We reformulate the group DRO framework by proposing Q-Diversity.
Characterized by an interactive training mode, Q-Diversity relaxes group identification from annotation to direct parameterization.
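A toy illustration of the group-DRO objective this entry builds on: minimize the worst (maximum) per-group average loss. Q-Diversity's contribution is learning the group assignment itself rather than taking it from annotations:

```python
from collections import defaultdict

def worst_group_loss(losses, groups):
    """losses: per-example losses; groups: matching group id per example."""
    by_group = defaultdict(list)
    for loss, g in zip(losses, groups):
        by_group[g].append(loss)
    # Group DRO minimizes this maximum of the per-group average losses.
    return max(sum(v) / len(v) for v in by_group.values())
```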
arXiv Detail & Related papers (2023-05-20T07:02:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.