Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
- URL: http://arxiv.org/abs/2512.17630v1
- Date: Fri, 19 Dec 2025 14:33:14 GMT
- Title: Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
- Authors: Menna Elgabry, Ali Hamdi,
- Abstract summary: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection.<n>Our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA.<n>Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent.
- Score: 0.2864713389096699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
Related papers
- Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models [0.0]
Large language models (LLMs) achieve strong aver- age performance yet remain unreliable at the instance level.<n>We introduce a Multi-Model Consensus Reasoning Engine that treats the set of LLM outputs as input to a supervised meta-learner.<n>The system maps natural language responses into structured features using semantic embeddings, pairwise similarity and clustering statistics, lexical and structural cues, reasoning-quality scores, confidence estimates, and model-specific priors.
arXiv Detail & Related papers (2026-01-12T06:27:06Z) - Think Then Embed: Generative Context Improves Multimodal Embedding [51.76690812535934]
We propose a Think-Then-Embed (TTE) framework for Universal Multimodal Embeddings (UME), composed of a reasoner and an embedder.<n>By leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets.
arXiv Detail & Related papers (2025-10-06T16:53:56Z) - Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems [55.6590601898194]
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge.<n>Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model.<n>We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score.
arXiv Detail & Related papers (2025-09-30T01:25:19Z) - LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across various tasks.<n>We propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations.
arXiv Detail & Related papers (2025-07-31T00:35:45Z) - Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs [56.76586846269894]
Multimodal Large Language Models (MLLMs) have achieved success across various domains.<n>Despite its importance, the study of knowledge sharing among domain-specific MLLMs remains largely underexplored.<n>We propose a unified parameter integration framework that enables modular composition of expert capabilities.
arXiv Detail & Related papers (2025-06-30T15:07:41Z) - Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models [67.62810111789338]
Large Language Models exhibit a confidence distortion problem on multi-choice question-answering.<n>We propose Self-ensemble to solve this problem.<n> Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem.
arXiv Detail & Related papers (2025-06-02T17:59:29Z) - DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance [11.753349115726952]
We propose a novel ensemble method - Diverse Fingerprint Ensemble (DFPE)<n>Our approach involves: (1) clustering models based on response "fingerprints" patterns, (2) applying a quantile-based filtering mechanism, and (3) assigning adaptive weights to remaining models.<n>In experiments on the Massive Multitask Language Understanding (MMLU) benchmark, DFPE outperforms the best single model by 3% overall accuracy and 5% in discipline-level accuracy.
arXiv Detail & Related papers (2025-01-29T08:44:45Z) - Rational Tuning of LLM Cascades via Probabilistic Modeling [0.9208007322096532]
We present a probabilistic model for the joint performance distribution of a sequence of large language models (LLMs)<n>Compared to selecting confidence thresholds using Bayesian optimization, our Markov parametric-copula model yields more favorable error-cost trade-offs.<n>Our framework's inductive assumptions about the interactions between the error rates of different LLMs enhance sample efficiency.
arXiv Detail & Related papers (2025-01-16T07:58:33Z) - MM-R$^3$: On (In-)Consistency of Vision-Language Models (VLMs) [26.475993408532304]
We analyze performance of SoTA Vision Language Models on three tasks: Question Rephrasing, Image Restyling, and Context Reasoning.<n>Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.<n>We propose a simple yet effective mitigation strategy in the form of an adapter module trained to minimize inconsistency across prompts.
arXiv Detail & Related papers (2024-10-07T06:36:55Z) - Aligner: One Global Token is Worth Millions of Parameters When Aligning
Large Language Models [72.26732961610557]
We introduce Aligner, a novel.
Efficient Fine-Tuning (PEFT) method for aligning multi-billion- parameter-sized Large Language Models (LLMs)
We show that Aligner can still perform comparably well to state-of-the-art LLM adaptation methods like LoRA that require millions of parameters.
arXiv Detail & Related papers (2023-12-09T08:25:55Z) - SCALE: Synergized Collaboration of Asymmetric Language Translation
Engines [105.8983433641208]
We introduce a collaborative framework that connects compact Specialized Translation Models (STMs) and general-purpose Large Language Models (LLMs) as one unified translation engine.
By introducing translation from STM into the triplet in-context demonstrations, SCALE unlocks refinement and pivoting ability of LLM.
Our experiments show that SCALE significantly outperforms both few-shot LLMs (GPT-4) and specialized models (NLLB) in challenging low-resource settings.
arXiv Detail & Related papers (2023-09-29T08:46:38Z) - Generative Multimodal Entity Linking [24.322540112710918]
Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to referent entities from a knowledge base.
Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters.
We propose GEMEL, a Generative Multimodal Entity Linking framework based on Large Language Models (LLMs)
Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution.
arXiv Detail & Related papers (2023-06-22T07:57:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.