Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation
- URL: http://arxiv.org/abs/2601.09165v1
- Date: Wed, 14 Jan 2026 05:10:36 GMT
- Title: Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation
- Authors: Aaron R. Flouro, Shawn P. Chadwick
- Abstract summary: We develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building on the probability-domain distillation framework of Sparse-KD, we develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators, encompassing convexity, positivity, continuity, weight monotonicity, and temperature coherence. We prove the existence and non-uniqueness of operator families satisfying these axioms, establishing that multiple distinct aggregation mechanisms conform to the same foundational principles. Within this framework, we establish operator-agnostic guarantees showing that multi-teacher aggregation reduces both stochastic variance and systematic supervisory bias under heterogeneous teachers, while providing Jensen-type bounds, log-loss guarantees, and safety attenuation properties. For aggregation operators linear in teacher weights, we further establish classical ensemble variance-reduction results under standard independence assumptions, with extensions to correlated-error regimes. The framework provides theoretical grounding for multi-teacher distillation from diverse frontier models while admitting multiple valid implementation strategies.
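For context, here is a minimal, self-contained sketch (not taken from the paper) of one aggregation operator that plausibly fits the stated axioms: a convex combination of temperature-softened teacher distributions, which is linear in the teacher weights. The helper names (`soften`, `aggregate`) and the toy noise model are illustrative assumptions; the numerical check at the end only illustrates the classical variance-reduction effect under approximately independent teacher errors, not the paper's operator-agnostic guarantees.

```python
# Illustrative sketch only: one plausible probability-domain aggregation operator
# (convex combination of temperature-softened teacher distributions), plus a toy
# check of classical ensemble variance reduction. Not the authors' prescribed formula.
import numpy as np

rng = np.random.default_rng(0)

def soften(p, T):
    """Probability-domain temperature scaling: p_i^(1/T), renormalized (assumed form)."""
    q = np.power(p, 1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)

def aggregate(teacher_probs, weights, T=1.0):
    """Convex combination of softened teacher distributions.

    teacher_probs: (M, C) array, one probability row per teacher.
    weights: (M,) nonnegative weights summing to 1.
    By construction the output is a probability vector whenever the inputs are,
    matching convexity/positivity-style requirements; it is linear in the weights.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    softened = soften(np.asarray(teacher_probs, dtype=float), T)
    return weights @ softened

# Toy variance-reduction check under (approximately) independent teacher noise.
C, M, N = 10, 5, 2000                              # classes, teachers, trials
target = rng.dirichlet(np.ones(C))                 # "true" distribution

single_err, ens_err = [], []
for _ in range(N):
    # Each teacher = target plus independent noise, renormalized to a distribution.
    noisy = np.abs(target + 0.05 * rng.standard_normal((M, C)))
    noisy /= noisy.sum(axis=1, keepdims=True)
    ens = aggregate(noisy, np.full(M, 1.0 / M))    # uniform weights
    single_err.append(np.sum((noisy[0] - target) ** 2))
    ens_err.append(np.sum((ens - target) ** 2))

print(f"mean squared error, single teacher:    {np.mean(single_err):.5f}")
print(f"mean squared error, 5-teacher ensemble: {np.mean(ens_err):.5f}")
```

On this toy noise model, the ensemble error is markedly lower than the single-teacher error, consistent with the standard variance-reduction argument for weight-linear aggregation under independence; the paper's contribution is to state such guarantees at the level of axioms rather than for this particular operator.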
Related papers
- Adaptive Weighting in Knowledge Distillation: An Axiomatic Framework for Multi-Scale Teacher Ensemble Optimization [0.0]
This paper develops an operator-agnostic framework for adaptive weighting in knowledge distillation across three complementary scales: token, task, and context. We establish existence and non-uniqueness of conforming operators, characterize convergence of gradient-based optimization under standard assumptions, analyze stability and robustness, and provide an abstract formulation of safety-constrained distillation.
arXiv Detail & Related papers (2026-01-25T17:09:50Z) - Recursive Meta-Distillation: An Axiomatic Framework for Iterative Knowledge Refinement [0.0]
We introduce an axiomatic and operator-theoretic framework for iterative knowledge distillation as a sequence of probability-distribution operators with explicit anchoring to base teachers. Results provide a theoretical basis for understanding stability, bias-variance behavior, and failure modes in iterative and multi-teacher distillation under capacity constraints.
arXiv Detail & Related papers (2026-01-19T14:39:40Z) - Sparse Knowledge Distillation: A Mathematical Framework for Probability-Domain Temperature Scaling and Multi-Stage Compression [0.0]
We develop a unified theoretical framework for sparse knowledge distillation based on probability-domain softening operators. We introduce an axiomatic definition of probability-domain softening operators based on ranking preservation, continuity, entropy monotonicity, identity, and boundary behavior. Results provide theoretical grounding for black-box teacher distillation, partial-access settings such as top-$k$ truncation and text-only outputs, and privacy-equivalent model compression.
arXiv Detail & Related papers (2026-01-06T17:17:24Z) - Random-Matrix-Induced Simplicity Bias in Over-parameterized Variational Quantum Circuits [72.0643009153473]
We show that expressive variational ansätze enter a Haar-like universality class in which both observable expectation values and parameter gradients concentrate exponentially with system size. As a consequence, the hypothesis class induced by such circuits collapses with high probability to a narrow family of near-constant functions. We further show that this collapse is not unavoidable: tensor-structured VQCs, including tensor-network-based and tensor-hypernetwork parameterizations, lie outside the Haar-like universality class.
arXiv Detail & Related papers (2026-01-05T08:04:33Z) - A General Weighting Theory for Ensemble Learning: Beyond Variance Reduction via Spectral and Geometric Structure [0.0]
This paper develops a general weighting theory for ensemble learning. We formalize ensembles as linear operators acting on a hypothesis space. We show how non-uniform, structured weights can outperform uniform averaging.
arXiv Detail & Related papers (2025-12-25T08:51:01Z) - Structured Basis Function Networks: Loss-Centric Multi-Hypothesis Ensembles with Controllable Diversity [46.60221265861393]
Existing approaches to predictive uncertainty rely on multi-hypothesis prediction, which promotes diversity but lacks principled aggregation. The Structured Basis Function Network addresses this gap by linking multi-hypothesis prediction and ensembling through centroidal aggregation induced by Bregman divergences. A tunable diversity mechanism provides parametric control of the bias-variance-diversity trade-off, connecting multi-hypothesis generalisation with loss-aware ensemble aggregation.
arXiv Detail & Related papers (2025-09-02T19:53:43Z) - Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention [28.17124843417577]
Mixture of Experts (MoE) models are well known for effectively scaling model capacity while keeping computational overheads under control. We establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. We propose a novel active-attention mechanism in which a non-linear activation function is applied to the value matrix in the self-attention formula.
arXiv Detail & Related papers (2024-10-15T03:06:37Z) - Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [72.77513633290056]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with that of a Hessian matrix evaluated on a deep learning model.
Our method captures intricate patterns and relationships, enhancing classification performance.
arXiv Detail & Related papers (2024-02-14T16:10:42Z) - Nonparametric Partial Disentanglement via Mechanism Sparsity: Sparse
Actions, Interventions and Sparse Temporal Dependencies [58.179981892921056]
This work introduces a novel principle for disentanglement we call mechanism sparsity regularization.
We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graph that relates them.
We show that the latent factors can be recovered by regularizing the learned causal graph to be sparse.
arXiv Detail & Related papers (2024-01-10T02:38:21Z) - Enriching Disentanglement: From Logical Definitions to Quantitative Metrics [59.12308034729482]
Disentangling the explanatory factors in complex data is a promising approach for data-efficient representation learning.
We establish relationships between logical definitions and quantitative metrics to derive theoretically grounded disentanglement metrics.
We empirically demonstrate the effectiveness of the proposed metrics by isolating different aspects of disentangled representations.
arXiv Detail & Related papers (2023-05-19T08:22:23Z) - ER: Equivariance Regularizer for Knowledge Graph Completion [107.51609402963072]
We propose a new regularizer, namely the Equivariance Regularizer (ER).
ER can enhance the generalization ability of the model by employing the semantic equivariance between the head and tail entities.
The experimental results indicate a clear and substantial improvement over the state-of-the-art relation prediction methods.
arXiv Detail & Related papers (2022-06-24T08:18:05Z) - Optimal Online Generalized Linear Regression with Stochastic Noise and
Its Application to Heteroscedastic Bandits [88.6139446295537]
We study the problem of online generalized linear regression in the stochastic setting, where the label is generated from a generalized linear model with possibly unbounded additive noise.
We provide a sharp analysis of the classical follow-the-regularized-leader (FTRL) algorithm to cope with the label noise.
We propose an algorithm based on FTRL to achieve the first variance-aware regret bound.
arXiv Detail & Related papers (2022-02-28T08:25:26Z) - Scaling Ensemble Distribution Distillation to Many Classes with Proxy
Targets [12.461503242570643]
Ensemble Distribution Distillation is an approach that allows a single model to efficiently capture both the predictive performance and uncertainty estimates of an ensemble.
For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum likelihood criterion.
Although theoretically principled, this criterion exhibits poor convergence when applied to large-scale tasks where the number of classes is very high.
arXiv Detail & Related papers (2021-05-14T17:50:14Z) - GroupifyVAE: from Group-based Definition to VAE-based Unsupervised
Representation Disentanglement [91.9003001845855]
VAE-based unsupervised disentanglement cannot be achieved without introducing additional inductive biases.
We address VAE-based unsupervised disentanglement by leveraging constraints derived from the group-theoretic definition as a non-probabilistic inductive bias.
We train 1800 models covering the most prominent VAE-based models on five datasets to verify the effectiveness of our method.
arXiv Detail & Related papers (2021-02-20T09:49:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.