Adaptive Weighting in Knowledge Distillation: An Axiomatic Framework for Multi-Scale Teacher Ensemble Optimization
- URL: http://arxiv.org/abs/2601.17910v1
- Date: Sun, 25 Jan 2026 17:09:50 GMT
- Title: Adaptive Weighting in Knowledge Distillation: An Axiomatic Framework for Multi-Scale Teacher Ensemble Optimization
- Authors: Aaron R. Flouro, Shawn P. Chadwick
- Abstract summary: This paper develops an operator-agnostic framework for adaptive weighting in knowledge distillation across three complementary scales: token, task, and context. We establish existence and non-uniqueness of conforming operators, characterize convergence of gradient-based optimization under standard assumptions, analyze stability and robustness, and provide an abstract formulation of safety-constrained distillation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation with multiple teachers is increasingly used to improve robustness, efficiency, and safety, yet existing approaches rely largely on heuristic or implementation-specific weighting schemes. This paper develops an operator-agnostic axiomatic framework for adaptive weighting in multi-teacher knowledge distillation across three complementary scales: token, task, and context. We formalize structural conditions under which adaptive weighting operators are well-defined, admit multiple non-equivalent implementations, and can be hierarchically composed via product-structure normalization. Within this framework, we establish existence and non-uniqueness of conforming operators, characterize convergence of gradient-based optimization under standard assumptions, analyze stability and perturbation robustness, and provide an abstract formulation of safety-constrained distillation. The results decouple theoretical guarantees from specific weighting formulas, enabling principled analysis of adaptive distillation methods under heterogeneity, distribution shift, and safety constraints.
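The framework is deliberately operator-agnostic, so no single formula is canonical; still, a minimal sketch may help fix ideas. Below, hypothetical per-teacher weights at the token, task, and context scales are composed multiplicatively and renormalized (one way to read "product-structure normalization"), then used to mix teacher distributions into a soft distillation target. All names and formulas are illustrative assumptions, not the paper's operators.

```python
import numpy as np

def product_normalize(*weight_sets):
    """Compose per-teacher weights from several scales multiplicatively,
    then renormalize to a distribution over teachers (illustrative only)."""
    w = np.prod(np.stack(weight_sets), axis=0)
    return w / w.sum()

def distill_target(teacher_probs, token_w, task_w, context_w):
    """Mix teacher predictive distributions with hierarchically composed weights.

    teacher_probs: (n_teachers, vocab) rows summing to 1.
    *_w: (n_teachers,) nonnegative weights at each scale (assumed given).
    """
    w = product_normalize(token_w, task_w, context_w)   # (n_teachers,)
    target = w @ teacher_probs                          # convex mix of rows
    return target / target.sum()                        # guard renormalization

# Toy example: three teachers over a 4-symbol vocabulary.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=3)          # teacher distributions
tok = np.array([0.9, 0.5, 0.2])                # e.g. token-level confidence
tsk = np.array([1.0, 1.0, 0.3])                # e.g. task affinity
ctx = np.array([0.8, 1.0, 1.0])                # e.g. context match
print(distill_target(P, tok, tsk, ctx))        # soft target for the student
```

Because the composition is multiplicative, a teacher down-weighted at any one scale is suppressed overall, which is one natural reading of hierarchical composition across scales.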
Related papers
- ODELoRA: Training Low-Rank Adaptation by Solving Ordinary Differential Equations
Low-rank adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning method in deep transfer learning. We propose a novel continuous-time optimization dynamic for LoRA factor matrices in the form of an ordinary differential equation (ODE). We show that ODELoRA achieves stable feature learning, a property that is crucial for training deep neural networks at different scales of problem dimensionality.
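The summary gives no equations; one plain reading of a "continuous-time optimization dynamic" for the LoRA factors is gradient flow dA/dt = -dL/dA, dB/dt = -dL/dB, integrated here with an explicit Euler step on a toy least-squares objective. The loss, shapes, and integrator are assumptions for illustration, not ODELoRA's actual construction.

```python
import numpy as np

# Illustrative gradient-flow dynamics for LoRA factors A, B on a toy
# objective L = ||(W0 + B @ A) X - Y||^2 / 2.
rng = np.random.default_rng(1)
d, r, n = 8, 2, 32
W0 = rng.normal(size=(d, d))
X = rng.normal(size=(d, n))
Y = rng.normal(size=(d, n))
A = rng.normal(scale=0.1, size=(r, d))
B = np.zeros((d, r))                       # standard LoRA init: B = 0

def residual(A, B):
    return (W0 + B @ A) @ X - Y            # (d, n)

dt = 1e-3
for _ in range(2000):                      # explicit Euler on the ODE
    R = residual(A, B)                     # dA/dt = -dL/dA, dB/dt = -dL/dB
    gA = B.T @ R @ X.T                     # dL/dA
    gB = R @ X.T @ A.T                     # dL/dB
    A -= dt * gA
    B -= dt * gB

print(np.linalg.norm(residual(A, B)))      # loss decreases along the flow
```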
arXiv Detail & Related papers (2026-02-07T10:19:36Z)
- Recursive Meta-Distillation: An Axiomatic Framework for Iterative Knowledge Refinement
We introduce an axiomatic and operator-theoretic framework for iterative knowledge distillation as a sequence of probability-distribution operators with explicit anchoring to base teachers. The results provide a theoretical basis for understanding stability, bias-variance behavior, and failure modes in iterative and multi-teacher distillation under capacity constraints.
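As a hedged illustration of "operators with explicit anchoring", the sketch below iterates a toy sharpening operator while convexly pulling each iterate back toward the base-teacher distribution; the operator and constants are assumptions, not the paper's.

```python
import numpy as np

def sharpen(p, tau=0.7):
    """Toy distillation operator: temperature-sharpen and renormalize."""
    q = p ** (1.0 / tau)
    return q / q.sum()

def recursive_distill(p_base, steps=10, alpha=0.3, tau=0.7):
    """Iterate q_{k+1} = (1 - alpha) * sharpen(q_k) + alpha * p_base.

    The convex pull toward p_base is one way to realize 'explicit
    anchoring to base teachers'; all details are illustrative."""
    q = p_base.copy()
    for _ in range(steps):
        q = (1 - alpha) * sharpen(q, tau) + alpha * p_base
    return q

p_base = np.array([0.5, 0.3, 0.15, 0.05])  # averaged base-teacher distribution
print(recursive_distill(p_base))            # anchored fixed point
```

With alpha = 0 the sharpening iteration collapses to a point mass, which is exactly the kind of failure mode an anchoring term is meant to rule out.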
arXiv Detail & Related papers (2026-01-19T14:39:40Z)
- Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation
We develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators.
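The paper deliberately avoids prescribing a formula; for concreteness only, here is one classical probability-domain aggregator (a weighted log-opinion pool). No claim is made that it satisfies the paper's five axioms.

```python
import numpy as np

def log_opinion_pool(teacher_probs, weights):
    """Weighted geometric mean of teacher distributions, renormalized.

    A classical aggregator acting directly on distributions rather than
    logits; shown only as an example of a probability-domain operator."""
    w = np.asarray(weights) / np.sum(weights)
    log_p = w @ np.log(teacher_probs)      # (vocab,) weighted log-mixture
    p = np.exp(log_p - log_p.max())        # stabilize before normalizing
    return p / p.sum()

P = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.4, 0.1],
              [0.6, 0.1, 0.3]])
print(log_opinion_pool(P, [2.0, 1.0, 1.0]))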
arXiv Detail & Related papers (2026-01-14T05:10:36Z)
- Likelihood-guided Regularization in Attention Based Models
We propose a likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs). We show that the Ising regularizer leads to better-calibrated probability estimates and structured feature selection through uncertainty-aware attention mechanisms.
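The summary does not spell out the regularizer; as a loose illustration, the sketch below computes the expected Ising energy of independent Bernoulli gates (e.g., over attention heads), which is one generic shape a "variational Ising-based" penalty can take. Everything here is an assumption.

```python
import numpy as np

def ising_regularizer(gate_probs, coupling=1.0, field=0.1):
    """Expected Ising energy of independent Bernoulli gates s_i in {-1, +1}.

    With P(s_i = +1) = p_i, the mean spin is m_i = 2 p_i - 1, and for a
    chain coupling E[-J * s_i * s_{i+1}] = -J * m_i * m_{i+1}. Purely
    illustrative; the paper's construction may differ."""
    m = 2.0 * np.asarray(gate_probs) - 1.0        # mean spins
    pair = -coupling * np.sum(m[:-1] * m[1:])     # rewards agreeing gates
    ext = -field * np.sum(m)                      # biases gates on or off
    return pair + ext

p = np.array([0.9, 0.8, 0.2, 0.1])                # e.g. head-gate probabilities
print(ising_regularizer(p))                        # added to the training loss
```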
arXiv Detail & Related papers (2025-11-17T10:38:09Z)
- An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based Models
Fuzzy rule-based models excel in interpretability and have seen widespread application across diverse fields. They face challenges such as complex design specifications and scalability issues with large datasets. This paper proposes an Integrated Fusion Framework that merges the strengths of both paradigms to enhance model performance and interpretability.
arXiv Detail & Related papers (2025-11-11T10:28:23Z)
- Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models
This paper addresses the limitations of large-scale language models in safety alignment and robustness. It proposes a fine-tuning method that combines contrastive distillation with noise-robust training. Results show that the method significantly outperforms existing baselines in knowledge transfer, robustness, and overall safety.
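Neither the loss nor the training scheme is given in the summary; the toy objective below combines a distillation pull toward a "safe" teacher, a margin-based contrastive push away from an "unsafe" one, and Gaussian noise on the student logits as a stand-in for noise-robust training. All terms and constants are illustrative assumptions.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def alignment_loss(student_logits, teacher_safe, teacher_unsafe,
                   noise_scale=0.1, margin=1.0, rng=None):
    """Toy contrastive-distillation objective under logit noise."""
    rng = rng or np.random.default_rng(0)
    noisy = student_logits + rng.normal(scale=noise_scale,
                                        size=student_logits.shape)
    s = softmax(noisy)
    pull = kl(teacher_safe, s)                       # distill from safe teacher
    push = max(0.0, margin - kl(teacher_unsafe, s))  # contrastive margin term
    return pull + push

s_logits = np.array([2.0, 0.5, -1.0])
p_safe = np.array([0.8, 0.15, 0.05])
p_unsafe = np.array([0.1, 0.2, 0.7])
print(alignment_loss(s_logits, p_safe, p_unsafe))
```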
arXiv Detail & Related papers (2025-10-31T00:54:33Z)
- Optimal Regularization Under Uncertainty: Distributional Robustness and Convexity Constraints
We introduce a framework for distributionally robust optimal regularization. We show how the resulting robust regularizers interpolate between regularizers computed from the training distribution and uniform priors.
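Read schematically, the interpolation can be pictured as a prior that mixes the empirical training distribution with a uniform one; the sketch below is only that schematic reading, not the paper's construction.

```python
import numpy as np

def robust_prior(empirical_probs, eps):
    """Mix the empirical training distribution with a uniform prior.

    At eps = 0 the penalty trusts the training distribution; at eps = 1
    it falls back to a uniform prior. A schematic reading only."""
    u = np.full_like(empirical_probs, 1.0 / len(empirical_probs))
    return (1.0 - eps) * empirical_probs + eps * u

p_emp = np.array([0.6, 0.25, 0.1, 0.05])
for eps in (0.0, 0.5, 1.0):
    print(eps, -np.log(robust_prior(p_emp, eps)))  # per-class log-prior penalty
```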
arXiv Detail & Related papers (2025-10-03T19:35:38Z)
- Rectifying Conformity Scores for Better Conditional Coverage
We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage.
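A minimal split-conformal sketch follows, with a fixed monotone transform standing in for the paper's trainable one; monotone maps leave the calibrated quantile ordering, and hence marginal coverage, intact.

```python
import numpy as np

def split_conformal(cal_scores, test_scores, alpha=0.1, transform=None):
    """Split conformal prediction: calibrate a quantile of (optionally
    transformed) conformity scores, then accept test points below it."""
    t = transform or (lambda s: s)
    cs = t(np.asarray(cal_scores))
    n = len(cs)
    # Finite-sample-corrected quantile level for 1 - alpha coverage.
    q = np.quantile(cs, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return t(np.asarray(test_scores)) <= q

rng = np.random.default_rng(2)
cal = np.abs(rng.normal(size=500))      # e.g. |y - y_hat| residuals
test = np.abs(rng.normal(size=10))
print(split_conformal(cal, test, alpha=0.1, transform=np.log1p))
```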
arXiv Detail & Related papers (2025-02-22T19:54:14Z)
- Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems
We analyze smoothed (perturbed) policies, adding controlled random perturbations to the direction used by the linear oracle. Our main contribution is a generalization bound that decomposes the excess risk into perturbation bias, statistical estimation error, and optimization error. We illustrate the scope of the results on applications such as vehicle scheduling, highlighting how smoothing enables both tractable training and controlled generalization.
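The perturbed-oracle idea can be sketched directly: average the one-hot argmax solution over Gaussian perturbations of the score direction. The noise law, sigma, and the toy argmax oracle are illustrative assumptions.

```python
import numpy as np

def perturbed_argmax_oracle(scores, sigma=0.5, n_samples=200, rng=None):
    """Monte Carlo estimate of a smoothed policy: average the one-hot
    argmax solution over Gaussian perturbations of the score direction."""
    rng = rng or np.random.default_rng(3)
    probs = np.zeros(len(scores))
    for _ in range(n_samples):
        z = scores + sigma * rng.normal(size=len(scores))  # perturbed direction
        probs[np.argmax(z)] += 1.0                         # linear oracle = argmax
    return probs / n_samples          # smooth policy instead of a hard argmax

print(perturbed_argmax_oracle(np.array([1.0, 0.9, -0.5])))
```

Larger sigma trades more perturbation bias for a smoother, easier-to-train policy, which mirrors the bias term in the stated risk decomposition.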
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
- Likelihood Ratio Confidence Sets for Sequential Decision Making
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
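A textbook instance of the construction, for a Bernoulli mean with a plug-in (past-data-only) numerator, is sketched below; all details beyond the likelihood-ratio idea are assumptions.

```python
import numpy as np

def lr_confidence_set(xs, grid, alpha=0.05):
    """Anytime-valid likelihood-ratio confidence set for a Bernoulli mean.

    The numerator is a sequential plug-in likelihood using a smoothed
    estimate built from past observations only, making the ratio a
    nonnegative martingale under the true parameter; theta stays in the
    set while the ratio remains below 1/alpha (Ville's inequality)."""
    log_num = 0.0
    log_den = np.zeros_like(grid)
    heads, n = 1.0, 2.0                          # Laplace-smoothed estimate
    for x in xs:
        p_hat = heads / n                        # depends on the past only
        log_num += np.log(p_hat if x else 1.0 - p_hat)
        log_den += np.log(grid if x else 1.0 - grid)
        heads, n = heads + x, n + 1.0
    return grid[log_num - log_den < np.log(1.0 / alpha)]

rng = np.random.default_rng(4)
xs = rng.random(400) < 0.3                       # Bernoulli(0.3) draws
grid = np.linspace(0.01, 0.99, 99)
print(lr_confidence_set(xs, grid))               # contains 0.3 w.h.p.
```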
arXiv Detail & Related papers (2023-11-08T00:10:21Z)
- Target-Embedding Autoencoders for Supervised Representation Learning
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features and predictive of targets.
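A linear toy version of the TEA objective: encode targets, decode them back, and make the latent code predictable from features, trained jointly. The shapes and the linear parameterization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, dx, dy, dz = 200, 10, 30, 4             # samples, feature/target/latent dims
X = rng.normal(size=(n, dx))
Y = rng.normal(size=(n, dy))

E = rng.normal(scale=0.1, size=(dy, dz))   # target encoder
D = rng.normal(scale=0.1, size=(dz, dy))   # target decoder
F = rng.normal(scale=0.1, size=(dx, dz))   # feature-to-latent predictor

lr, lam = 1e-2, 1.0
for _ in range(500):
    Z = Y @ E                               # latent target embedding
    rec = Z @ D - Y                         # reconstruction residual
    pred = X @ F - Z                        # predictability residual
    gD = Z.T @ rec / n
    gE = (Y.T @ (rec @ D.T) - lam * Y.T @ pred) / n
    gF = lam * X.T @ pred / n
    E -= lr * gE
    D -= lr * gD
    F -= lr * gF

print(np.mean(rec**2), np.mean(pred**2))   # reconstruction / predictability MSE
```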
arXiv Detail & Related papers (2020-01-23T02:37:10Z)