Matryoshka Model Learning for Improved Elastic Student Models
- URL: http://arxiv.org/abs/2505.23337v2
- Date: Mon, 02 Jun 2025 09:31:47 GMT
- Title: Matryoshka Model Learning for Improved Elastic Student Models
- Authors: Chetan Verma, Aditya Srinivas Timmaraju, Cho-Jui Hsieh, Suyash Damle, Ngot Bui, Yang Zhang, Wen Chen, Xin Liu, Prateek Jain, Inderjit S Dhillon
- Abstract summary: MatTA is a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. We demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.
- Score: 62.154536258259384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating 20% improvement on a key metric. We also demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.
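As a rough illustration of the recipe above, here is a minimal sketch assuming standard softened-logit distillation and a TA whose layer prefixes serve as the elastic Student models; the module names, depths, and loss weights are illustrative assumptions, not the paper's actual architecture or training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_kd(student_logits, teacher_logits, T=2.0):
    """Standard softened-logit KL distillation term."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

class ElasticTA(nn.Module):
    """Toy TA: a deep MLP whose first-k-block prefixes act as elastic Students."""
    def __init__(self, d_in=32, d_hidden=64, n_classes=10, n_layers=6):
        super().__init__()
        self.inp = nn.Linear(d_in, d_hidden)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU())
            for _ in range(n_layers)
        )
        # One classification head per servable prefix depth.
        self.heads = nn.ModuleList(nn.Linear(d_hidden, n_classes) for _ in range(n_layers))

    def forward(self, x, depth):
        h = torch.relu(self.inp(x))
        for block in self.blocks[:depth]:
            h = block(h)
        return self.heads[depth - 1](h)

teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))  # stands in for the Teacher
ta = ElasticTA()
opt = torch.optim.Adam(ta.parameters(), lr=1e-3)

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
with torch.no_grad():
    t_logits = teacher(x)

# One training step: every prefix depth (a candidate Student) learns from both
# the Teacher's soft labels and the hard labels, so a single run yields several
# servable models that trade accuracy for serving cost.
loss = 0.0
for depth in (2, 4, 6):  # elastic serving depths (assumed)
    logits = ta(x, depth=depth)
    loss = loss + soft_kd(logits, t_logits) + F.cross_entropy(logits, y)
opt.zero_grad(); loss.backward(); opt.step()
```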
Related papers
- Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models [6.324684465674387]
Large language models (LLMs) offer superior ranking capabilities, but they are challenging to deploy in real-time systems because of their high inference latency. We propose a novel framework that distills a high-performing LLM into a more efficient, low-latency student model. The student model has been successfully deployed in production at Walmart.com with significantly positive metrics.
arXiv Detail & Related papers (2025-05-11T20:00:00Z) - CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation [57.91828170220308]
We propose a knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models. Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies.
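The entry above only states the idea at a high level; a minimal hedged sketch of feature-level alignment between a large foundation backbone and an edge student might look as follows, where the projector, feature widths, and MSE objective are assumptions for illustration rather than CustomKD's actual recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim = 1024, 256          # assumed feature widths
projector = nn.Linear(teacher_dim, student_dim)

def feature_align_loss(student_feat, teacher_feat):
    """Align student features with teacher features projected into the student space."""
    target = projector(teacher_feat)
    return F.mse_loss(student_feat, target)

# Usage with dummy features standing in for backbone outputs.
s_feat = torch.randn(8, student_dim, requires_grad=True)
t_feat = torch.randn(8, teacher_dim)
loss = feature_align_loss(s_feat, t_feat)
loss.backward()
```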
arXiv Detail & Related papers (2025-03-23T23:53:08Z) - All models are wrong, some are useful: Model Selection with Limited Labels [49.62984196182567]
We introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers.
We show that MODEL SELECTOR drastically reduces the need for labeled data while consistently picking the best or near-best performing model.
Our results further highlight the robustness of MODEL SELECTOR in model selection, as it reduces the labeling cost by up to 72.41% when selecting a near-best model.
arXiv Detail & Related papers (2024-10-17T14:45:56Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, while reducing training time by up to a factor of four.
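As a loose illustration of the online-module idea described above: a frozen teacher backbone gets a small trainable head that is optimized in the same step as the student. The module sizes, the head's label loss, and the combined objective are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_t, d_s = 100, 64, 32
teacher_backbone = nn.Sequential(nn.Embedding(vocab, d_t), nn.Linear(d_t, d_t), nn.ReLU())
for p in teacher_backbone.parameters():
    p.requires_grad_(False)                    # frozen pre-trained teacher body

online_head = nn.Linear(d_t, vocab)            # small trainable "online module"
student = nn.Sequential(nn.Embedding(vocab, d_s), nn.Linear(d_s, vocab))
opt = torch.optim.Adam(list(online_head.parameters()) + list(student.parameters()), lr=1e-3)

tokens = torch.randint(0, vocab, (4, 16))
labels = torch.randint(0, vocab, (4, 16))

t_logits = online_head(teacher_backbone(tokens))   # adapted teacher distribution
s_logits = student(tokens)

# The online head keeps fitting the data while the student mimics its current
# output, so teacher head and student are trained concurrently in one step.
head_ce = F.cross_entropy(t_logits.reshape(-1, vocab), labels.reshape(-1))
kd = F.kl_div(F.log_softmax(s_logits, dim=-1),
              F.softmax(t_logits.detach(), dim=-1), reduction="batchmean")
loss = head_ce + kd
opt.zero_grad(); loss.backward(); opt.step()
```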
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Collaborative Learning for Enhanced Unsupervised Domain Adaptation [15.97351561456467]
Collaborative Learning for UDA (CLDA) is a method that updates the teacher's non-salient parameters using the student model and, at the same time, uses the updated teacher model to improve the student's UDA performance. CLDA achieves an improvement of +0.7% mIoU for the teacher model and +1.4% mIoU for the student model over the baseline on the GTA-to-Cityscapes benchmark.
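A very rough sketch of the collaboration described above, assuming the teacher and student share an architecture: low-magnitude ("non-salient") teacher weights are nudged toward the student's, and the refreshed teacher then pseudo-labels unlabeled target data for the student. The saliency criterion, blend factor, and pseudo-labeling loss are illustrative assumptions, not CLDA's actual procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 5))
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 5))

@torch.no_grad()
def update_nonsalient(teacher, student, quantile=0.3, alpha=0.5):
    """Blend student weights into the teacher's lowest-magnitude entries."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        thresh = t_p.abs().flatten().quantile(quantile)
        mask = t_p.abs() <= thresh                     # "non-salient" entries
        t_p[mask] = (1 - alpha) * t_p[mask] + alpha * s_p[mask]

opt = torch.optim.SGD(student.parameters(), lr=1e-2)
x_target = torch.randn(8, 16)                          # unlabeled target batch

update_nonsalient(teacher, student)                    # student refreshes teacher
with torch.no_grad():
    pseudo = teacher(x_target).argmax(dim=-1)          # teacher guides student
loss = F.cross_entropy(student(x_target), pseudo)
opt.zero_grad(); loss.backward(); opt.step()
```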
arXiv Detail & Related papers (2024-09-04T13:35:15Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
Evaluated across different benchmarks, and with a proper strategy, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Model Stock: All we need is just a few fine-tuned models [34.449901046895185]
This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance.
We use significantly fewer models to obtain the final weights, yet achieve superior accuracy.
We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures.
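Mechanism details aside, the core operation is a weight-space merge of a few fine-tuned checkpoints anchored at the pretrained model; the sketch below uses a fixed blend ratio as an illustrative stand-in for Model Stock's derived interpolation ratio.

```python
import torch
import torch.nn as nn

def merge_checkpoints(pretrained_sd, finetuned_sds, ratio=0.6):
    """ratio weights the fine-tuned average; (1 - ratio) keeps the pretrained anchor."""
    merged = {}
    for name, anchor in pretrained_sd.items():
        avg = torch.stack([sd[name] for sd in finetuned_sds]).mean(dim=0)
        merged[name] = ratio * avg + (1 - ratio) * anchor
    return merged

# Usage with toy models standing in for CLIP-scale checkpoints.
base = nn.Linear(8, 3)
ft1, ft2 = nn.Linear(8, 3), nn.Linear(8, 3)
merged_sd = merge_checkpoints(base.state_dict(),
                              [ft1.state_dict(), ft2.state_dict()])
final = nn.Linear(8, 3)
final.load_state_dict(merged_sd)
```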
arXiv Detail & Related papers (2024-03-28T15:57:20Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the whole distillation process.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
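To make the differential weighting idea concrete, here is a hedged sketch in which a small gate scores each teacher per example and the distillation loss is the weighted sum of per-teacher KL terms; the paper learns the selection with reinforcement learning, so the differentiable softmax gate below is a simplified stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_classes, n_teachers = 16, 4, 3
teachers = [nn.Linear(d, n_classes) for _ in range(n_teachers)]
student = nn.Linear(d, n_classes)
gate = nn.Linear(d, n_teachers)               # scores teachers per example
opt = torch.optim.Adam(list(student.parameters()) + list(gate.parameters()), lr=1e-3)

x = torch.randn(8, d)
with torch.no_grad():
    t_logits = torch.stack([t(x) for t in teachers], dim=1)   # (B, T, C)

w = F.softmax(gate(x), dim=-1)                                # (B, T) per-example weights
log_p_s = F.log_softmax(student(x), dim=-1).unsqueeze(1)      # (B, 1, C)
p_t = F.softmax(t_logits, dim=-1)                             # (B, T, C)
per_teacher_kl = (p_t * (p_t.log() - log_p_s)).sum(-1)        # (B, T)
loss = (w * per_teacher_kl).sum(-1).mean()                    # weighted multi-teacher KD
opt.zero_grad(); loss.backward(); opt.step()
```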
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
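As a simplified, hedged sketch of the objective described above: a KL term pulls the student's last-layer attention distributions toward the teacher's (MiniLM additionally transfers value relations, omitted here). The attention scores are faked with random tensors so the snippet is self-contained; in practice they come from the models' last Transformer layer.

```python
import torch
import torch.nn.functional as F

B, H, L = 2, 4, 8                 # batch, heads, sequence length (assumed)
teacher_scores = torch.randn(B, H, L, L)
student_scores = torch.randn(B, H, L, L, requires_grad=True)

def attn_distill_loss(student_scores, teacher_scores):
    """Mean KL between per-position attention distributions."""
    p_t = F.softmax(teacher_scores, dim=-1)
    log_p_s = F.log_softmax(student_scores, dim=-1)
    return (p_t * (p_t.log() - log_p_s)).sum(-1).mean()

loss = attn_distill_loss(student_scores, teacher_scores)
loss.backward()
```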
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.