Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- URL: http://arxiv.org/abs/2402.14035v3
- Date: Wed, 15 May 2024 12:42:04 GMT
- Title: Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- Authors: Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao
- Abstract summary: We propose creating a teaching committee comprising both foundation model teachers and complementary teachers.
Complementary teachers possess model characteristics akin to the student's, aiming to bridge the gap between the foundation model and specialized application models.
Our evaluations demonstrate that adding complementary teachers enhances student performance.
- Score: 43.5276936177329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in foundation models have yielded impressive performance across a wide range of tasks. Meanwhile, for specific applications, practitioners have been developing specialized application models. To enjoy the benefits of both kinds of models, one natural path is to transfer the knowledge in foundation models into specialized application models, which are generally more efficient for serving. Techniques from knowledge distillation may be applied here, where the application model learns to mimic the foundation model. However, specialized application models and foundation models differ substantially: they have large gaps in capacity, employ distinct architectures, use input features from different modalities, and are optimized on different distributions. These differences in model characteristics lead to significant challenges for distillation methods. In this work, we propose creating a teaching committee comprising both foundation model teachers and complementary teachers. Complementary teachers possess model characteristics akin to the student's, aiming to bridge the gap between the foundation model and specialized application models for a smoother knowledge transfer. Further, to accommodate the dissimilarity among the teachers in the committee, we introduce DiverseDistill, which allows the student to understand the expertise of each teacher and extract task knowledge. Our evaluations demonstrate that adding complementary teachers enhances student performance. Finally, DiverseDistill consistently outperforms baseline distillation methods, regardless of the teacher choices, resulting in significantly improved student performance.
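To make the committee idea concrete, the sketch below shows one way a student could be distilled from a foundation-model teacher and a complementary teacher simultaneously, with per-example weights deciding how much to trust each committee member. This is a minimal PyTorch sketch under assumed names (committee_distill_loss, weight_logits, temperature, alpha), not the authors' exact DiverseDistill objective.

```python
# Minimal sketch of committee-style distillation: the student is supervised by a
# foundation-model teacher and a "complementary" teacher whose characteristics are
# closer to the student's. Per-example teacher weights are predicted by the student
# (e.g., by a small head on its features), loosely mirroring the idea that the
# student should understand each teacher's expertise. Illustrative only, not the
# paper's exact DiverseDistill formulation.
import torch
import torch.nn.functional as F


def committee_distill_loss(student_logits, teacher_logits_list, weight_logits,
                           labels, temperature=2.0, alpha=0.5):
    """student_logits:      (batch, num_classes)
    teacher_logits_list: list of (batch, num_classes), e.g. [foundation, complementary]
    weight_logits:       (batch, num_teachers) unnormalized per-example teacher weights
    """
    # Supervised task loss on ground-truth labels.
    task_loss = F.cross_entropy(student_logits, labels)

    # Per-example softmax over teachers: how much to trust each committee member.
    teacher_weights = torch.softmax(weight_logits, dim=-1)

    # Temperature-softened KL divergence to each teacher, weighted per example.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill_loss = 0.0
    for t_idx, teacher_logits in enumerate(teacher_logits_list):
        p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
        per_example_kl = F.kl_div(log_p_student, p_teacher,
                                  reduction="none").sum(dim=-1)
        distill_loss = distill_loss + (teacher_weights[:, t_idx] * per_example_kl).mean()
    distill_loss = distill_loss * (temperature ** 2)

    return alpha * task_loss + (1.0 - alpha) * distill_loss
```

The per-teacher softmax lets the student lean more heavily on the complementary teacher whenever the foundation teacher's architecture, modality, or training distribution is far from its own, which is exactly the gap the committee is meant to bridge.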
Related papers
- CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation [57.91828170220308]
We propose a knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models.
Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies.
arXiv Detail & Related papers (2025-03-23T23:53:08Z) - Asymmetric Decision-Making in Online Knowledge Distillation: Unifying Consensus and Divergence [18.640219880439062]
This paper presents an innovative approach to leverage intermediate spatial representations.
We propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models.
arXiv Detail & Related papers (2025-03-09T16:32:25Z) - A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning [136.89318317245855]
MoErging aims to recycle expert models to create an aggregate system with improved performance or generalization.
A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application.
This survey includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method.
arXiv Detail & Related papers (2024-08-13T17:49:00Z) - Aligning Teacher with Student Preferences for Tailored Training Data Generation [40.85451525264779]
We propose ARTE, dubbed Aligning TeacheR with StudenT PreferencEs, to generate tailored training examples for Knowledge Distillation.
Specifically, we elicit draft questions and rationales from the teacher model, collect student preferences on these questions and rationales, and use the preferences to align the teacher.
Finally, we repeat the first step with the aligned teacher model to elicit tailored training examples for the student model on the target task.
arXiv Detail & Related papers (2024-06-27T14:51:17Z) - Low-Rank Knowledge Decomposition for Medical Foundation Models [37.52464627899668]
We develop a new perspective called "Knowledge Decomposition" to improve performance on specific medical tasks.
Low-Rank Knowledge Decomposition (LoRKD) incorporates low-rank expert modules and the efficient knowledge separation convolution.
Experiments show that the decomposed models achieve strong performance and transferability, even surpassing the original foundation models.
arXiv Detail & Related papers (2024-04-26T06:30:47Z) - Towards Efficient Task-Driven Model Reprogramming with Foundation Models [52.411508216448716]
Vision foundation models exhibit impressive power, benefiting from the extremely large model capacity and broad training data.
However, in practice, downstream scenarios may only support a small model due to limited computational resources or efficiency considerations.
This brings a critical challenge for the real-world application of foundation models: one has to transfer the knowledge of a foundation model to the downstream task.
arXiv Detail & Related papers (2023-04-05T07:28:33Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model; a hedged sketch of this style of geometry matching appears after this list.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Knowledge Distillation with the Reused Teacher Classifier [31.22117343316628]
We show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap.
Our technique achieves state-of-the-art results at the modest cost of a slightly reduced compression ratio, owing to the added projector (see the sketch after this list).
arXiv Detail & Related papers (2022-03-26T06:28:46Z) - Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
In contrast to most existing methods, which rely on effectively training student models given pretrained teachers, we aim to learn teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout the distillation process, and most existing methods allocate an equal weight to every teacher.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly deemed an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher).
This work provides an extensive study of nine different KD methods covering a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
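For the EmbedDistill entry above, which transfers the relative geometry among queries and documents rather than raw logits, the following is a hedged sketch of what such a geometry-matching loss could look like for a dual-encoder student. The function name and the in-batch score-matrix formulation are illustrative assumptions, not the paper's exact method.

```python
# Sketch of geometry-style distillation for retrieval: the student dual encoder is
# trained so that its query-document similarity matrix mimics the teacher's.
import torch
import torch.nn.functional as F


def geometry_distill_loss(student_q, student_d, teacher_q, teacher_d, temperature=1.0):
    """student_q/student_d: (batch, dim_s) query/document embeddings from the student
    teacher_q/teacher_d: (batch, dim_t) query/document embeddings from the teacher
    """
    # In-batch similarity matrices: row i scores query i against every document in the batch.
    s_scores = (student_q @ student_d.t()) / temperature
    with torch.no_grad():
        t_scores = (teacher_q @ teacher_d.t()) / temperature

    # Match relative geometry: each query's score distribution over the in-batch
    # documents should follow the teacher's.
    return F.kl_div(F.log_softmax(s_scores, dim=-1),
                    F.softmax(t_scores, dim=-1),
                    reduction="batchmean")
```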
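Likewise, for the reused-teacher-classifier entry, here is a minimal sketch of the general idea: the student keeps its own smaller backbone, a lightweight projector maps student features into the teacher's feature space, and the teacher's frozen classifier head is reused on top (the added projector is where the modest compression-ratio cost comes from). Module names and dimensions are assumptions for illustration, not necessarily the paper's exact architecture.

```python
# Sketch of reusing a frozen teacher classifier on top of projected student features.
import torch.nn as nn


class StudentWithReusedClassifier(nn.Module):
    def __init__(self, student_backbone, teacher_classifier,
                 student_dim=512, teacher_dim=2048):
        super().__init__()
        self.backbone = student_backbone              # compact student feature extractor
        self.projector = nn.Sequential(               # the "added projector"
            nn.Linear(student_dim, teacher_dim),
            nn.BatchNorm1d(teacher_dim),
            nn.ReLU(inplace=True),
        )
        self.classifier = teacher_classifier          # reused from the teacher, kept frozen
        for p in self.classifier.parameters():
            p.requires_grad = False

    def forward(self, x):
        feat = self.backbone(x)                       # (batch, student_dim)
        proj = self.projector(feat)                   # (batch, teacher_dim)
        return proj, self.classifier(proj)            # projected features + class logits
```

During distillation, the projected features would typically be aligned with the teacher's penultimate-layer features (for example with an L2 loss), while the reused classifier supplies the student's predictions.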
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.