m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers
- URL: http://arxiv.org/abs/2402.16918v3
- Date: Sun, 7 Jul 2024 14:03:04 GMT
- Title: m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers
- Authors: Ka Man Lo, Yiming Liang, Wenyu Du, Yuantao Fan, Zili Wang, Wenhao Huang, Lei Ma, Jie Fu
- Abstract summary: We propose module-to-module knowledge distillation (m2mKD) for transferring knowledge between modules.
We evaluate m2mKD on two modular neural architectures: Neural Attentive Circuits (NACs) and Vision Mixture-of-Experts (V-MoE).
Applying m2mKD to NACs yields significant improvements in IID accuracy on Tiny-ImageNet and OOD robustness on Tiny-ImageNet-R.
- Score: 27.73393245438193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modular neural architectures are gaining attention for their powerful generalization and efficient adaptation to new domains. However, training these models poses challenges due to optimization difficulties arising from their intrinsic sparse connectivity. Leveraging knowledge from monolithic models through techniques like knowledge distillation can facilitate training and enable the integration of diverse knowledge. Nevertheless, conventional knowledge distillation approaches are not tailored to modular models and struggle with unique architectures and enormous parameter counts. Motivated by these challenges, we propose module-to-module knowledge distillation (m2mKD) for transferring knowledge between modules. m2mKD pairs each teacher module of a pretrained monolithic model and each student module of a modular model with a shared meta model, encouraging the student module to mimic the behaviour of its teacher module. We evaluate m2mKD on two modular neural architectures: Neural Attentive Circuits (NACs) and Vision Mixture-of-Experts (V-MoE). Applying m2mKD to NACs yields significant improvements in IID accuracy on Tiny-ImageNet (up to 5.6%) and OOD robustness on Tiny-ImageNet-R (up to 4.2%). Additionally, the V-MoE-Base model trained with m2mKD achieves 3.5% higher accuracy than end-to-end training on ImageNet-1k. Code is available at https://github.com/kamanphoebe/m2mKD.
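The core mechanism described in the abstract can be pictured with a minimal sketch: a teacher module taken from the pretrained monolithic model and a student module of the modular model are each attached to a shared meta model, and the student is trained so that its combined output mimics the teacher's. The PyTorch code below is an illustrative sketch under assumptions, not the repository's implementation; the module shapes, the frozen meta model, and the plain MSE mimicry loss are stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim = 256

# Shared meta model that both the teacher and the student module plug into.
meta_model = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.GELU(),
)
# Teacher module: e.g. one block taken from the pretrained monolithic model.
teacher_module = nn.Linear(hidden_dim, hidden_dim)
# Student module: the corresponding module of the modular model.
student_module = nn.Linear(hidden_dim, hidden_dim)

# Freeze everything except the student module.
for p in list(meta_model.parameters()) + list(teacher_module.parameters()):
    p.requires_grad = False

optimizer = torch.optim.AdamW(student_module.parameters(), lr=1e-4)

def m2m_kd_step(x: torch.Tensor) -> float:
    """One distillation step: student + meta model mimics teacher + meta model."""
    with torch.no_grad():
        teacher_out = meta_model(teacher_module(x))
    student_out = meta_model(student_module(x))
    loss = F.mse_loss(student_out, teacher_out)  # simple mimicry loss (assumption)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch of hidden states standing in for real intermediate activations.
x = torch.randn(8, hidden_dim)
print(m2m_kd_step(x))
```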
Related papers
- Improving DNN Modularization via Activation-Driven Training [5.4070914322511925]
MODA promotes inherent modularity within a DNN model by directly regulating the activation outputs of its layers.
It accomplishes modularization with 29% less training time.
It improves the accuracy of a target class by 12% on average while ensuring minimal impact on the accuracy of other classes.
arXiv Detail & Related papers (2024-11-01T23:07:33Z) - Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models [31.960749305728488]
We introduce a novel concept dubbed the modular neural tangent kernel (mNTK).
We show that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$.
We propose a novel training strategy termed Modular Adaptive Training (MAT), which updates only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold.
arXiv Detail & Related papers (2024-05-13T07:46:48Z) - Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z) - Module-wise Adaptive Distillation for Multimodality Foundation Models [125.42414892566843]
Multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes.
One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer.
Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to distill the modules with greater contributions more frequently (a toy sketch of this scheduling idea appears after this list).
arXiv Detail & Related papers (2023-10-06T19:24:00Z) - ModuleFormer: Modularity Emerges from Mixture-of-Experts [60.6148988099284]
This paper proposes a new neural network architecture, ModuleFormer, to improve the efficiency and flexibility of large language models.
Unlike previous SMoE-based modular language models, ModuleFormer can induce modularity from uncurated data.
arXiv Detail & Related papers (2023-06-07T17:59:57Z) - Modular Deep Learning [120.36599591042908]
Transfer learning has recently become the dominant paradigm of machine learning.
It remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference.
Modular deep learning has emerged as a promising solution to these challenges.
arXiv Detail & Related papers (2023-02-22T18:11:25Z) - Deep Model Assembling [31.88606253639418]
This paper studies a divide-and-conquer strategy to train large models.
It divides a large model into smaller modules, trains them independently, and reassembles the trained modules to obtain the target model.
We introduce a global, shared meta model to implicitly link all the modules together.
This enables us to train highly compatible modules that collaborate effectively when they are assembled together.
arXiv Detail & Related papers (2022-12-08T08:04:06Z) - Neural Attentive Circuits [93.95502541529115]
We introduce a general-purpose, yet modular neural architecture called Neural Attentive Circuits (NACs).
NACs learn the parameterization and a sparse connectivity of neural modules without using domain knowledge.
NACs achieve an 8x speedup at inference time while losing less than 3% performance.
arXiv Detail & Related papers (2022-10-14T18:00:07Z) - Neural Network Module Decomposition and Recomposition [35.21448933547118]
We propose a modularization method that decomposes a deep neural network (DNN) into small modules from a functionality perspective.
We demonstrate that the proposed method can decompose and recompose DNNs with a high compression ratio and high accuracy.
arXiv Detail & Related papers (2021-12-25T08:36:47Z)
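The module-selection idea noted in the Module-wise Adaptive Distillation entry above (distill more often the module whose distillation reduced the loss the most) can be pictured with a toy loop. Everything below is an illustrative assumption rather than that paper's method: the module names, the epsilon-greedy choice, and the moving-average contribution estimate are stand-ins.

```python
import random

# Hypothetical module names of a multimodal student; purely illustrative.
modules = ["vision_encoder", "text_encoder", "fusion_head"]
# Running estimate of how much distilling each module reduced the loss.
decrements = {m: 0.0 for m in modules}

def pick_module(epsilon: float = 0.1) -> str:
    """Mostly pick the module with the largest recorded loss decrement."""
    if random.random() < epsilon:
        return random.choice(modules)  # occasional exploration
    return max(decrements, key=decrements.get)

def distill_once(module: str) -> float:
    """Placeholder for one distillation update; returns the observed loss drop."""
    return random.uniform(0.0, 1.0)  # dummy value for this sketch

for step in range(100):
    m = pick_module()
    drop = distill_once(m)
    # Exponential moving average, so contribution estimates adapt over training.
    decrements[m] = 0.9 * decrements[m] + 0.1 * drop
```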
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.