Modality-specific Distillation
- URL: http://arxiv.org/abs/2101.01881v1
- Date: Wed, 6 Jan 2021 05:45:07 GMT
- Title: Modality-specific Distillation
- Authors: Woojeong Jin, Maziar Sanjabi, Shaoliang Nie, Liang Tan, Xiang Ren,
Hamed Firooz
- Abstract summary: We propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets.
Our idea aims at mimicking a teacher's modality-specific predictions by introducing an auxiliary loss term for each modality.
Because each modality has different importance for predictions, we also propose weighting approaches for the auxiliary losses.
- Score: 30.190082262375395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large neural networks are impractical to deploy on mobile devices due to
their heavy computational cost and slow inference. Knowledge distillation (KD)
is a technique to reduce the model size while retaining performance by
transferring knowledge from a large "teacher" model to a smaller "student"
model. However, KD on multimodal datasets such as vision-language datasets is
relatively unexplored and digesting such multimodal information is challenging
since different modalities present different types of information. In this
paper, we propose modality-specific distillation (MSD) to effectively transfer
knowledge from a teacher on multimodal datasets. Existing KD approaches can be
applied to a multimodal setup, but the student does not have access to
modality-specific predictions. Our idea aims at mimicking a teacher's
modality-specific predictions by introducing an auxiliary loss term for each
modality. Because each modality has different importance for predictions, we
also propose weighting approaches for the auxiliary losses, including a
meta-learning approach that learns the optimal weights on these loss terms. In
our experiments, we demonstrate the effectiveness of MSD and the weighting
scheme and show that MSD achieves better performance than standard KD.
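As a rough illustration of the idea above, the sketch below implements per-modality auxiliary distillation losses in PyTorch. It is a minimal sketch, not the paper's exact formulation: it assumes the modality-specific predictions are obtained by feeding one modality at a time (with the other zeroed out), uses a temperature-scaled KL divergence for each term, and fixes the per-modality weights by hand. The names `soft_kl`, `msd_loss`, and the `student(text, image)` call signature are hypothetical.

```python
# A minimal sketch of modality-specific distillation (MSD), assuming a
# vision-language classifier that takes (text, image) tensors. How the
# teacher's modality-specific predictions are produced (here: zeroing out
# the other modality) and the fixed weights are illustrative assumptions,
# not the paper's exact recipe.
import torch
import torch.nn.functional as F


def soft_kl(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence between teacher and student outputs."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)


def msd_loss(student, teacher, text, image, labels, weights=None, T=2.0):
    # Per-term weights; the paper proposes learning these (e.g. via
    # meta-learning), but here they are simply fixed constants.
    w = weights or {"joint": 1.0, "text": 0.5, "image": 0.5}

    # Task loss on the student's multimodal prediction.
    joint_s = student(text, image)
    loss = F.cross_entropy(joint_s, labels)

    with torch.no_grad():
        joint_t = teacher(text, image)                    # multimodal prediction
        text_t = teacher(text, torch.zeros_like(image))   # text-only prediction
        image_t = teacher(torch.zeros_like(text), image)  # image-only prediction

    # Ordinary KD term on the joint prediction.
    loss = loss + w["joint"] * soft_kl(joint_s, joint_t, T)

    # Auxiliary modality-specific terms: the student mimics the teacher's
    # text-only and image-only predictions, each with its own weight.
    text_s = student(text, torch.zeros_like(image))
    image_s = student(torch.zeros_like(text), image)
    loss = loss + w["text"] * soft_kl(text_s, text_t, T)
    loss = loss + w["image"] * soft_kl(image_s, image_t, T)
    return loss
```

In the paper itself the weights on the auxiliary terms are not fixed; the proposed weighting approaches, including the meta-learning one, learn them because each modality contributes differently to the prediction.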
Related papers
- Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport [46.91791643660991]
Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments.
These models struggle in the wild because the modalities used for training may be unavailable or degraded in quality.
In practice, only a subset of the training-time modalities may be available at test time.
Learning with privileged information enables models to exploit data from additional modalities that are only available during training.
arXiv Detail & Related papers (2024-01-27T19:44:15Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence (a minimal sketch of this standard objective appears after this list).
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
- Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing [21.215122347801696]
We aim to identify how different components in the KD pipeline affect the resulting performance.
We propose Distiller, a meta KD framework that combines a broad range of techniques across different stages of the KD pipeline.
We find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm.
arXiv Detail & Related papers (2021-09-23T02:12:28Z)
- Boosting Light-Weight Depth Estimation Via Knowledge Distillation [21.93879961636064]
We propose a lightweight network that can accurately estimate depth maps using minimal computing resources.
We achieve this by designing a compact model architecture that maximally reduces model complexity.
Our method achieves comparable performance to state-of-the-art methods while using only 1% of their parameters.
arXiv Detail & Related papers (2021-05-13T08:42:42Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming limitations of existing KD approaches by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
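For reference, several of the entries above, and the vanilla KD baseline that MSD is compared against, build on the standard KL-divergence distillation objective mentioned in the R2KD summary. Below is a minimal sketch of that common baseline; the temperature and mixing weight are illustrative defaults, not values taken from any listed paper.

```python
# Standard (single-teacher, single-student) knowledge distillation loss:
# a temperature-softened KL term plus ordinary cross-entropy on hard labels.
# T and alpha are illustrative defaults, not values from any listed paper.
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps the soft-target gradients on a comparable scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```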
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.