Model Uncertainty-Aware Knowledge Amalgamation for Pre-Trained Language
Models
- URL: http://arxiv.org/abs/2112.07327v1
- Date: Tue, 14 Dec 2021 12:26:24 GMT
- Title: Model Uncertainty-Aware Knowledge Amalgamation for Pre-Trained Language
Models
- Authors: Lei Li, Yankai Lin, Xuancheng Ren, Guangxiang Zhao, Peng Li, Jie Zhou,
Xu Sun
- Abstract summary: We propose a novel model reuse paradigm, Knowledge Amalgamation (KA), for PLMs.
Without human annotations available, KA aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
Experimental results demonstrate that MUKA achieves substantial improvements over baselines on benchmark datasets.
- Score: 37.88287077119201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As many fine-tuned pre-trained language models~(PLMs) with promising
performance are generously released, investigating better ways to reuse these
models is vital as it can greatly reduce the retraining computational cost and
the potential environmental side-effects. In this paper, we explore a novel
model reuse paradigm, Knowledge Amalgamation~(KA) for PLMs. Without human
annotations available, KA aims to merge the knowledge from different
teacher-PLMs, each of which specializes in a different classification problem,
into a versatile student model. To achieve this, we design a Model
Uncertainty-aware Knowledge Amalgamation~(MUKA) framework, which identifies
the potentially adequate teacher using Monte-Carlo Dropout to approximate the
golden supervision to guide the student. Experimental results demonstrate that
MUKA achieves substantial improvements over baselines on benchmark datasets.
Further analysis shows that MUKA can generalize well under several complicated
settings with multiple teacher models, heterogeneous teachers, and even
cross-dataset teachers.
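The teacher-selection mechanism described in the abstract can be illustrated with a short sketch: run each teacher several times with dropout kept active, treat the mean predictive entropy as that teacher's uncertainty on an unlabeled example, and let the least-uncertain teacher supply the supervision. The snippet below is a minimal PyTorch illustration of this idea, assuming generic classifiers that map inputs to logits; the function names and the entropy-based scoring are illustrative and not taken from the MUKA implementation.
```python
# Hedged sketch: Monte-Carlo Dropout to estimate each teacher's uncertainty on an
# unlabeled input, then pick the least-uncertain teacher as the one likely to be
# "adequate" for that example. Names are illustrative, not from the MUKA codebase.
import torch
import torch.nn.functional as F

def mc_dropout_uncertainty(model, x, n_samples=10):
    """Mean predictive entropy over stochastic forward passes with dropout kept on."""
    model.train()  # keep dropout layers active at inference time
    entropies = []
    with torch.no_grad():
        for _ in range(n_samples):
            probs = F.softmax(model(x), dim=-1)
            entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum(dim=-1))
    return torch.stack(entropies).mean(dim=0)  # one uncertainty score per example

def select_teacher(teachers, x):
    """Index of the least-uncertain (most likely adequate) teacher for each example."""
    scores = torch.stack([mc_dropout_uncertainty(t, x) for t in teachers])  # [num_teachers, batch]
    return scores.argmin(dim=0)
```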
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD matches or exceeds the performance of leading methods across various model architectures and sizes while reducing training time by up to a factor of four.
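OKD's online teacher modules are not detailed in this summary, but the student-mimics-teacher objective it builds on is the standard one. A minimal sketch of that generic distillation loss, with illustrative temperature and weighting defaults rather than values from the paper:
```python
# Hedged sketch of the generic KD objective: the student mimics the teacher's
# softened output distribution, mixed with the usual hard-label loss.
# Temperature and alpha are illustrative defaults, not the OKD paper's settings.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard
```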
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models [7.632217365130212]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various machine learning (ML) tasks.
These models can produce hallucinations, particularly in domains with incomplete knowledge.
We introduce DualChecker, an innovative framework designed to mitigate hallucinations and improve the performance of both teacher and student models.
arXiv Detail & Related papers (2024-08-22T12:04:04Z) - Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning [26.393644289860084]
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification.
We propose an adversarial complementary representation learning (ACoRL) framework that steers newly trained models away from knowledge already acquired by existing models, so that each model learns complementary representations.
arXiv Detail & Related papers (2024-04-24T07:47:55Z) - Curriculum-scheduled Knowledge Distillation from Multiple Pre-trained Teachers for Multi-domain Sequential Recommendation [102.91236882045021]
It is essential to explore how to use different pre-trained recommendation models efficiently in real-world systems.
We propose a novel curriculum-scheduled knowledge distillation from multiple pre-trained teachers for multi-domain sequential recommendation.
CKD-MDSR takes full advantage of different pre-trained recommendation models (PRMs) as multiple teachers to boost a small student recommendation model; a generic sketch of multi-teacher distillation with a curriculum weight follows below.
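The exact curriculum schedule of CKD-MDSR is not given in this summary; the sketch below only illustrates the general shape of curriculum-scheduled multi-teacher distillation, with an assumed linear re-weighting of teachers over training progress and illustrative function names.
```python
# Hedged sketch: distilling from several pre-trained teachers with a simple
# curriculum weight that shifts from the first teacher toward the last as
# training progresses. The schedule and teacher ordering are assumptions for
# illustration, not the paper's actual curriculum.
import torch
import torch.nn.functional as F

def multi_teacher_kd(student_logits, teacher_logits_list, progress, temperature=2.0):
    """progress in [0, 1]: fraction of training completed.
    Teachers are assumed ordered from easiest to hardest to imitate."""
    n = len(teacher_logits_list)
    weights = torch.linspace(1.0 - progress, progress, steps=n).clamp_min(0.0)
    weights = weights / weights.sum().clamp_min(1e-8)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        soft = F.softmax(t_logits / temperature, dim=-1)
        loss = loss + w * F.kl_div(log_student, soft, reduction="batchmean") * temperature ** 2
    return loss
```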
arXiv Detail & Related papers (2024-01-01T15:57:15Z) - ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model
Reuse [59.500060790983994]
This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend.
ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with PTM, tuning target model with PTM, and PTM-based inference.
arXiv Detail & Related papers (2023-08-17T19:12:13Z) - KDSM: An uplift modeling framework based on knowledge distillation and
sample matching [2.036924568983982]
Uplift modeling aims to estimate the treatment effect on individuals.
Tree-based methods are adept at fitting increment and generalization, while neural-network-based models excel at predicting absolute value and precision.
In this paper, we propose an uplift modeling framework based on Knowledge Distillation and Sample Matching (KDSM).
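KDSM's distillation and sample-matching components are not reproducible from this summary alone; the sketch below only shows the basic individual-treatment-effect estimate that uplift models refine, using a simple two-model (T-learner) baseline with scikit-learn classifiers as a stand-in.
```python
# Hedged sketch: a plain two-model (T-learner) uplift baseline, not the KDSM
# framework itself. The uplift for each individual is estimated as the difference
# between predicted outcome probabilities under treatment and under control.
# Inputs are assumed to be NumPy arrays; `treated` is a 0/1 indicator.
from sklearn.ensemble import GradientBoostingClassifier

def t_learner_uplift(X_train, treated, y_train, X_test):
    model_t = GradientBoostingClassifier().fit(X_train[treated == 1], y_train[treated == 1])
    model_c = GradientBoostingClassifier().fit(X_train[treated == 0], y_train[treated == 0])
    return model_t.predict_proba(X_test)[:, 1] - model_c.predict_proba(X_test)[:, 1]
```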
arXiv Detail & Related papers (2023-03-06T09:15:28Z) - From Mimicking to Integrating: Knowledge Integration for Pre-Trained
Language Models [55.137869702763375]
This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI).
KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student.
arXiv Detail & Related papers (2022-10-11T07:59:08Z) - Deep Learning Models for Knowledge Tracing: Review and Empirical
Evaluation [2.423547527175807]
We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets.
The evaluated DLKT models have been reimplemented to assess the replicability of previously reported results.
arXiv Detail & Related papers (2021-12-30T14:19:27Z) - Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
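For context, the penalty-based first stage this summary refers to can be illustrated with a beta-VAE-style objective that up-weights the KL term on the latent posterior; the sketch below is a generic instance of that family, not the paper's specific penalty or its second-stage generative model.
```python
# Hedged sketch: a beta-VAE-style loss, one common way of "penalizing the
# posterior to encourage independent latent factors". beta > 1 strengthens the
# penalty at the cost of reconstruction quality, which is what motivates a
# second modeling stage to repair the reconstruction.
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_recon, x, reduction="sum")                    # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())       # KL(q(z|x) || N(0, I))
    return recon + beta * kl
```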
arXiv Detail & Related papers (2020-10-25T18:51:15Z)