Related papers: Model Uncertainty-Aware Knowledge Amalgamation for Pre-Trained Language Models

Model Uncertainty-Aware Knowledge Amalgamation for Pre-Trained Language Models

URL: http://arxiv.org/abs/2112.07327v1
Date: Tue, 14 Dec 2021 12:26:24 GMT
Title: Model Uncertainty-Aware Knowledge Amalgamation for Pre-Trained Language Models
Authors: Lei Li, Yankai Lin, Xuancheng Ren, Guangxiang Zhao, Peng Li, Jie Zhou, Xu Sun
Abstract summary: We propose a novel model reuse paradigm, Knowledge Amalgamation(KA) for PLMs. Without human annotations available, KA aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model. Experimental results demonstrate that MUKA achieves substantial improvements over baselines on benchmark datasets.
Score: 37.88287077119201
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As many fine-tuned pre-trained language models~(PLMs) with promising performance are generously released, investigating better ways to reuse these models is vital as it can greatly reduce the retraining computational cost and the potential environmental side-effects. In this paper, we explore a novel model reuse paradigm, Knowledge Amalgamation~(KA) for PLMs. Without human annotations available, KA aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model. The achieve this, we design a Model Uncertainty--aware Knowledge Amalgamation~(MUKA) framework, which identifies the potential adequate teacher using Monte-Carlo Dropout for approximating the golden supervision to guide the student. Experimental results demonstrate that MUKA achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKA can generalize well under several complicate settings with multiple teacher models, heterogeneous teachers, and even cross-dataset teachers.

Related papers

CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation [57.91828170220308]
We propose a knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models. Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies.
arXiv Detail & Related papers (2025-03-23T23:53:08Z)
RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models [60.596005921295806]
Agglomerative models have emerged as a powerful approach to training vision foundation models. We identify critical challenges including resolution mode shifts, teacher imbalance, idiosyncratic teacher artifacts, and an excessive number of output tokens. We propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions.
arXiv Detail & Related papers (2024-12-10T17:06:41Z)
Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them. This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model. OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models [7.632217365130212]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various machine learning (ML) tasks. These models can produce hallucinations, particularly in domains with incomplete knowledge. We introduce DualChecker, an innovative framework designed to mitigate hallucinations and improve the performance of both teacher and student models.
arXiv Detail & Related papers (2024-08-22T12:04:04Z)
Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning [26.393644289860084]
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification. We propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge.
arXiv Detail & Related papers (2024-04-24T07:47:55Z)
Curriculum-scheduled Knowledge Distillation from Multiple Pre-trained Teachers for Multi-domain Sequential Recommendation [102.91236882045021]
It is essential to explore how to use different pre-trained recommendation models efficiently in real-world systems. We propose a novel curriculum-scheduled knowledge distillation from multiple pre-trained teachers for multi-domain sequential recommendation. CKD-MDSR takes full advantages of different PRMs as multiple teacher models to boost a small student recommendation model.
arXiv Detail & Related papers (2024-01-01T15:57:15Z)
ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse [59.500060790983994]
This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend. ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with PTM, tuning target model with PTM, and PTM-based inference.
arXiv Detail & Related papers (2023-08-17T19:12:13Z)
KDSM: An uplift modeling framework based on knowledge distillation and sample matching [2.036924568983982]
Uplift modeling aims to estimate the treatment effect on individuals. Tree-based methods are adept at fitting increment and generalization, while neural-network-based models excel at predicting absolute value and precision. In this paper, we proposed an uplift modeling framework based on Knowledge Distillation and Sample Matching (KDSM)
arXiv Detail & Related papers (2023-03-06T09:15:28Z)
From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models [55.137869702763375]
This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI) KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model. We then design a Model Uncertainty--aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student.
arXiv Detail & Related papers (2022-10-11T07:59:08Z)
Deep Learning Models for Knowledge Tracing: Review and Empirical Evaluation [2.423547527175807]
We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets. The evaluated DLKT models have been reimplemented for assessing and replicability of previously reported results.
arXiv Detail & Related papers (2021-12-30T14:19:27Z)
Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the ( aggregate) posterior to encourage statistical independence of the latent factors. We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method. Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.