From Mimicking to Integrating: Knowledge Integration for Pre-Trained
Language Models
- URL: http://arxiv.org/abs/2210.05230v1
- Date: Tue, 11 Oct 2022 07:59:08 GMT
- Title: From Mimicking to Integrating: Knowledge Integration for Pre-Trained
Language Models
- Authors: Lei Li, Yankai Lin, Xuancheng Ren, Guangxiang Zhao, Peng Li, Jie Zhou,
Xu Sun
- Abstract summary: This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI).
KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
We then design a Model Uncertainty--aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student.
- Score: 55.137869702763375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Investigating better ways to reuse the released pre-trained language models
(PLMs) can significantly reduce the computational cost and the potential
environmental side-effects. This paper explores a novel PLM reuse paradigm,
Knowledge Integration (KI). Without human annotations available, KI aims to
merge the knowledge from different teacher-PLMs, each of which specializes in a
different classification problem, into a versatile student model. To achieve
this, we first derive the correlation between virtual golden supervision and
teacher predictions. We then design a Model Uncertainty--aware Knowledge
Integration (MUKI) framework to recover the golden supervision for the student.
Specifically, MUKI adopts Monte-Carlo Dropout to estimate model uncertainty for
the supervision integration. An instance-wise re-weighting mechanism based on
the margin of uncertainty scores is further incorporated to handle
potentially conflicting supervision from teachers. Experimental results
demonstrate that MUKI achieves substantial improvements over baselines on
benchmark datasets. Further analysis shows that MUKI generalizes well to
merging teacher models with heterogeneous architectures, and even teachers
that specialize in cross-lingual datasets.
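The abstract names two mechanisms, Monte-Carlo Dropout for uncertainty estimation and margin-based instance re-weighting, without giving their formulas. The following is a minimal sketch of how such a pipeline might look, not the authors' implementation. Everything in it is an assumption made for illustration: the toy linear "teachers", dropout applied to input features, normalized predictive entropy as the uncertainty score, confidence-proportional mixing of teacher predictions, and the function names `mc_dropout_predict` and `integrate`.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_predict(weights, features, n_passes=20, drop_prob=0.1, rng=None):
    """Run several stochastic forward passes of a toy linear classifier.

    Dropout stays active at inference time (the core idea of MC Dropout).
    Returns the mean class probabilities and an uncertainty score in [0, 1]
    (predictive entropy divided by log(n_classes); an assumed choice).
    """
    rng = rng or random.Random(0)
    n_classes = len(weights)
    keep = 1.0 - drop_prob
    mean_probs = [0.0] * n_classes
    for _ in range(n_passes):
        # Inverted dropout: zero a feature with probability drop_prob,
        # rescale the survivors by 1 / keep.
        dropped = [f / keep if rng.random() < keep else 0.0 for f in features]
        logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
        probs = softmax(logits)
        mean_probs = [a + p / n_passes for a, p in zip(mean_probs, probs)]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, min(1.0, entropy / math.log(n_classes))


def integrate(teacher_outputs):
    """Merge teacher predictions into one target over the union label space.

    teacher_outputs: list of (mean_probs, uncertainty), at least two teachers.
    Each teacher's block of the target is scaled by its share of the total
    confidence (1 - uncertainty). The instance weight is the margin between
    the two lowest uncertainty scores: a small margin suggests the teachers
    conflict, so the instance is down-weighted (assumed scheme).
    """
    confidences = [1.0 - u for _, u in teacher_outputs]
    total = sum(confidences)
    target = []
    for (probs, _), conf in zip(teacher_outputs, confidences):
        share = conf / total if total > 0 else 1.0 / len(teacher_outputs)
        target.extend(p * share for p in probs)
    ranked = sorted(u for _, u in teacher_outputs)
    margin = ranked[1] - ranked[0]
    return target, margin


# Hypothetical teachers with disjoint label sets (2 and 3 classes).
rng = random.Random(0)
teacher_a = [[2.0, -1.0], [-2.0, 1.0]]               # confident on this input
teacher_b = [[0.1, 0.0], [0.0, 0.1], [0.05, 0.05]]   # near-uniform predictions
x = [1.5, -0.5]

outputs = [
    mc_dropout_predict(teacher_a, x, rng=rng),
    mc_dropout_predict(teacher_b, x, rng=rng),
]
target, weight = integrate(outputs)
```

Here `target` is a soft label over all five classes that a student could be trained against, and `weight` would scale that instance's loss. A real system would replace the linear teachers with fine-tuned PLMs and apply dropout inside the network rather than on the inputs.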
Related papers
- On Discriminative Probabilistic Modeling for Self-Supervised Representation Learning [85.75164588939185]
We study the discriminative probabilistic modeling problem on a continuous domain for (multimodal) self-supervised representation learning.
We conduct generalization error analysis to reveal the limitation of current InfoNCE-based contrastive loss for self-supervised representation learning.
(2024-10-11T18:02:46Z)
- Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning [26.393644289860084]
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification.
We propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge.
(2024-04-24T07:47:55Z)
- Enhancing Fairness and Performance in Machine Learning Models: A Multi-Task Learning Approach with Monte-Carlo Dropout and Pareto Optimality [1.5498930424110338]
This study introduces an approach to mitigate bias in machine learning by leveraging model uncertainty.
Our approach utilizes a multi-task learning (MTL) framework combined with Monte Carlo (MC) Dropout to assess and mitigate uncertainty in predictions related to protected labels.
(2024-04-12T04:17:50Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
(2024-03-18T08:00:23Z)
- A Bayesian Unification of Self-Supervised Clustering and Energy-Based Models [11.007541337967027]
We perform a Bayesian analysis of state-of-the-art self-supervised learning objectives.
We show that our objective function outperforms existing self-supervised learning strategies.
We also demonstrate that GEDI can be integrated into a neuro-symbolic framework.
(2023-12-30T04:46:16Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
(2023-01-27T22:04:37Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
(2022-11-09T18:58:29Z)
- Model Uncertainty-Aware Knowledge Amalgamation for Pre-Trained Language Models [37.88287077119201]
We propose a novel model reuse paradigm, Knowledge Amalgamation (KA), for PLMs.
Without human annotations available, KA aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
Experimental results demonstrate that MUKA achieves substantial improvements over baselines on benchmark datasets.
(2021-12-14T12:26:24Z)
- Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
(2021-01-06T17:36:26Z)