Improving Question Answering Performance Using Knowledge Distillation
and Active Learning
- URL: http://arxiv.org/abs/2109.12662v1
- Date: Sun, 26 Sep 2021 17:49:54 GMT
- Title: Improving Question Answering Performance Using Knowledge Distillation
and Active Learning
- Authors: Yasaman Boreshban, Seyed Morteza Mirbostani, Gholamreza Ghassem-Sani,
Seyed Abolghasem Mirroshandel, Shahin Amiriparian
- Abstract summary: We propose a novel knowledge distillation (KD) approach to reduce the parameter count and model complexity of a pre-trained BERT system.
We demonstrate that our model achieves the performance of a 6-layer TinyBERT and DistilBERT, whilst using only 2% of their total parameters.
- Score: 6.380750645368325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contemporary question answering (QA) systems, including
transformer-based architectures, suffer from increasing computational and model
complexity, which renders them inefficient for real-world applications with
limited resources. Furthermore, training or even fine-tuning such models
requires a vast amount of labeled data, which is often not available for the
task at hand. In this manuscript, we conduct a comprehensive analysis of these
challenges and introduce suitable countermeasures. We propose a novel knowledge
distillation (KD) approach to reduce the parameter count and model complexity
of a pre-trained BERT system, and we utilize multiple active learning (AL)
strategies to drastically reduce annotation effort. In particular, we
demonstrate that our model achieves the performance of 6-layer TinyBERT and
DistilBERT, whilst using only 2% of their total parameters. Finally, by
integrating our AL approaches into the BERT framework, we show that
state-of-the-art results on the SQuAD dataset can be achieved using only 20% of
the training data.
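The abstract names two generic building blocks: a soft-target knowledge
distillation loss and an active learning query step. The sketch below is a
minimal, hedged illustration of those building blocks, not the authors'
implementation; the temperature T, mixing weight alpha, the least-confidence
selection criterion, and the Hugging Face-style model(**batch).logits
interface are all assumptions made for the sake of a runnable example.

```python
# Minimal sketch (not the paper's code) of soft-target knowledge distillation
# plus least-confidence active learning. T, alpha, and the data interfaces are
# illustrative assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a KL term on temperature-softened teacher/student distributions
    with the usual cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


@torch.no_grad()
def least_confidence_query(model, unlabeled_loader, budget):
    """Rank unlabeled examples by the student's top-class confidence and
    return the `budget` most uncertain indices for annotation.
    Assumes the loader yields (index_tensor, batch_dict) pairs and that
    model(**batch) returns an object with a .logits attribute."""
    scored = []
    for indices, batch in unlabeled_loader:
        probs = F.softmax(model(**batch).logits, dim=-1)
        confidence = probs.max(dim=-1).values      # probability of the top class
        scored.extend(zip(indices.tolist(), confidence.tolist()))
    scored.sort(key=lambda pair: pair[1])          # least confident first
    return [i for i, _ in scored[:budget]]
```

In a training loop, a loss of this form would replace the plain cross-entropy
on each labeled batch, while the query function would pick the next pool of
examples to send for annotation between AL rounds.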
Related papers
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z) - Learning to Maximize Mutual Information for Chain-of-Thought Distillation [13.660167848386806]
Distilling Step-by-Step (DSS) has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts.
However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction.
We propose a variational approach to solve this problem using a learning-based method.
arXiv Detail & Related papers (2024-03-05T22:21:45Z) - EsaCL: Efficient Continual Learning of Sparse Models [10.227171407348326]
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks.
We propose a new method for efficient continual learning of sparse models (EsaCL) that can automatically prune redundant parameters without adversely impacting the model's predictive power.
arXiv Detail & Related papers (2024-01-11T04:59:44Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
The original FLOP count can be reduced by up to 40% with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z) - GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision.
We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z) - ActKnow: Active External Knowledge Infusion Learning for Question
Answering in Low Data Regime [7.562843347215286]
We propose a technique that actively infuses knowledge from Knowledge Graphs (KGs) "on-demand" into learning for Question Answering (QA).
We show significant improvements on the ARC Challenge-set benchmark over purely text-based transformer models like RoBERTa in the low data regime.
arXiv Detail & Related papers (2021-12-17T10:39:41Z) - Incremental Learning for End-to-End Automatic Speech Recognition [41.297106772785206]
We propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR).
We design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions.
Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting.
arXiv Detail & Related papers (2020-05-11T08:18:08Z)