Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation
- URL: http://arxiv.org/abs/2101.08106v1
- Date: Wed, 20 Jan 2021 13:07:39 GMT
- Title: Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation
- Authors: Lingyun Feng, Minghui Qiu, Yaliang Li, Hai-Tao Zheng, Ying Shen
- Abstract summary: We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
- Score: 55.34995029082051
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although pre-trained language models such as BERT have achieved appealing
performance in a wide range of natural language processing tasks, they are
computationally expensive to deploy in real-time applications. A typical
method is to adopt knowledge distillation to compress these large pre-trained
models (teacher models) to small student models. However, for a target domain
with scarce training data, the teacher can hardly pass useful knowledge to the
student, which yields performance degradation for the student models. To tackle
this problem, we propose a method to learn to augment for data-scarce domain
BERT knowledge distillation, by learning a cross-domain manipulation scheme
that automatically augments the target with the help of resource-rich source
domains. Specifically, the proposed method generates samples acquired from a
stationary distribution near the target data and adopts a reinforced selector
to automatically refine the augmentation strategy according to the performance
of the student. Extensive experiments demonstrate that the proposed method
significantly outperforms state-of-the-art baselines on four different tasks,
and for the data-scarce domains, the compressed student models even perform
better than the original large teacher model, with far fewer parameters (only
${\sim}13.3\%$) when only a few labeled examples are available.
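The abstract outlines a training loop in which augmented source-domain samples feed a standard distillation loss, while a reinforced selector learns to keep or drop candidates according to the student's performance. The sketch below illustrates that loop shape in PyTorch; the tiny encoder and selector modules, the hyperparameters, and the accuracy-based reward are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): knowledge distillation with a
# reinforced selector that keeps or drops augmented source-domain samples,
# rewarded by the student's performance on the scarce target dev set.
# Module sizes, names, and the reward definition are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a (large teacher / small student) BERT-style classifier."""
    def __init__(self, hidden, num_labels):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(768, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.backbone(x))

class Selector(nn.Module):
    """Policy network: probability of keeping each augmented sample."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, feats):
        return torch.sigmoid(self.score(feats)).squeeze(-1)

def distill_step(teacher, student, x, y, temperature=2.0, alpha=0.5):
    """Standard KD loss: soft-label KL against the teacher + hard-label CE."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    soft = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(s_logits, y)
    return alpha * soft + (1 - alpha) * hard

# Toy tensors standing in for augmented source samples and target dev data.
torch.manual_seed(0)
num_labels = 2
teacher, student = TinyEncoder(512, num_labels), TinyEncoder(64, num_labels)
selector = Selector()
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_pi = torch.optim.Adam(selector.parameters(), lr=1e-3)

aug_x = torch.randn(32, 768)
aug_y = torch.randint(0, num_labels, (32,))
dev_x, dev_y = torch.randn(16, 768), torch.randint(0, num_labels, (16,))

baseline = 0.0
for step in range(3):
    # 1) Selector samples a keep/drop decision per augmented example.
    keep_prob = selector(aug_x)
    keep = torch.bernoulli(keep_prob.detach())
    idx = keep.bool()
    if idx.any():
        # 2) Distill the student on the kept augmented samples.
        loss = distill_step(teacher, student, aug_x[idx], aug_y[idx])
        opt_s.zero_grad()
        loss.backward()
        opt_s.step()

    # 3) Reward = student accuracy on the scarce target dev set.
    with torch.no_grad():
        reward = (student(dev_x).argmax(-1) == dev_y).float().mean().item()

    # 4) REINFORCE update of the selector with a moving-average baseline.
    log_prob = (torch.log(keep_prob + 1e-8) * keep
                + torch.log(1 - keep_prob + 1e-8) * (1 - keep))
    pi_loss = -(reward - baseline) * log_prob.mean()
    opt_pi.zero_grad()
    pi_loss.backward()
    opt_pi.step()
    baseline = 0.9 * baseline + 0.1 * reward
```

The moving-average baseline is one common way to reduce the variance of the REINFORCE update; the paper may use a different reward shaping or selector architecture.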
Related papers
- Faithful Label-free Knowledge Distillation [8.572967695281054]
This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM)
It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
arXiv Detail & Related papers (2024-11-22T01:48:44Z) - Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data [54.934578742209716]
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets.
LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student.
Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
arXiv Detail & Related papers (2024-11-12T18:57:59Z) - Retrieval Instead of Fine-tuning: A Retrieval-based Parameter Ensemble for Zero-shot Learning [22.748835458594744]
We introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of Low-Rank Adaptations (LoRAs).
RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning.
RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data.
arXiv Detail & Related papers (2024-10-13T16:28:38Z) - Knowledge Distillation for Road Detection based on cross-model Semi-Supervised Learning [17.690698736544626]
We propose an integrated approach that combines knowledge distillation and semi-supervised learning methods.
This hybrid approach leverages the robust capabilities of large models to effectively utilise large amounts of unlabelled data.
The proposed semi-supervised learning-based knowledge distillation (SSLKD) approach demonstrates a notable improvement in the performance of the student model.
arXiv Detail & Related papers (2024-02-07T22:50:47Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods of BERT -- a pre-trained Transformer based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z) - DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z) - Dual-Teacher: Integrating Intra-domain and Inter-domain Teachers for Annotation-efficient Cardiac Segmentation [65.81546955181781]
We propose a novel semi-supervised domain adaptation approach, namely Dual-Teacher.
The student model learns knowledge from unlabeled target data and labeled source data via two teacher models.
We demonstrate that our approach is able to concurrently utilize unlabeled data and cross-modality data with superior performance.
arXiv Detail & Related papers (2020-07-13T10:00:44Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the created dataset manageable, we apply a dataset distillation strategy that compresses it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Data Techniques For Online End-to-end Speech Recognition [17.621967685914587]
Practitioners often need to build ASR systems for new use cases in a short amount of time, given limited in-domain data.
While recently developed end-to-end methods largely simplify the modeling pipelines, they still suffer from the data sparsity issue.
We explore a few simple-to-implement techniques for building online ASR systems in an end-to-end fashion, with a small amount of transcribed data in the target domain.
arXiv Detail & Related papers (2020-01-24T22:59:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all information) and is not responsible for any consequences arising from its use.