TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural
Language Processing
- URL: http://arxiv.org/abs/2002.12620v2
- Date: Tue, 28 Apr 2020 02:34:38 GMT
- Title: TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural
Language Processing
- Authors: Ziqing Yang, Yiming Cui, Zhipeng Chen, Wanxiang Che, Ting Liu, Shijin
Wang, Guoping Hu
- Abstract summary: We introduce TextBrewer, an open-source knowledge distillation toolkit for natural language processing.
It supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling.
As a case study, we use TextBrewer to distill BERT on several typical NLP tasks.
- Score: 64.87699383581885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce TextBrewer, an open-source knowledge distillation
toolkit designed for natural language processing. It works with different
neural network models and supports various kinds of supervised learning tasks,
such as text classification, reading comprehension, and sequence labeling.
TextBrewer provides a simple and uniform workflow that enables quick setting up
of distillation experiments with highly flexible configurations. It offers a
set of predefined distillation methods and can be extended with custom code. As
a case study, we use TextBrewer to distill BERT on several typical NLP tasks.
With simple configurations, we achieve results that are comparable to or even
better than those of publicly available distilled BERT models with similar numbers of
parameters. Our toolkit is available at: http://textbrewer.hfl-rc.com
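For readers new to the technique: the core quantity a distillation toolkit configures is a temperature-scaled soft-target loss combined with the ordinary hard-label loss. The PyTorch sketch below illustrates only that generic loss, not TextBrewer's API; the temperature and weighting values are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Generic knowledge-distillation loss (illustrative; not TextBrewer's API).

    Combines a KL term between temperature-softened teacher and student
    distributions with the standard cross-entropy on gold labels.
    """
    # Soft targets: KL divergence on temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # rescale so gradients keep a comparable magnitude

    # Hard targets: ordinary cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Tiny usage example with random tensors (batch of 8, 3 classes).
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```
In TextBrewer itself, such choices are meant to be expressed through its flexible configurations and predefined distillation methods rather than hand-written loss code.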
Related papers
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
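For context, the sketch below shows the textbook Byte Pair Encoding merge loop that a trainer like this batches and accelerates; it is not BatchBPE's actual implementation, and the toy corpus is invented for illustration.
```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.

    `words` maps whitespace-free words to their corpus frequencies.
    Returns the learned list of merges.
    """
    # Represent each word as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy corpus: word -> frequency.
print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))
```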
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- Lightweight Model Pre-training via Language Guided Knowledge Distillation [28.693835349747598]
This paper studies the problem of pre-training for small models, which is essential for deployment on many mobile devices.
We propose a new method, Language-Guided Distillation (LGD), which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and the student.
Experimental results show that the lightweight model distilled with the proposed LGD method achieves state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation.
arXiv Detail & Related papers (2024-06-17T16:07:19Z)
- Data-Free Distillation of Language Model by Text-to-Text Transfer [22.830164917398623]
Data-Free Knowledge Distillation (DFKD) plays a vital role in compressing models when the original training data is unavailable.
We propose a novel DFKD framework, DFKD-T^3, in which a pretrained generative language model also serves as a controllable data generator for model compression.
Our method can boost the distillation performance in various downstream tasks such as sentiment analysis, linguistic acceptability, and information extraction.
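As a rough illustration of the data-free setting (a sketch of the general idea, not the DFKD-T^3 method): a pretrained generative language model synthesizes unlabeled text, the teacher provides soft predictions for it, and the student is trained on those soft targets. The checkpoints, prompt, and hyperparameters below are placeholders.
```python
import torch
import torch.nn.functional as F
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

# Placeholder checkpoints; in practice the teacher would be a task-fine-tuned model.
gen_tok = AutoTokenizer.from_pretrained("gpt2")
generator = AutoModelForCausalLM.from_pretrained("gpt2")
cls_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
teacher = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").eval()
student = AutoModelForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-4_H-256_A-4")  # a small BERT stands in for the student

optimizer = torch.optim.AdamW(student.parameters(), lr=3e-5)

for step in range(10):  # a few synthetic batches, purely for illustration
    # 1) Synthesize "training" text with the generative language model.
    prompt = gen_tok("The movie was", return_tensors="pt")
    generated = generator.generate(**prompt, do_sample=True, max_length=32,
                                   num_return_sequences=4,
                                   pad_token_id=gen_tok.eos_token_id)
    texts = gen_tok.batch_decode(generated, skip_special_tokens=True)

    # 2) The teacher labels the synthetic texts with soft predictions.
    batch = cls_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(**batch).logits, dim=-1)

    # 3) The student is trained to match the teacher's distribution.
    student_log_probs = F.log_softmax(student(**batch).logits, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```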
arXiv Detail & Related papers (2023-11-03T03:31:47Z)
- SpikeBERT: A Language Spikformer Learned from BERT with Knowledge Distillation [31.777019330200705]
Spiking neural networks (SNNs) offer a promising avenue to implement deep neural networks in a more energy-efficient way.
We improve a recently proposed spiking Transformer (i.e., Spikformer) so that it can process language tasks.
We show that models trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and even achieve results comparable to BERT on text classification tasks in both English and Chinese.
arXiv Detail & Related papers (2023-08-29T08:41:16Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
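A minimal sketch of the quantization idea (not the LQAE implementation): image features are snapped to their nearest neighbors in a frozen language-model embedding table, so each image becomes a sequence of ordinary text tokens. The image-encoder output and the embedding table below are random stand-ins.
```python
import torch
import torch.nn as nn

class TokenQuantizer(nn.Module):
    """Snap continuous image features onto the nearest rows of a frozen
    text-token embedding table (an illustrative take on language quantization)."""

    def __init__(self, embedding_table: torch.Tensor):
        super().__init__()
        # Frozen language-model embedding table of shape (vocab_size, dim).
        self.register_buffer("codebook", embedding_table)

    def forward(self, features: torch.Tensor):
        batch, seq, dim = features.shape
        flat = features.reshape(-1, dim)                      # (batch*seq, dim)
        dists = torch.cdist(flat, self.codebook)              # (batch*seq, vocab)
        token_ids = dists.argmin(dim=-1).reshape(batch, seq)  # discrete "words"
        quantized = self.codebook[token_ids]                  # snapped features
        # Straight-through estimator so gradients still reach the image encoder.
        quantized = features + (quantized - features).detach()
        return token_ids, quantized

# Usage with a made-up image-encoder output and a random stand-in embedding table.
vocab_size, dim = 30522, 64
codebook = torch.randn(vocab_size, dim)
image_features = torch.randn(2, 16, dim)   # e.g. 16 patches per image
token_ids, quantized = TokenQuantizer(codebook)(image_features)
print(token_ids.shape, quantized.shape)    # (2, 16) and (2, 16, 64)
```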
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, in which a wav2vec model is fine-tuned to generate a sequence of audio embeddings understood by the language model.
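To illustrate the prompting mechanism in general terms (a sketch under assumed checkpoints, not the WavPrompt code): speech features are projected into the language model's embedding space and prepended to the text embeddings of a frozen causal LM.
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, Wav2Vec2Model

# Frozen language model and a wav2vec 2.0 speech encoder (placeholder checkpoints).
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
speech_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
for p in lm.parameters():
    p.requires_grad = False  # the LM stays frozen; only the encoder/projection train

# Trainable projection from speech-feature space into the LM embedding space.
proj = nn.Linear(speech_enc.config.hidden_size, lm.config.n_embd)

# Fake 1-second waveform at 16 kHz standing in for a real utterance.
waveform = torch.randn(1, 16000)
audio_feats = speech_enc(waveform).last_hidden_state   # (1, frames, 768)
audio_prompt = proj(audio_feats)                        # (1, frames, n_embd)

# Text continuation is embedded with the LM's own embedding table.
text = tok(" The speaker is asking about", return_tensors="pt")
text_embeds = lm.get_input_embeddings()(text.input_ids)

# Prepend the audio embeddings to the text embeddings and run the frozen LM.
inputs_embeds = torch.cat([audio_prompt, text_embeds], dim=1)
out = lm(inputs_embeds=inputs_embeds)
print(out.logits.shape)  # (1, frames + text_len, vocab_size)
```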
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks.
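As a generic illustration of that shared-encoder, multi-task layout (not N-LTP's actual architecture): one encoder feeds several lightweight task heads. The tasks, vocabulary size, and dimensions below are placeholders.
```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """One shared encoder with per-task heads, the basic multi-task layout."""

    def __init__(self, vocab_size=21128, dim=256, num_tags=20, num_classes=5):
        super().__init__()
        # Shared encoder (a tiny Transformer stands in for the pretrained model).
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Task-specific heads, e.g. sequence labeling and sentence classification.
        self.tagging_head = nn.Linear(dim, num_tags)             # per-token predictions
        self.classification_head = nn.Linear(dim, num_classes)   # per-sentence prediction

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))             # (batch, seq, dim)
        return {
            "tags": self.tagging_head(hidden),                   # (batch, seq, num_tags)
            "label": self.classification_head(hidden.mean(dim=1)),  # (batch, num_classes)
        }

model = SharedEncoderMultiTask()
print({k: v.shape for k, v in model(torch.randint(0, 21128, (2, 12))).items()})
```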
arXiv Detail & Related papers (2020-09-24T11:45:39Z)
- Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP [24.981991538150584]
MaChAmp is a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings.
The benefits of MaChAmp are its flexible configuration options and its support for a variety of natural language processing tasks in a uniform toolkit.
arXiv Detail & Related papers (2020-05-29T16:54:50Z)
- Distilling Knowledge from Pre-trained Language Models via Text Smoothing [9.105324638015366]
We propose a new method for BERT distillation: asking the teacher to generate smoothed word IDs, rather than labels, for teaching the student model in knowledge distillation.
Practically, we use the softmax prediction of the Masked Language Model (MLM) in BERT to generate word distributions for given texts and smooth the input texts using the predicted soft word IDs.
We assume that both the smoothed labels and the smoothed texts can implicitly augment the input corpus, while text smoothing is intuitively more efficient since it can generate more instances in a single forward pass.
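A rough sketch of the text-smoothing idea (not the authors' code): BERT's MLM head yields, for every position, a softmax distribution over the vocabulary, and these soft word distributions stand in for the one-hot input words when teaching the student. The temperature below is an assumption.
```python
import torch
import torch.nn.functional as F
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def smooth_text(sentence, temperature=1.0):
    """Return per-position soft word distributions predicted by BERT's MLM head."""
    batch = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**batch).logits            # (1, seq_len, vocab_size)
    # Each row is a smoothed "soft word" distribution over the whole vocabulary.
    return F.softmax(logits / temperature, dim=-1), batch.input_ids

soft_words, input_ids = smooth_text("TextBrewer distills large language models.")
print(soft_words.shape)  # (1, seq_len, 30522)

# The student can then consume soft_words @ student_embedding_matrix instead of
# looking up the original hard token IDs, effectively augmenting the corpus.
```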
arXiv Detail & Related papers (2020-05-08T04:34:00Z)
- Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named the BERT-fused model, in which we first use BERT to extract representations for an input sequence and then fuse them with each layer of the NMT model through attention mechanisms.
We conduct experiments on supervised (including sentence-level and document-level translation), semi-supervised, and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
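A minimal sketch of the fusion idea (not the paper's exact architecture): each NMT encoder layer attends over externally extracted BERT representations through an extra cross-attention block. The dimensions and the random inputs below are placeholders.
```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """Transformer encoder layer with an extra attention over BERT features."""

    def __init__(self, d_model=512, bert_dim=768, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True,
                                               kdim=bert_dim, vdim=bert_dim)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(),
                                 nn.Linear(2048, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, bert_feats):
        # Ordinary self-attention over the NMT encoder states.
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        # Extra attention that draws information from the BERT representations.
        x = self.norm2(x + self.bert_attn(x, bert_feats, bert_feats)[0])
        return self.norm3(x + self.ffn(x))

layer = BertFusedEncoderLayer()
src = torch.randn(2, 10, 512)          # NMT encoder states
bert_feats = torch.randn(2, 10, 768)   # representations extracted by BERT
print(layer(src, bert_feats).shape)    # (2, 10, 512)
```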
arXiv Detail & Related papers (2020-02-17T08:13:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.