Generation-Distillation for Efficient Natural Language Understanding in
Low-Data Settings
- URL: http://arxiv.org/abs/2002.00733v1
- Date: Sat, 25 Jan 2020 08:20:46 GMT
- Title: Generation-Distillation for Efficient Natural Language Understanding in
Low-Data Settings
- Authors: Luke Melas-Kyriazi, George Han, Celine Liang
- Abstract summary: Transfer learning with large-scale language models (LMs) has led to dramatic performance improvements across a broad range of natural language understanding tasks.
The size and memory footprint of these large LMs make them difficult to deploy in many scenarios.
Recent research points to knowledge distillation as a potential solution, showing that when training data for a given task is abundant, it is possible to distill a large (teacher) LM into a small task-specific (student) network with minimal loss of performance.
- Score: 5.929956715430167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over the past year, the emergence of transfer learning with large-scale
language models (LMs) has led to dramatic performance improvements across a
broad range of natural language understanding tasks. However, the size and
memory footprint of these large LMs make them difficult to deploy in many
scenarios (e.g. on mobile phones). Recent research points to knowledge
distillation as a potential solution, showing that when training data for a
given task is abundant, it is possible to distill a large (teacher) LM into a
small task-specific (student) network with minimal loss of performance.
However, when such data is scarce, there remains a significant performance gap
between large pretrained LMs and smaller task-specific models, even when
training via distillation. In this paper, we bridge this gap with a novel
training approach, called generation-distillation, that leverages large
finetuned LMs in two ways: (1) to generate new (unlabeled) training examples,
and (2) to distill their knowledge into a small network using these examples.
Across three low-resource text classification datasets, we achieve comparable
performance to BERT while using 300x fewer parameters, and we outperform prior
approaches to distillation for text classification while using 3x fewer
parameters.
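The abstract summarizes the two stages of generation-distillation but does not spell out the training loop. Below is a minimal PyTorch-style sketch of what one step of such a loop could look like; the generator, teacher, and student objects, their sample/encode helpers, the batch size, and the temperature value are hypothetical placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

# All components below are illustrative assumptions, not the paper's released code:
#   generator: a finetuned LM used to sample new in-domain texts
#   teacher:   a large finetuned LM classifier that produces logits over the labels
#   student:   a small task-specific classifier (e.g., a lightweight CNN or LSTM)
T = 2.0  # distillation temperature (illustrative value)


def generation_distillation_step(generator, teacher, student, optimizer, batch_size=32):
    # (1) Generation: sample new (unlabeled) training examples from the finetuned LM.
    texts = generator.sample(batch_size)                 # assumed helper method

    # (2) Distillation: the teacher labels the generated texts with soft
    #     distributions, and the student is trained to match them.
    with torch.no_grad():
        teacher_logits = teacher(teacher.encode(texts))  # assumed encode helper
    student_logits = student(student.encode(texts))      # assumed encode helper

    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence between the two distributions, scaled by T^2 as is
    # conventional in knowledge distillation.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's setting the teacher is a large finetuned LM and the student is roughly 300x smaller; the sketch only illustrates how generated examples and teacher soft labels fit together in the distillation loss.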
Related papers
- TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the number of seed samples available for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction.
With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher with only hundreds of annotated examples.
arXiv Detail & Related papers (2023-10-30T14:54:15Z)
- Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models [15.991777903345575]
Large language models can generalize to novel downstream tasks with relatively few labeled examples.
Alternatively, smaller models can solve specific tasks if fine-tuned with enough labeled examples.
We study generating synthetic fine-tuning data with fine-tuned teacher LLMs to improve the downstream performance of much smaller models.
arXiv Detail & Related papers (2023-10-02T11:49:05Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM [31.25193238045053]
We introduce a novel method, namely GenCo, which leverages the strong generative power of large language models to assist in training a smaller language model.
In our method, an LLM plays an important role in the self-training loop of a smaller model in two important ways.
It helps craft additional high-quality training pairs by rewriting input texts conditioned on predicted labels.
arXiv Detail & Related papers (2023-04-24T07:35:38Z)
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
- Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods of BERT -- a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)