A Cohesive Distillation Architecture for Neural Language Models
- URL: http://arxiv.org/abs/2301.08130v1
- Date: Thu, 12 Jan 2023 08:01:53 GMT
- Title: A Cohesive Distillation Architecture for Neural Language Models
- Authors: Jan Philip Wahle
- Abstract summary: A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size.
This study investigates methods for Knowledge Distillation (KD) to provide efficient alternatives to large-scale models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size, which prevents research groups without the necessary hardware infrastructure from participating in the development process. This study investigates Knowledge Distillation (KD) methods that provide efficient alternatives to large-scale models. In this context, KD means extracting the information about language that is encoded in a Neural Network and in Lexical Knowledge Databases. We developed two methods to test our hypothesis that efficient architectures can gain knowledge from LMs and extract valuable information from lexical sources. First, we present a technique for learning a confident probability distribution for Masked Language Modeling by weighting the predictions of multiple teacher networks. Second, we propose a method for Word Sense Disambiguation (WSD) and lexical KD that is general enough to be adapted to many LMs. Our results show that KD with multiple teachers leads to improved training convergence. With our lexical pre-training method, LM characteristics are not lost, and performance on Natural Language Understanding (NLU) tasks increases over the state of the art while adding no parameters. Moreover, the improved semantic understanding of our model raises task performance beyond WSD and NLU in a real-world scenario (Plagiarism Detection). This study suggests that sophisticated training methods and network architectures can be superior to simply scaling trainable parameters. On this basis, we suggest that the research community should encourage the development and use of efficient models and weigh the impact of growing LM size equally against task performance.
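The first method, prediction weighting of multiple teacher networks, can be pictured with a short sketch. The snippet below combines the Masked Language Modeling distributions of several teachers into a single soft target using an entropy-based confidence weight and distills it into a student with a KL loss; the weighting rule, temperature, and all function names are illustrative assumptions, not the exact formulation from the paper.

```python
import torch
import torch.nn.functional as F

def weighted_teacher_targets(teacher_logits, temperature=2.0):
    """Mix the MLM distributions of several teachers into one soft target.

    teacher_logits: list of tensors shaped (batch, seq_len, vocab_size).
    Teachers that are more confident (lower prediction entropy) receive a
    larger weight at each position; this weighting rule is an illustrative
    choice, not necessarily the one used in the paper.
    """
    probs = [F.softmax(logits / temperature, dim=-1) for logits in teacher_logits]

    # Per-teacher, per-token confidence = negative entropy of its distribution.
    confidences = torch.stack(
        [(p * p.clamp_min(1e-9).log()).sum(dim=-1) for p in probs], dim=0
    )  # (num_teachers, batch, seq_len)

    # Normalize confidences across teachers into mixing weights.
    weights = F.softmax(confidences, dim=0)

    # Confidence-weighted mixture of the teacher distributions.
    stacked = torch.stack(probs, dim=0)  # (num_teachers, batch, seq_len, vocab_size)
    return (weights.unsqueeze(-1) * stacked).sum(dim=0)

def multi_teacher_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the student and the weighted teacher mixture."""
    target = weighted_teacher_targets(teacher_logits, temperature)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, target, reduction="batchmean") * temperature ** 2
```

In training, such a loss would typically be applied at the masked positions alongside the standard masked-token cross-entropy, so the student learns both from the teachers' soft targets and from the ground-truth tokens.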
Related papers
- Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval-augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z) - Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning [6.404122934568861]
Supervised learning (SL) approaches have achieved impressive performance while utilizing significantly less training data compared to previous methods.
We propose a novel approach that combines SL and RL techniques over the MiniWoB benchmark to leverage the strengths of both methods.
Our experiments demonstrate that our approach outperforms previous SL methods on certain tasks using less data and narrows the performance gap with RL models.
arXiv Detail & Related papers (2024-05-01T13:51:45Z) - Evolving Knowledge Distillation with Large Language Models and Active Learning [46.85430680828938]
Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks.
Previous research has attempted to distill the knowledge of LLMs into smaller models by generating annotated data.
We propose EvoKD: Evolving Knowledge Distillation, which leverages the concept of active learning to interactively enhance the process of data generation using large language models.
arXiv Detail & Related papers (2024-03-11T03:55:24Z) - A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization [20.572283625521784]
We develop a neural network based abstractive multi-document summarization (MDS) model.
We feed the dependency information into the linguistic-guided attention mechanism.
With the help of linguistic signals, sentence-level relations can be correctly captured.
arXiv Detail & Related papers (2021-09-23T08:13:35Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x inference speed-up while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
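As a closing illustration of the last entry's idea, the sketch below generates extra training utterances for a spoken language understanding task by sampling continuations from a small pretrained language model via the Hugging Face transformers pipeline. The model choice (gpt2), prompt format, and filtering are assumptions made for this example, not the augmentation method proposed in that paper.

```python
from transformers import pipeline

# Generic LM-based data augmentation sketch; the model, prompt, and filtering
# below are illustrative assumptions, not the method of the cited paper.
generator = pipeline("text-generation", model="gpt2")

def augment_utterances(seed_utterances, variants_per_seed=3, max_new_tokens=20):
    """Generate candidate training utterances by sampling LM continuations."""
    augmented = []
    for seed in seed_utterances:
        prompt = f"Rephrase the request: {seed}\nRephrased:"
        outputs = generator(
            prompt,
            num_return_sequences=variants_per_seed,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
        )
        for out in outputs:
            # Keep only the first line of the newly generated continuation.
            lines = out["generated_text"][len(prompt):].strip().splitlines()
            if lines and lines[0]:
                # Pair each variant with its seed so it inherits the seed's labels.
                augmented.append((lines[0], seed))
    return augmented

# Example: expand a tiny seed set for a flight-booking intent.
print(augment_utterances(["book a flight to Boston tomorrow morning"], variants_per_seed=2))
```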