Are Intermediate Layers and Labels Really Necessary? A General Language
Model Distillation Method
- URL: http://arxiv.org/abs/2306.06625v1
- Date: Sun, 11 Jun 2023 08:53:27 GMT
- Title: Are Intermediate Layers and Labels Really Necessary? A General Language
Model Distillation Method
- Authors: Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Shu Zhao, Peng
Zhang, Jie Tang
- Abstract summary: We propose a general language model distillation (GLMD) method that performs two-stage word prediction distillation and vocabulary compression.
Experimental results show that our method outperforms 25 state-of-the-art methods on the SuperGLUE benchmark, achieving an average score that surpasses the best method by 3%.
- Score: 14.423829182894345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The large scale of pre-trained language models poses a challenge for their
deployment on various devices, with a growing emphasis on methods to compress
these models, particularly knowledge distillation. However, current knowledge
distillation methods rely on the model's intermediate layer features and the
golden labels (also called hard labels), which usually require aligned model
architectures and sufficient labeled data, respectively. Moreover, the vocabulary
parameters are usually neglected in existing methods. To address these
problems, we propose a general language model distillation (GLMD) method that
performs two-stage word prediction distillation and vocabulary compression,
which is simple yet shows surprisingly strong performance.
Specifically, GLMD supports more general application scenarios by eliminating
the constraints of dimension and structure between models and the need for
labeled datasets through the absence of intermediate layers and golden labels.
Meanwhile, based on the long-tailed distribution of word frequencies in the
data, GLMD designs a strategy of vocabulary compression through decreasing
vocabulary size instead of dimensionality. Experimental results show that our
method outperforms 25 state-of-the-art methods on the SuperGLUE benchmark,
achieving an average score that surpasses the best method by 3%.
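The full GLMD recipe is not reproduced in this abstract, but its two ingredients named above, distilling only the teacher's output word-prediction distribution (no intermediate layers, no golden labels) and shrinking the vocabulary by dropping rare words rather than reducing embedding dimensionality, can be sketched roughly as follows. All function names, hyperparameters, and the <unk> handling are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of the two GLMD ingredients named above: soft-label
# word-prediction distillation and frequency-based vocabulary truncation.
# Names, hyperparameters, and <unk> handling are assumptions, not the authors' code.
import torch
import torch.nn.functional as F
from collections import Counter

def word_prediction_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student word-prediction distributions.

    No intermediate-layer features and no golden labels are used, so the
    teacher and student may differ in depth and hidden size.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # soft-label distillation loss, rescaled by T^2 as is conventional
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def build_vocab_compression(token_id_stream, keep_size):
    """Exploit the long-tailed word-frequency distribution: keep only the
    `keep_size` most frequent token ids, reducing vocabulary *size* rather
    than embedding dimensionality; everything else maps to a shared <unk>."""
    counts = Counter(token_id_stream)
    remap = {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(keep_size))}
    return lambda tok: remap.get(tok, 0)   # 0 is the shared <unk> slot

# usage sketch with random tensors standing in for real model outputs
teacher_logits = torch.randn(4, 16, 30522)                      # (batch, seq, vocab)
student_logits = torch.randn(4, 16, 30522, requires_grad=True)
loss = word_prediction_distillation_loss(student_logits, teacher_logits)
loss.backward()
```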
Related papers
- Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI.
Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency.
We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence.
Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z)
- Uniform Discretized Integrated Gradients: An effective attribution based method for explaining large language models [0.0]
Integrated Gradients is a well-known technique for explaining deep learning models.
In this paper, we propose a method called Uniform Discretized Integrated Gradients (UDIG).
We evaluate our method on two types of NLP tasks, Sentiment Classification and Question Answering, against three metrics: Log-odds, Comprehensiveness, and Sufficiency.
arXiv Detail & Related papers (2024-12-05T05:39:03Z)
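The UDIG construction itself is not described in this summary; as background for the attribution setting above, standard Integrated Gradients accumulates gradients along a straight path from a baseline input to the actual input. A minimal sketch of the vanilla method, with all names illustrative:

```python
# Minimal sketch of *standard* Integrated Gradients, not the proposed UDIG
# variant. `model_fn` is an assumed scalar-output callable (e.g. the logit of
# the predicted class given embedded tokens).
import torch

def integrated_gradients(model_fn, inputs, baseline, steps=50):
    """Approximate IG attributions along the straight path baseline -> inputs."""
    alphas = torch.linspace(0.0, 1.0, steps + 1)[1:]   # interpolation coefficients
    total_grads = torch.zeros_like(inputs)
    for alpha in alphas:
        point = baseline + alpha * (inputs - baseline)
        point.requires_grad_(True)
        score = model_fn(point)                        # scalar model output
        grad, = torch.autograd.grad(score, point)
        total_grads += grad
    # average gradient times the input-baseline difference (Riemann approximation)
    return (inputs - baseline) * (total_grads / steps)
```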
- CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning [4.004641316826348]
We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT).
Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets.
The proposed parameter-efficient framework can reduce the total trainable model size by 39% and shrink the trainable language model to only 4% of the current BERT encoder.
arXiv Detail & Related papers (2024-07-30T17:57:32Z)
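CLEFT's exact objective and prompt-tuning details are not given above; language-image contrastive learning is commonly implemented as a CLIP-style symmetric loss over matched image-text pairs in a batch, sketched here under that assumption:

```python
# Rough sketch of a CLIP-style symmetric contrastive loss over paired image
# and text embeddings; CLEFT's precise objective and prompt tuning are not
# reproduced here.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # matched image-text pairs sit on the diagonal; contrast in both directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```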
- Fuzzy Fingerprinting Transformer Language-Models for Emotion Recognition in Conversations [0.7874708385247353]
We propose to combine the two approaches (fuzzy fingerprints and pre-trained transformer language models) to perform Emotion Recognition in Conversations (ERC).
We feed utterances and their previous conversational turns to a pre-trained RoBERTa, obtaining contextual embedding utterance representations.
We validate our approach on the widely used DailyDialog ERC benchmark dataset.
arXiv Detail & Related papers (2023-09-08T12:26:01Z)
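A minimal sketch of the encoding step described above, feeding the current utterance together with its previous conversational turns to a pre-trained RoBERTa via the Hugging Face transformers library; the separator and the use of the <s> position as the utterance representation are assumptions, not the authors' exact setup:

```python
# Sketch: the current utterance plus its previous turns go through a
# pre-trained RoBERTa. Separator and pooling choices are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def utterance_embedding(utterance, previous_turns, max_length=256):
    text = " </s> ".join(previous_turns + [utterance])   # context then current turn
    batch = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (1, seq_len, hidden_dim)
    return hidden[:, 0]                                  # embedding at the <s> token

emb = utterance_embedding("I'm fine, thanks!", ["Hi!", "How are you today?"])
print(emb.shape)   # e.g. torch.Size([1, 768])
```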
- Compressing Sentence Representation with maximum Coding Rate Reduction [0.0]
In most natural language inference problems, sentence representation is needed for semantic retrieval tasks.
Due to hardware constraints on memory and latency, there is a need to attain comparable results with a smaller model.
We demonstrate that the new language model with reduced complexity and sentence embedding size can achieve comparable results on semantic retrieval benchmarks.
arXiv Detail & Related papers (2023-04-25T09:23:43Z)
- LEAD: Liberal Feature-based Distillation for Dense Retrieval [67.48820723639601]
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model.
Traditional methods include response-based methods and feature-based methods.
In this paper, we propose a liberal feature-based distillation method (LEAD)
arXiv Detail & Related papers (2022-12-10T06:30:54Z)
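LEAD's "liberal" design is not detailed above; the distinction it builds on, response-based distillation (matching output scores) versus feature-based distillation (matching hidden representations through a learned projection), can be sketched as follows, with the projection being an illustrative choice rather than LEAD's actual architecture:

```python
# Sketch of the two traditional distillation signals mentioned above:
# response-based (match output scores) and feature-based (match hidden states
# through a learned projection). The projection is illustrative, not LEAD's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

def response_based_loss(student_scores, teacher_scores, temperature=1.0):
    t = temperature
    return F.kl_div(F.log_softmax(student_scores / t, dim=-1),
                    F.softmax(teacher_scores / t, dim=-1),
                    reduction="batchmean") * (t * t)

class FeatureBasedLoss(nn.Module):
    """MSE between teacher hidden states and projected student hidden states;
    the projection absorbs the dimension mismatch between the two models."""
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)
```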
- LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds [62.49198183539889]
We propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds.
Our method co-designs an efficient labeling process with semi/weakly supervised learning.
Our proposed method is highly competitive even with the fully supervised counterpart trained on 100% of the labels.
arXiv Detail & Related papers (2022-10-14T19:13:36Z)
- Knowledge Distillation of Russian Language Models with Reduction of Vocabulary [0.1092387707389144]
Transformer language models serve as a core component for the majority of natural language processing tasks.
Existing methods in this field are mainly focused on reducing the number of layers or dimension of embeddings/hidden representations.
We propose two simple yet effective alignment techniques to enable knowledge distillation to students with a reduced vocabulary.
arXiv Detail & Related papers (2022-05-04T21:56:57Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
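MoEBERT's importance-guided conversion of BERT's feed-forward layers is not reproduced here; the underlying Mixture-of-Experts idea, routing each token to a small subset of expert feed-forward networks so capacity grows without proportional inference cost, can be sketched generically (top-1 routing, illustrative sizes):

```python
# Generic top-1 Mixture-of-Experts feed-forward block, sketched to illustrate
# the capacity/speed trade-off mentioned above; not MoEBERT's exact adaptation.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)  # routing probabilities
        top_prob, top_idx = gate.max(dim=-1)   # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # only the tokens routed to expert e pass through it
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out
```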
- Are We Really Making Much Progress in Text Classification? A Comparative Review [5.33235750734179]
We analyze various methods for single-label and multi-label text classification across well-known datasets.
We highlight the superiority of discriminative language models like BERT over generative models for supervised tasks.
arXiv Detail & Related papers (2022-04-08T09:28:20Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
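SFLM's prompt-based formulation is not given above; the generic self-training loop it revisits, pseudo-labeling unlabeled examples with the current model and training only on confident predictions, looks roughly like this (the threshold, the loss weighting, and the assumption of a plain classifier over tensor inputs are illustrative):

```python
# Generic self-training sketch: pseudo-label unlabeled inputs with the current
# model and keep only confident predictions; SFLM's prompt-based variant differs.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_batch, threshold=0.9):
    """Return (inputs, labels) for examples the model is confident about."""
    probs = F.softmax(model(unlabeled_batch), dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold
    return unlabeled_batch[keep], labels[keep]

def self_training_step(model, optimizer, labeled, labels, unlabeled):
    pl_inputs, pl_labels = pseudo_label(model, unlabeled)
    loss = F.cross_entropy(model(labeled), labels)
    if len(pl_inputs) > 0:
        # the pseudo-labeled loss is typically down-weighted
        loss = loss + 0.5 * F.cross_entropy(model(pl_inputs), pl_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```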
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
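The paper's grounded construction is not detailed in the summary above; one generic way to make an output embedding layer independent of the training vocabulary, composing each word's output vector from a fixed-size table of hashed character n-grams, is sketched below purely as an illustration:

```python
# Hedged sketch of a *compositional* output embedding: a word's output vector
# is built from its character n-grams, so model size does not grow with the
# vocabulary. A generic illustration, not the paper's exact architecture.
import torch
import torch.nn as nn

class CompositionalOutputEmbedding(nn.Module):
    def __init__(self, d_model=512, num_ngram_buckets=100_000, n=3):
        super().__init__()
        self.n = n
        self.ngram_emb = nn.Embedding(num_ngram_buckets, d_model)  # fixed-size table

    def word_vector(self, word):
        padded = f"<{word}>"
        ngrams = [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]
        ids = torch.tensor([hash(g) % self.ngram_emb.num_embeddings for g in ngrams])
        return self.ngram_emb(ids).mean(dim=0)             # (d_model,)

    def forward(self, hidden, candidate_words):
        # score each candidate word against the hidden state
        weights = torch.stack([self.word_vector(w) for w in candidate_words])  # (V', d)
        return hidden @ weights.t()                         # (..., V') logits
```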