Knowledge Distillation of Russian Language Models with Reduction of Vocabulary
- URL: http://arxiv.org/abs/2205.02340v1
- Date: Wed, 4 May 2022 21:56:57 GMT
- Title: Knowledge Distillation of Russian Language Models with Reduction of Vocabulary
- Authors: Alina Kolesnikova, Yuri Kuratov, Vasily Konovalov, Mikhail Burtsev
- Abstract summary: Transformer language models serve as a core component for the majority of natural language processing tasks.
Existing methods in this field are mainly focused on reducing the number of layers or the dimension of embeddings/hidden representations.
We propose two simple yet effective alignment techniques that enable knowledge distillation to students with reduced vocabularies.
- Score: 0.1092387707389144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today, transformer language models serve as a core component for the
majority of natural language processing tasks. Industrial application of such
models requires minimization of computation time and memory footprint.
Knowledge distillation is one approach to address this goal. Existing methods
in this field mainly focus on reducing the number of layers or the dimension of
embeddings/hidden representations. An alternative option is to reduce the
number of tokens in the vocabulary, and therefore the embedding matrix of the
student model. The main problem with vocabulary minimization is the mismatch
between the input sequences and output class distributions of the teacher and
student models. As a result, KL-based knowledge distillation cannot be applied
directly. We propose two simple yet effective alignment techniques that enable
knowledge distillation to students with reduced vocabularies. Evaluation of the
distilled models on a number of common benchmarks for Russian, such as Russian
SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrates
that our techniques achieve compression from $17\times$ to $49\times$ while
maintaining the quality of a $1.7\times$ compressed student that keeps the
full-sized vocabulary and only reduces the number of Transformer layers. We
make our code and distilled models available.
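The abstract does not spell out the two alignment techniques, but the obstacle it describes, a teacher and a student emitting distributions over different vocabularies, can be illustrated with a minimal sketch. The projection-and-renormalization below, and the helper names `build_shared_index` and `distillation_kl`, are assumptions for illustration rather than the paper's method; the sketch further assumes the student vocabulary is a subset of the teacher's and that the compared positions are already token-aligned.

```python
# Minimal sketch (an assumption, not the paper's method): project the teacher's
# output distribution onto a reduced student vocabulary so that a KL-based
# distillation loss is well-defined despite the vocabulary mismatch.
import torch
import torch.nn.functional as F

def build_shared_index(teacher_vocab: dict, student_vocab: dict) -> torch.Tensor:
    """teacher_vocab / student_vocab map token strings to ids.
    Returns, for every student id, the teacher id of the same token string
    (assumes every student token also exists in the teacher vocabulary)."""
    ordered = sorted(student_vocab.items(), key=lambda kv: kv[1])
    return torch.tensor([teacher_vocab[token] for token, _ in ordered])

def distillation_kl(teacher_logits, student_logits, shared_index, T=2.0):
    """KL(teacher || student) restricted to the student vocabulary.
    teacher_logits: (batch, |V_teacher|); student_logits: (batch, |V_student|)."""
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # Keep only the probability mass of tokens the student knows, then renormalize.
    teacher_reduced = teacher_probs[:, shared_index.to(teacher_probs.device)]
    teacher_reduced = teacher_reduced / teacher_reduced.sum(dim=-1, keepdim=True)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(student_log_probs, teacher_reduced, reduction="batchmean") * T * T
```

In practice the harder part is that the two tokenizers segment the input differently, which is exactly the input-sequence mismatch the paper's alignment techniques address.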
Related papers
- TokAlign: Efficient Vocabulary Adaptation via Token Alignment [41.59130966729569]
Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, an inefficient tokenizer slows down LLM training and generation. We propose an efficient method named TokAlign that replaces the vocabulary of an LLM based on token co-occurrences.
arXiv Detail & Related papers (2025-06-04T03:15:57Z)
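As a rough illustration of the co-occurrence view mentioned in the TokAlign summary above (an assumption about the general idea, not the paper's actual algorithm): if both the new and the old vocabulary are profiled by how often their tokens co-occur with a shared set of pivot tokens, each new token can be mapped to the old token with the most similar profile, and its embedding initialized from the aligned old vector.

```python
# Hedged sketch of token alignment from co-occurrence statistics; the pivot-based
# profiling and greedy argmax matching are illustrative assumptions.
import numpy as np

def align_by_cooccurrence(cooc_new: np.ndarray, cooc_old: np.ndarray) -> np.ndarray:
    """cooc_new: (V_new, P) and cooc_old: (V_old, P) count how often each token
    co-occurs with P shared pivot tokens. Returns, for every new token id, the
    old token id with the most similar (cosine) co-occurrence profile."""
    a = cooc_new / (np.linalg.norm(cooc_new, axis=1, keepdims=True) + 1e-8)
    b = cooc_old / (np.linalg.norm(cooc_old, axis=1, keepdims=True) + 1e-8)
    return (a @ b.T).argmax(axis=1)

def init_new_embeddings(old_embeddings: np.ndarray, mapping: np.ndarray) -> np.ndarray:
    """Initialize the new embedding matrix by copying the aligned old vectors."""
    return old_embeddings[mapping]
```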
- On Multilingual Encoder Language Model Compression for Low-Resource Languages [10.868526090169283]
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to extremely compress multilingual encoder-only language models. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% on four downstream tasks. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses.
arXiv Detail & Related papers (2025-05-22T17:35:39Z)
- Multi-Sense Embeddings for Language Models and Knowledge Distillation [17.559171180573664]
Transformer-based large language models (LLMs) rely on contextual embeddings, which generate different representations for the same token depending on its surrounding context. We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of its uses in a language.
arXiv Detail & Related papers (2025-04-08T13:36:36Z)
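A minimal sketch of the multi-sense idea from the entry above (the storage layout and nearest-sense lookup are assumptions, not the paper's construction): each token keeps several sense vectors, and the sense closest to the token's current contextual embedding stands in for it.

```python
# Illustrative sketch: pick, for a token, the stored sense vector closest to its
# contextual embedding. Shapes and the cosine criterion are assumptions.
import numpy as np

def select_sense(sense_vectors: np.ndarray, contextual: np.ndarray) -> np.ndarray:
    """sense_vectors: (K, d) sense inventory for one token; contextual: (d,)
    contextual embedding of the token in its current sentence.
    Returns the sense vector with the highest cosine similarity to the context."""
    s = sense_vectors / (np.linalg.norm(sense_vectors, axis=1, keepdims=True) + 1e-8)
    c = contextual / (np.linalg.norm(contextual) + 1e-8)
    return sense_vectors[int((s @ c).argmax())]
```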
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
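The relative representations used by MoSECroT can be sketched in their generic form (the anchor-based recipe below is the standard construction, assumed here rather than taken from the paper's details): both the source-language PLM embeddings and the target-language static embeddings are re-encoded as cosine similarities to a shared set of anchor words, which places them in a comparable space.

```python
# Hedged sketch: relative representations, i.e. encode every vector by its cosine
# similarities to a fixed set of anchor vectors from the same embedding space.
import numpy as np

def relative_representation(X: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """X: (n, d) embeddings, anchors: (m, d) anchor embeddings from the same
    space. Returns (n, m) cosine similarities, the shared 'relative' space."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    An = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + 1e-8)
    return Xn @ An.T
```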
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest to improve time and memory efficiency.
We apply two language heuristics, Unicode-based script filtering and corpus-based selection, to trim the full vocabulary for different language families and sizes.
VT is found to reduce the memory usage of small models by nearly 50%, with an upper bound of a 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
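The two heuristics named in the entry above lend themselves to a short sketch (combining them, and the concrete threshold and script set, are assumptions for illustration): keep a subword either because its characters belong to the scripts of the language of interest or because it occurs often enough in a reference corpus, then re-index the surviving tokens.

```python
# Hedged sketch of vocabulary trimming: Unicode-script filtering plus
# corpus-based selection. The threshold and the allowed scripts are illustrative.
import unicodedata
from collections import Counter

def script_ok(token: str, allowed_prefixes=("CYRILLIC", "LATIN")) -> bool:
    """Keep tokens whose alphabetic characters all come from allowed scripts."""
    for ch in token:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith(allowed_prefixes):
            return False
    return True

def trim_vocabulary(vocab: dict, corpus_tokens, min_count: int = 1) -> dict:
    """vocab maps token string -> id; corpus_tokens is an iterable of tokens
    from tokenizing a reference corpus. Keeps a token if it passes the script
    filter or is frequent enough in the corpus, then re-indexes the ids."""
    counts = Counter(corpus_tokens)
    kept = [t for t in vocab if script_ok(t) or counts[t] >= min_count]
    return {t: i for i, t in enumerate(kept)}
```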
- Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method [14.423829182894345]
We propose a general language model distillation (GLMD) method that performs two-stage word prediction distillation and vocabulary compression.
Experimental results show that our method outperforms 25 state-of-the-art methods on the SuperGLUE benchmark, achieving an average score that surpasses the best method by 3%.
arXiv Detail & Related papers (2023-06-11T08:53:27Z)
- Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models [12.670354498961492]
State-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.
Knowledge Distillation is one popular technique to develop competitive, lightweight models.
arXiv Detail & Related papers (2022-10-27T05:30:13Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
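A rough sketch of the reward-quantization step suggested by the Quark summary above (the bin count, token format, and quantile scheme are assumptions): sampled generations are scored by the reward function, rewards are bucketed into quantiles, and each sample is tagged with its bucket's control token so that decoding can later condition on the best bucket.

```python
# Illustrative sketch: quantize scalar rewards into quantile bins and tag each
# sample with a reward token. The bin count and token naming are assumptions.
import numpy as np

def quantize_rewards(rewards, num_bins: int = 5):
    """Returns, for every reward, a control token '<rk_i>' where i is the
    quantile bin (0 = lowest reward, num_bins - 1 = highest)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    edges = np.quantile(rewards, np.linspace(0, 1, num_bins + 1)[1:-1])
    bins = np.digitize(rewards, edges)  # integer bin ids in 0 .. num_bins - 1
    return [f"<rk_{b}>" for b in bins]

# Training pairs then become (reward_token + prompt, generation); at inference
# time one conditions on the highest-reward token, e.g. "<rk_4>".
```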
- PERFECT: Prompt-free and Efficient Few-shot Learning with Language Models [67.3725459417758]
PERFECT is a simple and efficient method for few-shot fine-tuning of PLMs that does not rely on handcrafted prompts and verbalizers.
We show that manually engineered task prompts can be replaced with task-specific adapters that enable sample-efficient fine-tuning.
Experiments on a wide range of few-shot NLP tasks demonstrate that PERFECT, while being simple and efficient, also outperforms existing state-of-the-art few-shot learning methods.
arXiv Detail & Related papers (2022-04-03T22:31:25Z)
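The task-specific adapters mentioned in the PERFECT entry above can be illustrated with a generic bottleneck adapter (the standard layout, not necessarily PERFECT's exact design): a small down-projection, nonlinearity, and up-projection with a residual connection, trained while the PLM backbone stays frozen.

```python
# Minimal sketch of a bottleneck adapter module (generic layout, shown only to
# illustrate "task-specific adapters"; not PERFECT's exact architecture).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen backbone's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```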
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose Lattice-BERT, a novel pre-training paradigm for Chinese.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
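The lattice construction described in the Lattice-BERT entry above can be sketched as follows (the simple dictionary scan and the (unit, start, end) encoding are assumptions for illustration): every character is a lattice unit, and every dictionary word found in the sentence is added as a unit spanning its character positions, so the transformer sees all granularities at once.

```python
# Illustrative sketch: build a word-character lattice for a sentence.
# The dictionary scan and (unit, start, end) tuples are assumptions.
def build_lattice(sentence: str, dictionary: set, max_word_len: int = 4):
    """Returns a list of (text_unit, start, end) lattice entries: all single
    characters plus every dictionary word occurring in the sentence."""
    units = [(ch, i, i + 1) for i, ch in enumerate(sentence)]
    for start in range(len(sentence)):
        for length in range(2, max_word_len + 1):
            end = start + length
            if end > len(sentence):
                break
            word = sentence[start:end]
            if word in dictionary:
                units.append((word, start, end))
    return units
```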
- Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning [12.761055946548437]
We propose an efficient transformer-based large-scale language representation using hardware-friendly block structured pruning.
Besides the significantly reduced weight storage and computation, the proposed approach achieves high compression rates.
It is suitable to deploy the final compressed model on resource-constrained edge devices.
arXiv Detail & Related papers (2020-09-17T04:45:47Z)
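Block-structured pruning itself is easy to illustrate (the block size, L2 scoring, and sparsity level below are illustrative choices, not the paper's configuration): the weight matrix is divided into fixed-size tiles, tiles are scored, and the lowest-scoring fraction is zeroed so hardware can skip whole blocks.

```python
# Hedged sketch of block-structured pruning: zero out the lowest-norm blocks of
# a weight matrix. Block size and sparsity level are illustrative parameters.
import numpy as np

def block_prune(weight: np.ndarray, block: int = 8, sparsity: float = 0.5) -> np.ndarray:
    """Splits `weight` (rows and cols divisible by `block`) into block x block
    tiles, keeps the highest-L2-norm tiles, and zeroes the rest."""
    rows, cols = weight.shape
    tiles = weight.reshape(rows // block, block, cols // block, block)
    norms = np.linalg.norm(tiles, axis=(1, 3))      # one score per tile
    threshold = np.quantile(norms, sparsity)        # prune the bottom fraction
    mask = (norms >= threshold)[:, None, :, None]   # broadcast back to tile shape
    return (tiles * mask).reshape(rows, cols)
```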
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.