Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching
- URL: http://arxiv.org/abs/2209.05869v1
- Date: Tue, 13 Sep 2022 10:33:04 GMT
- Title: Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching
- Authors: Kunbo Ding, Weijie Liu, Yuejian Fang, Zhe Zhao, Qi Ju, Xuefeng Yang
- Abstract summary: Cross-lingual knowledge distillation can significantly improve the performance of pre-trained models for cross-lingual similarity matching tasks.
We propose a multi-stage distillation framework for constructing a small-size but high-performance cross-lingual model.
Our method can compress the size of XLM-R and MiniLM by more than 50%, while the performance is only reduced by about 1%.
- Score: 12.833080411053842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous studies have proved that cross-lingual knowledge distillation can
significantly improve the performance of pre-trained models for cross-lingual
similarity matching tasks. However, the student model must remain large for this to work; otherwise, its performance drops sharply, making it impractical to deploy on memory-limited devices. To address this issue, we
delve into cross-lingual knowledge distillation and propose a multi-stage
distillation framework for constructing a small-size but high-performance
cross-lingual model. In our framework, contrastive learning, bottleneck, and
parameter recurrent strategies are combined to prevent performance from being
compromised during the compression process. The experimental results
demonstrate that our method can compress the size of XLM-R and MiniLM by more
than 50%, while the performance is only reduced by about 1%.
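As a concrete illustration of the strategies named in the abstract, the following is a minimal, hypothetical PyTorch sketch of contrastive distillation through a bottleneck head: a frozen multilingual teacher supplies sentence embeddings, a smaller student is pulled toward the teacher embedding of its own (parallel) sentence with an InfoNCE-style loss, and a bottleneck projection forces a compact intermediate code. The module names, dimensions, temperature, and loss form are illustrative assumptions, not the paper's released implementation.
```python
# Hypothetical sketch only: contrastive distillation through a bottleneck head.
# Assumes a frozen multilingual teacher and a smaller student encoder that both
# return fixed-size sentence embeddings; names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckHead(nn.Module):
    """Projects student embeddings down and back up, forcing a compact code."""
    def __init__(self, dim: int = 384, bottleneck_dim: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(torch.tanh(self.down(x)))

def contrastive_distill_loss(student_emb, teacher_emb, temperature: float = 0.05):
    """InfoNCE-style loss: each student embedding should be closest to the
    teacher embedding of its own (parallel) sentence within the batch."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = (s @ t.T) / temperature                 # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    batch, dim = 8, 384
    teacher_emb = torch.randn(batch, dim)            # stand-in for frozen teacher output
    student_emb = torch.randn(batch, dim, requires_grad=True)  # stand-in for student output
    head = BottleneckHead(dim)
    loss = contrastive_distill_loss(head(student_emb), teacher_emb)
    loss.backward()                                  # gradients flow to head and student
    print(float(loss))
```
In this sketch the bottleneck width is the size/quality knob, and the contrastive supervision from the teacher is what counteracts the accuracy loss that compression would otherwise cause.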
Related papers
- On Multilingual Encoder Language Model Compression for Low-Resource Languages [10.868526090169283]
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to extremely compress multilingual encoder-only language models.
We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% across four downstream tasks.
Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses.
arXiv Detail & Related papers (2025-05-22T17:35:39Z)
- Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration [31.50005609235654]
This study rethinks the current landscape of training-free token reduction research.
We propose a unified "filter-correlate-compress" paradigm that decomposes token reduction into three distinct stages.
Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs.
arXiv Detail & Related papers (2024-11-26T18:53:51Z)
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
We propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST).
AST transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process.
Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models.
arXiv Detail & Related papers (2024-07-30T06:33:44Z)
- Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show that distillation, which has been highly effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
arXiv Detail & Related papers (2023-05-24T08:18:35Z)
- Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning [99.42850643947439]
We show that going beyond English-centric bitexts, coupled with a novel sampling strategy, substantially boosts performance across model sizes.
Our XY-LENT XL variant outperforms XLM-R XXL and exhibits competitive performance with mT5 XXL while being 5x and 6x smaller, respectively.
arXiv Detail & Related papers (2022-10-26T17:16:52Z)
- A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
- DiSparse: Disentangled Sparsification for Multitask Model Compression [92.84435347164435]
DiSparse is a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme.
Our experimental results demonstrate superior performance on various configurations and settings.
arXiv Detail & Related papers (2022-06-09T17:57:46Z)
- Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation [28.432799973328127]
We propose Homomorphic Projective Distillation (HPD) to learn compressed sentence embeddings.
Our method augments a small Transformer encoder model with learnable projection layers to produce compact representations (an illustrative sketch of this projection idea appears after this list).
arXiv Detail & Related papers (2022-03-15T07:05:43Z)
- Multi-Level Contrastive Learning for Cross-Lingual Alignment [35.33431650608965]
Cross-language pre-trained models such as multilingual BERT (mBERT) have achieved strong performance on various cross-lingual downstream NLP tasks.
This paper proposes a multi-level contrastive learning framework to further improve the cross-lingual ability of pre-trained models.
arXiv Detail & Related papers (2022-02-26T07:14:20Z)
- Lightweight Cross-Lingual Sentence Representation Learning [57.9365829513914]
We introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations.
We propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task.
arXiv Detail & Related papers (2021-05-28T14:10:48Z)
- Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
We propose a Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
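For the Homomorphic Projective Distillation entry above (referenced there), here is a hedged sketch of the general idea under stated assumptions: a small encoder is augmented with a learnable projection layer that emits compact embeddings, and those compact embeddings are supervised against a larger teacher's output mapped into the same space. The toy encoder, dimensions, and MSE objective are illustrative assumptions, not the paper's exact method.
```python
# Hypothetical sketch only: a small encoder plus a learnable projection layer
# whose compact output is matched to a projected teacher embedding.
# The toy encoder, dimensions, and MSE objective are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedStudent(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, out_dim: int = 128):
        super().__init__()
        self.encoder = encoder                       # small Transformer (or any) encoder
        self.proj = nn.Linear(hidden_dim, out_dim)   # learnable projection to a compact size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.encoder(x))

if __name__ == "__main__":
    batch, student_dim, teacher_dim, compact_dim = 8, 384, 768, 128
    toy_encoder = nn.Linear(student_dim, student_dim)          # stand-in for a real encoder
    student = ProjectedStudent(toy_encoder, hidden_dim=student_dim, out_dim=compact_dim)
    teacher_to_compact = nn.Linear(teacher_dim, compact_dim)   # maps teacher output into the compact space

    student_inputs = torch.randn(batch, student_dim)           # stand-in for encoded sentences
    teacher_emb = torch.randn(batch, teacher_dim)              # stand-in for frozen teacher output
    target = teacher_to_compact(teacher_emb).detach()

    loss = F.mse_loss(student(student_inputs), target)         # align compact embeddings
    loss.backward()
    print(float(loss))
```
In a setup like this, the projection output dimension (128 here) determines the stored embedding size, which is why it is the natural knob for trading retrieval quality against index size.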
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.