CoLLD: Contrastive Layer-to-layer Distillation for Compressing
Multilingual Pre-trained Speech Encoders
- URL: http://arxiv.org/abs/2309.07707v2
- Date: Wed, 27 Dec 2023 06:45:35 GMT
- Title: CoLLD: Contrastive Layer-to-layer Distillation for Compressing
Multilingual Pre-trained Speech Encoders
- Authors: Heng-Jui Chang, Ning Dong, Ruslan Mavlyutov, Sravya Popuri, Yu-An
Chung
- Abstract summary: Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks.
Due to the high cost of developing these large models, building new encoders for new tasks and deploying them to on-device applications are infeasible.
We propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to compress pre-trained speech encoders.
- Score: 19.32466171141613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale self-supervised pre-trained speech encoders outperform
conventional approaches in speech recognition and translation tasks. Due to the
high cost of developing these large models, building new encoders for new tasks
and deploying them to on-device applications are infeasible. Prior studies
propose model compression methods to address this issue, but those works focus
on smaller models and less realistic tasks. Thus, we propose Contrastive
Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to
compress pre-trained speech encoders by leveraging masked prediction and
contrastive learning to train student models to copy the behavior of a large
teacher model. CoLLD outperforms prior methods and closes the gap between small
and large models on multilingual speech-to-text translation and recognition
benchmarks.
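For readers who want to see the mechanics, here is a minimal sketch of such a layer-to-layer distillation objective: a frame-level InfoNCE loss between matched student and teacher layers, computed on masked frames. The layer matching, dimensions, and masking policy below are illustrative placeholders, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def layer_contrastive_loss(student_h, teacher_h, temperature=0.1):
    """student_h, teacher_h: (T, D) hidden states of one utterance at matched layers."""
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    logits = s @ t.T / temperature            # (T, T): similarity of every frame pair
    targets = torch.arange(s.size(0))          # positive = the same time step
    return F.cross_entropy(logits, targets)

def colld_loss(student_layers, teacher_layers, masked_frames, temperature=0.1):
    """Average the contrastive loss over matched layer pairs, restricted to masked frames."""
    total = 0.0
    for s_h, t_h in zip(student_layers, teacher_layers):
        total = total + layer_contrastive_loss(
            s_h[masked_frames], t_h[masked_frames], temperature)
    return total / len(student_layers)

# Toy usage: 2 matched layer pairs, 50 frames, 768-dim features, 20 masked frames.
T, D = 50, 768
masked = torch.randperm(T)[:20]
student = [torch.randn(T, D, requires_grad=True) for _ in range(2)]
teacher = [torch.randn(T, D) for _ in range(2)]
loss = colld_loss(student, teacher, masked)
loss.backward()
```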
Related papers
- Seal: Advancing Speech Language Models to be Few-Shot Learners [17.03216447533895]
This paper introduces the Seal model, an abbreviation for speech language model.
It incorporates a novel alignment method in which a Kullback-Leibler divergence loss is used to train a projector that bridges a frozen speech encoder with a frozen language model decoder.
The resulting Seal model exhibits robust performance as a few-shot encoder on two speech understanding tasks.
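A hedged sketch of that alignment step, assuming the projector maps frozen speech-encoder features into the decoder's embedding space and the KL target comes from the frozen decoder run on a reference input; the module sizes and the linear stand-in for the decoder below are placeholders, not Seal's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Trainable bridge from frozen speech features to the frozen decoder's embedding space."""
    def __init__(self, speech_dim=1024, lm_dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, lm_dim), nn.GELU(),
                                 nn.Linear(lm_dim, lm_dim))

    def forward(self, x):                      # x: (B, T, speech_dim)
        return self.net(x)

def kl_alignment_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the vocabulary, averaged over the batch."""
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

# Toy usage: a frozen linear head stands in for the frozen language model decoder.
vocab, lm_dim = 1000, 2048
decoder_head = nn.Linear(lm_dim, vocab)
for p in decoder_head.parameters():
    p.requires_grad_(False)

proj = Projector()
speech_feats = torch.randn(2, 50, 1024)                    # frozen speech encoder output
student_logits = decoder_head(proj(speech_feats))          # decoder fed projected speech
teacher_logits = decoder_head(torch.randn(2, 50, lm_dim))  # decoder fed a reference input
loss = kl_alignment_loss(student_logits, teacher_logits)
loss.backward()                                            # gradients reach only the projector
```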
arXiv Detail & Related papers (2024-07-20T13:28:12Z) - Collaborative decoding of critical tokens for boosting factuality of
large language models [57.504894664689]
Finetuned and aligned models show improved instruction following and safer generation.
However, the common practice of using sampling during generation also increases the chance of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
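One plausible reading of this framework, sketched under explicit assumptions: an aligned model decodes by default, and at positions flagged as critical (the flagging criterion here is hypothetical) the next token is drawn from the pretrained base model instead. This illustrates the general idea only and is not the paper's algorithm.

```python
import torch

def collaborative_decode(aligned_lm, base_lm, is_critical, ids, steps=20):
    for _ in range(steps):
        logits_a = aligned_lm(ids)[:, -1]      # aligned model's next-token logits
        if is_critical(ids, logits_a):         # hypothetical critical-position detector
            logits = base_lm(ids)[:, -1]       # defer to the pretrained base model
        else:
            logits = logits_a
        next_id = torch.distributions.Categorical(logits=logits).sample()
        ids = torch.cat([ids, next_id[:, None]], dim=-1)
    return ids

# Toy usage with random-logit stand-ins for the two language models.
vocab = 100
fake_lm = lambda ids: torch.randn(ids.size(0), ids.size(1), vocab)
low_confidence = lambda ids, logits: logits.softmax(-1).max() < 0.5  # assumed criterion
out = collaborative_decode(fake_lm, fake_lm, low_confidence,
                           torch.zeros(1, 1, dtype=torch.long))
```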
arXiv Detail & Related papers (2024-02-28T01:53:37Z) - Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders with a mix that leverages both the randomness of masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning, with hard negatives and hard positives constructed in an unsupervised manner.
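A hedged sketch of that second-stage objective: an InfoNCE loss over code embeddings in which the positive is another view of the same snippet and extra mined hard negatives are appended to the in-batch negatives. How the hard positives and negatives are actually constructed is not specified here and is assumed.

```python
import torch
import torch.nn.functional as F

def code_contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """anchors, positives: (B, D) encoder embeddings; hard_negatives: (B, K, D)."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    n = F.normalize(hard_negatives, dim=-1)
    in_batch = a @ p.T                         # (B, B); diagonal = true positives
    hard = torch.einsum("bd,bkd->bk", a, n)    # (B, K) extra hard negatives per anchor
    logits = torch.cat([in_batch, hard], dim=-1) / temperature
    targets = torch.arange(a.size(0))          # each anchor's positive sits on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: batch of 8 code snippets, 256-dim embeddings, 4 mined hard negatives each.
B, D, K = 8, 256, 4
loss = code_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
```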
arXiv Detail & Related papers (2024-02-02T22:19:15Z) - BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining [0.5919433278490629]
BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks.
DeBERTa introduced an enhanced decoder adapted to BERT's encoder model for pretraining, which proved highly effective.
We argue that the design and research around enhanced masked language modeling decoders have been underappreciated.
arXiv Detail & Related papers (2024-01-29T03:25:11Z) - Inter-connection: Effective Connection between Pre-trained Encoder and
Decoder for Speech Translation [10.103202030679844]
We propose an inter-connection mechanism that aggregates the information from each layer of the speech pre-trained model.
This mechanism increased BLEU by approximately 2 points on en-de, en-ja, and en-zh while adding only about 2K parameters when the pre-trained speech model was frozen.
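One simple realization consistent with this summary is a softmax-weighted sum over all layer outputs with a learnable scalar per layer, which adds only a handful of parameters; the paper's exact connection may differ.

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Learned convex combination of the frozen encoder's layer outputs."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer

    def forward(self, layer_outputs):          # list of (B, T, D) hidden states
        stacked = torch.stack(layer_outputs)    # (L, B, T, D)
        w = torch.softmax(self.weights, dim=0)
        return torch.einsum("l,lbtd->btd", w, stacked)

# Toy usage: 24 frozen encoder layers, 10 frames, 1024-dim features.
layers = [torch.randn(2, 10, 1024) for _ in range(24)]
agg = LayerAggregator(num_layers=24)
fused = agg(layers)                             # (2, 10, 1024), passed on to the decoder
```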
arXiv Detail & Related papers (2023-05-26T13:01:29Z) - A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained
Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z) - What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
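The architectural difference behind causal versus non-causal decoder-only models is the attention mask: a causal model attends only to past positions, while a non-causal (prefix-LM) model attends bidirectionally within an input prefix and causally afterwards. The sketch below shows only the mask construction; the paper's adaptation recipe involves more than this.

```python
import torch

def causal_mask(seq_len):
    """True where attention is allowed: each position sees itself and the past."""
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def prefix_lm_mask(seq_len, prefix_len):
    """Bidirectional attention inside the prefix, causal attention after it."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_mask(5).int())
print(prefix_lm_mask(5, prefix_len=3).int())
```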
arXiv Detail & Related papers (2022-04-12T14:19:49Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
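A generic prompt-tuning sketch consistent with this summary: a small set of trainable prompt vectors is prepended to the frozen backbone's input embeddings, and only those vectors are updated per downstream task. GSLM-specific unit tokenization and task heads are omitted, and the stand-in backbone below is a placeholder.

```python
import torch
import torch.nn as nn

class PromptTuner(nn.Module):
    def __init__(self, frozen_model, embed_dim, prompt_len=10):
        super().__init__()
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad_(False)             # the backbone stays frozen
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds):            # input_embeds: (B, T, D)
        B = input_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(B, -1, -1)
        return self.model(torch.cat([prompts, input_embeds], dim=1))

# Toy usage with a stand-in frozen backbone.
backbone = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 8))
tuner = PromptTuner(backbone, embed_dim=256, prompt_len=10)
logits = tuner(torch.randn(4, 30, 256))         # (4, 40, 8); only the prompt gets gradients
```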
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - Differentiable Prompt Makes Pre-trained Language Models Better Few-shot
Learners [23.150999852147283]
This study proposes a novel, pluggable, and efficient approach named DifferentiAble pRompT (DART).
It can convert small language models into better few-shot learners without any prompt engineering.
A comprehensive evaluation of standard NLP tasks demonstrates that the proposed approach achieves a better few-shot performance.
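A loosely DART-flavored sketch under stated assumptions: template tokens and label words are replaced with trainable embeddings, and classification compares the frozen backbone's output at a template position against each label embedding. DART's fluency objective and initialization details are omitted, and the stand-in encoder is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffPromptClassifier(nn.Module):
    def __init__(self, frozen_encoder, embed_dim, num_template_tokens, num_labels):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)             # only the prompts and labels are trained
        self.template = nn.Parameter(torch.randn(num_template_tokens, embed_dim) * 0.02)
        self.labels = nn.Parameter(torch.randn(num_labels, embed_dim) * 0.02)

    def forward(self, input_embeds):            # input_embeds: (B, T, D)
        B = input_embeds.size(0)
        tmpl = self.template.unsqueeze(0).expand(B, -1, -1)
        hidden = self.encoder(torch.cat([input_embeds, tmpl], dim=1))  # (B, T+P, D)
        mask_state = hidden[:, -1]              # treat the last template slot as the [MASK] slot
        return mask_state @ self.labels.T        # (B, num_labels) class scores

# Toy usage with a stand-in frozen encoder.
enc = nn.Sequential(nn.Linear(128, 128), nn.Tanh(), nn.Linear(128, 128))
clf = DiffPromptClassifier(enc, embed_dim=128, num_template_tokens=4, num_labels=3)
scores = clf(torch.randn(2, 16, 128))
loss = F.cross_entropy(scores, torch.tensor([0, 2]))
```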
arXiv Detail & Related papers (2021-08-30T12:29:25Z) - TERA: Self-Supervised Learning of Transformer Encoder Representation for
Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
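A sketch of what alteration along three axes can look like on a log-Mel spectrogram: time masking, frequency-channel masking, and magnitude alteration via additive noise, with an L1 reconstruction objective on the altered frames. The widths, noise level, and stand-in encoder are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def alter(spec, time_width=8, freq_width=10, noise_std=0.2):
    """spec: (T, F) log-Mel frames. Returns an altered copy and the altered time steps."""
    x = spec.clone()
    num_frames, num_freq = x.shape
    t0 = torch.randint(0, max(num_frames - time_width, 1), (1,)).item()
    x[t0:t0 + time_width] = 0.0                 # time-axis alteration (masking)
    f0 = torch.randint(0, max(num_freq - freq_width, 1), (1,)).item()
    x[:, f0:f0 + freq_width] = 0.0              # frequency-axis alteration (channel masking)
    x = x + noise_std * torch.randn_like(x)     # magnitude alteration (additive noise)
    altered = torch.zeros(num_frames, dtype=torch.bool)
    altered[t0:t0 + time_width] = True
    return x, altered

# Pre-training objective: reconstruct the clean frames from the altered input.
spec = torch.randn(200, 80)                     # 200 frames of 80-dim log-Mel features
corrupted, mask = alter(spec)
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))  # Transformer stand-in
loss = F.l1_loss(encoder(corrupted)[mask], spec[mask])
```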
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.