CoDERT: Distilling Encoder Representations with Co-learning for
Transducer-based Speech Recognition
- URL: http://arxiv.org/abs/2106.07734v1
- Date: Mon, 14 Jun 2021 20:03:57 GMT
- Title: CoDERT: Distilling Encoder Representations with Co-learning for
Transducer-based Speech Recognition
- Authors: Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo,
Athanasios Mouchtaris
- Abstract summary: We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions.
We introduce an auxiliary loss to distill the encoder logits from a teacher transducer's encoder, and explore training strategies where this encoder distillation works effectively.
- Score: 14.07385381963374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a simple yet effective method to compress an RNN-Transducer
(RNN-T) through the well-known knowledge distillation paradigm. We show that
the transducer's encoder outputs naturally have a high entropy and contain rich
information about acoustically similar word-piece confusions. This rich
information is suppressed when combined with the lower entropy decoder outputs
to produce the joint network logits. Consequently, we introduce an auxiliary
loss to distill the encoder logits from a teacher transducer's encoder, and
explore training strategies where this encoder distillation works effectively.
We find that tandem training of teacher and student encoders with an in-place
encoder distillation outperforms the use of a pre-trained and static teacher
transducer. We also report an interesting phenomenon we refer to as implicit
distillation, which occurs when the teacher and student encoders share the same
decoder. Our experiments show 5.37-8.4% relative word error rate reductions
(WERRs) on in-house test sets, and 5.05-6.18% relative WERRs on LibriSpeech test
sets.
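To make the auxiliary objective concrete, the following is a minimal sketch (not the authors' code) of a frame-level encoder-logit distillation loss and a tandem co-learning objective, written in PyTorch. The tensor shapes, the temperature, and the distillation weight lam are illustrative assumptions, not values from the paper.

import torch
import torch.nn.functional as F

def encoder_distillation_loss(student_enc_logits, teacher_enc_logits, temperature=1.0):
    # KL divergence between the teacher's and student's encoder output
    # distributions over word-piece units, computed per time frame.
    # Assumed shapes: (batch, time, vocab); temperature is a hypothetical choice.
    t = temperature
    student_log_probs = F.log_softmax(student_enc_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_enc_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def codert_tandem_loss(student_rnnt_loss, teacher_rnnt_loss,
                       student_enc_logits, teacher_enc_logits, lam=0.5):
    # Tandem (co-learning) objective: teacher and student transducers are
    # trained jointly, while the student's encoder logits are pulled toward
    # the teacher's (in-place distillation). The weight lam is hypothetical.
    distill = encoder_distillation_loss(student_enc_logits,
                                        teacher_enc_logits.detach())
    return student_rnnt_loss + teacher_rnnt_loss + lam * distill

In practice, the distillation term would be restricted to valid (non-padded) frames, and the temperature and weight would be tuned on a development set.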
Related papers
- How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? [99.87554379608224]
The cross-modal similarity score distribution of the cross-encoder is more concentrated, while that of the dual-encoder is nearly normal.
Only the relative order between hard negatives conveys valid knowledge, while the order information between easy negatives has little significance.
We propose a novel Contrastive Partial Ranking Distillation (CPRD) method, which implements the objective of mimicking the relative order between hard negative samples with contrastive learning.
arXiv Detail & Related papers (2024-07-10T09:10:01Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - AWEncoder: Adversarial Watermarking Pre-trained Encoders in Contrastive
Learning [18.90841192412555]
We introduce AWEncoder, an adversarial method for watermarking the pre-trained encoder in contrastive learning.
The proposed method achieves good effectiveness and robustness across different contrastive learning algorithms and downstream tasks.
arXiv Detail & Related papers (2022-08-08T07:23:37Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by 19.2% relative over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text
Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
arXiv Detail & Related papers (2022-03-10T16:41:12Z) - Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained
Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce; pre-training is promising in end-to-end Speech Translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins by processing the acoustic sequence as usual, but later behaves more like an MT encoder, producing a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z) - On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
arXiv Detail & Related papers (2020-10-06T11:50:54Z) - A Generative Approach to Titling and Clustering Wikipedia Sections [12.154365109117025]
We evaluate transformer encoders with various decoders for information organization through a new task: generation of section headings for Wikipedia articles.
Our analysis shows that decoders containing attention mechanisms over the encoder output achieve high-scoring results by generating extractive text.
A decoder without attention better facilitates semantic encoding and can be used to generate section embeddings.
arXiv Detail & Related papers (2020-05-22T14:49:07Z) - On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)