Improved Multi-Stage Training of Online Attention-based Encoder-Decoder Models
- URL: http://arxiv.org/abs/1912.12384v1
- Date: Sat, 28 Dec 2019 02:29:33 GMT
- Title: Improved Multi-Stage Training of Online Attention-based Encoder-Decoder Models
- Authors: Abhinav Garg, Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Mehul Kumar and Chanwoo Kim
- Abstract summary: We propose a refined multi-stage multi-task training strategy to improve the performance of online attention-based encoder-decoder models.
A three-stage training scheme based on three levels of architectural granularity, namely a character encoder, a byte pair encoding (BPE) based encoder, and an attention decoder, is proposed.
Our models achieve word error rates (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models, respectively.
- Score: 20.81248613653279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a refined multi-stage multi-task training strategy to improve the performance of online attention-based encoder-decoder (AED) models. A three-stage training scheme based on three levels of architectural granularity, namely a character encoder, a byte pair encoding (BPE) based encoder, and an attention decoder, is proposed. In addition, multi-task learning based on two levels of linguistic granularity, namely character and BPE, is used. We explore different pre-training strategies for the encoders, including transfer learning from a bidirectional encoder. Our encoder-decoder models with online attention show 35% and 10% relative improvement over their baselines for the smaller and bigger models, respectively. After fusion with a long short-term memory (LSTM) based external language model (LM), our models achieve word error rates (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models, respectively.
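The sketch below illustrates one way the two-level multi-task objective described above could be wired up: a shared streaming-style encoder with an auxiliary character-level branch and a BPE-level branch, plus a small attention decoder trained with cross entropy. Layer sizes, vocabulary sizes, the loss weights, the use of CTC as the objective for the character and BPE branches, and the plain soft attention standing in for the paper's online attention mechanism are all illustrative assumptions, not details taken from the paper.

```python
# Illustrative PyTorch sketch of a two-level (character + BPE) multi-task AED model.
# Sizes, loss weights, and the CTC branches are assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskOnlineAED(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, char_vocab=32, bpe_vocab=1024):
        super().__init__()
        # A unidirectional LSTM stands in for an online (streaming) encoder.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4, batch_first=True)
        self.char_head = nn.Linear(hidden, char_vocab)  # index 0 reserved for CTC blank / padding
        self.bpe_head = nn.Linear(hidden, bpe_vocab)    # index 0 reserved for CTC blank / padding
        # Small attention decoder over BPE tokens; plain dot-product attention for brevity.
        self.embed = nn.Embedding(bpe_vocab, hidden)
        self.dec_cell = nn.LSTMCell(2 * hidden, hidden)
        self.out = nn.Linear(hidden, bpe_vocab)

    def forward(self, feats, bpe_in):
        enc, _ = self.encoder(feats)          # (B, T, H)
        char_logits = self.char_head(enc)     # character-level branch
        bpe_logits = self.bpe_head(enc)       # BPE-level branch
        # Teacher-forced attention decoding over the (shifted) BPE target sequence.
        B, H = enc.size(0), enc.size(2)
        h = enc.new_zeros(B, H)
        c = enc.new_zeros(B, H)
        step_logits = []
        for t in range(bpe_in.size(1)):
            scores = torch.bmm(enc, h.unsqueeze(2)).squeeze(2)      # (B, T)
            attn = F.softmax(scores, dim=1)
            context = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)  # (B, H)
            dec_in = torch.cat([self.embed(bpe_in[:, t]), context], dim=1)
            h, c = self.dec_cell(dec_in, (h, c))
            step_logits.append(self.out(h))
        return char_logits, bpe_logits, torch.stack(step_logits, dim=1)


def multitask_loss(model, feats, feat_lens, chars, char_lens, bpe, bpe_lens,
                   w_char=0.3, w_bpe=0.3, w_att=0.4):
    """Weighted sum of character CTC, BPE CTC, and attention cross-entropy losses."""
    # For brevity the same BPE tensor feeds both the CTC branch and the teacher-forced
    # decoder; a real setup would keep separately prepared BOS/EOS-handled sequences.
    char_logits, bpe_logits, dec_logits = model(feats, bpe[:, :-1])
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    loss_char = ctc(char_logits.log_softmax(-1).transpose(0, 1), chars, feat_lens, char_lens)
    loss_bpe = ctc(bpe_logits.log_softmax(-1).transpose(0, 1), bpe, feat_lens, bpe_lens)
    loss_att = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                               bpe[:, 1:].reshape(-1), ignore_index=0)
    return w_char * loss_char + w_bpe * loss_bpe + w_att * loss_att
```

In the three-stage schedule the abstract describes, the character branch would be trained first, the BPE branch next, and the attention decoder last, reusing the encoder weights at each stage (optionally initialized by transfer from a bidirectional encoder); fusion with the external LSTM LM enters only at decoding time.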
Related papers
- 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders [34.421335513040795]
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks.
We introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
arXiv Detail & Related papers (2024-04-09T02:51:05Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task (a minimal frame-reduction sketch appears after this list).
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- Speculative Contrastive Decoding [55.378200871224074]
Large language models (LLMs) exhibit exceptional performance on language tasks, yet their auto-regressive inference is limited by high computational requirements and is sub-optimal due to exposure bias.
Inspired by speculative decoding and contrastive decoding, we introduce Speculative Contrastive Decoding (SCD), a straightforward yet powerful decoding approach.
arXiv Detail & Related papers (2023-11-15T14:15:30Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- An Exploration of Encoder-Decoder Approaches to Multi-Label Classification for Legal and Biomedical Text [20.100081284294973]
We compare four methods for multi-label classification, two based on an encoder only, and two based on an encoder-decoder.
Our results show that encoder-decoder methods outperform encoder-only methods, with a growing advantage on more complex datasets.
arXiv Detail & Related papers (2023-05-09T17:13:53Z)
- Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition [27.530537066239116]
We introduce the concept of relaxed attention, a gradual injection of a uniform distribution into the encoder-decoder attention weights during training (a minimal sketch of this relaxation appears after this list).
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the previous state of the art (4.20%) by 13.1% relative.
arXiv Detail & Related papers (2021-07-02T21:01:17Z)
- Large-scale Transfer Learning for Low-resource Spoken Language Understanding [31.013231069185387]
We propose an attention-based Spoken Language Understanding model together with three encoder enhancement strategies to overcome the data sparsity challenge.
Cross-language transfer learning and multi-task strategies improve performance by up to 4.52% and 3.89%, respectively, compared to the baseline.
arXiv Detail & Related papers (2020-08-13T03:43:05Z)
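For the "Extreme Encoder Output Frame Rate Reduction" entry above, the snippet below sketches one plausible form of a frame reduction layer: adjacent encoder frames are concatenated and projected, cutting the output frame rate by a factor of the stacking parameter per layer. The stacking factor, dimensions, and placement between encoder blocks are assumptions for illustration, not the cited paper's configuration.

```python
# Illustrative frame reduction layer: concatenate adjacent frames and project back
# down, reducing the time resolution by `stack`. Not the cited paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameReduction(nn.Module):
    def __init__(self, dim: int = 512, stack: int = 4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(dim * stack, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); pad the time axis so it divides evenly by `stack`.
        B, T, D = x.shape
        pad = (-T) % self.stack
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        # Group each run of `stack` consecutive frames into one wider frame.
        x = x.reshape(B, (T + pad) // self.stack, D * self.stack)
        return self.proj(x)
```

Stacking several such layers between encoder blocks compounds the reduction; for example, three layers with stack=4 would reduce the encoder output frame rate by a factor of 64.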
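For the "Relaxed Attention" entry above, the snippet below sketches the described mechanism: during training, the encoder-decoder attention weights are blended with a uniform distribution over the encoder frames. The coefficient gamma, its value, and any schedule for it are illustrative assumptions; at inference time the unmodified attention weights would be used.

```python
# Illustrative relaxed-attention helper: blend attention weights with a uniform
# distribution during training. The coefficient value is an assumption.
import torch


def relax_attention(attn: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """Blend attention weights of shape (..., T) with a uniform distribution over T."""
    T = attn.size(-1)
    return (1.0 - gamma) * attn + gamma * torch.full_like(attn, 1.0 / T)
```

Because the blend of two distributions still sums to one along the frame axis, the relaxed weights remain a valid attention distribution; the entry above reports that this training-time regularization helps when decoding with external language models.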