Joint Encoder-Decoder Self-Supervised Pre-training for ASR
- URL: http://arxiv.org/abs/2206.04465v1
- Date: Thu, 9 Jun 2022 12:45:29 GMT
- Title: Joint Encoder-Decoder Self-Supervised Pre-training for ASR
- Authors: Arunkumar A and Umesh S
- Abstract summary: Self-supervised learning has shown tremendous success in various speech-related downstream tasks.
In this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has shown tremendous success in various
speech-related downstream tasks, including Automatic Speech Recognition (ASR).
The output embeddings of the SSL model are treated as powerful short-time
representations of the speech signal. However, in the ASR task, the main
objective is to get the correct sequence of acoustic units, characters, or
byte-pair encodings (BPEs). Usually, an encoder-decoder architecture works
exceptionally well for a sequence-to-sequence task like ASR. Therefore, in this
paper, we propose a new paradigm that exploits the power of a decoder during
self-supervised learning. We use the Hidden Unit BERT (HuBERT) SSL framework to
compute the conventional masked prediction loss for the encoder. In addition,
we introduce a decoder into the SSL framework and propose a target
preparation strategy for the decoder. Finally, we use a multitask SSL setup
wherein we jointly optimize both the encoder and decoder losses. We hypothesize
that the presence of a decoder in the SSL model helps it learn an acoustic
unit-based language model, which might improve the performance of an ASR
downstream task. We compare our proposed SSL model with HuBERT and show up to
25% relative improvement in performance on ASR by finetuning on various
LibriSpeech subsets.
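The abstract describes the multitask setup only in words. As a minimal sketch (not the authors' code), the joint objective can be pictured as a HuBERT-style masked-prediction cross-entropy on the encoder plus a sequence cross-entropy on the decoder, mixed by a hypothetical weight `lambda_dec`; all tensor shapes and names below are assumptions for illustration.
```python
import torch
import torch.nn.functional as F

def joint_ssl_loss(encoder_logits, masked_positions, unit_targets,
                   decoder_logits, decoder_targets, lambda_dec=1.0):
    """Sketch of a joint encoder-decoder SSL objective (assumed form).

    encoder_logits:   (B, T, U) frame-level predictions over U acoustic units
    masked_positions: (B, T) bool, True where input frames were masked
    unit_targets:     (B, T) long, pseudo-labels (e.g. k-means cluster ids)
    decoder_logits:   (B, L, U) autoregressive decoder predictions
    decoder_targets:  (B, L) long, the prepared target unit sequence
    """
    # HuBERT-style masked prediction: cross-entropy on masked frames only.
    enc_loss = F.cross_entropy(encoder_logits[masked_positions],
                               unit_targets[masked_positions])
    # Decoder loss: cross-entropy over the prepared target sequence.
    dec_loss = F.cross_entropy(decoder_logits.transpose(1, 2),  # (B, U, L)
                               decoder_targets)
    # Multitask setup: jointly optimize both losses.
    return enc_loss + lambda_dec * dec_loss
```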
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training methods, such as quantization schemes and time-domain versus spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- Decoder-only Architecture for Streaming End-to-end Speech Recognition [45.161909551392085]
We propose a decoder-only architecture for blockwise streaming automatic speech recognition (ASR).
In our approach, speech features are compressed using the CTC output and context embedding, obtained with a blockwise speech subnetwork, and are sequentially provided as prompts to the decoder.
Our proposed decoder-only streaming ASR achieves an 8% relative word error rate reduction on the LibriSpeech test-other set while running twice as fast as the baseline model.
arXiv Detail & Related papers (2024-06-23T13:50:08Z)
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
- UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models [23.383924361298874]
We propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT.
The proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better than or comparable to CASS-NAT while using only an encoder.
arXiv Detail & Related papers (2024-02-14T02:11:04Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis [0.3007949058551534]
We leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech.
In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a series of encoder layers with the objective of reconstructing the SSL representations.
The richness of speech characteristics in the SSL features is reflected in the output speech quality, with the objective and subjective evaluation measures of the proposed approach outperforming the baseline FastSpeech2.
arXiv Detail & Related papers (2023-08-02T08:59:52Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep feature extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism (see the sketch after this list).
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test sets of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
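As a footnote to the streaming transformer entry above: time-restricted self-attention limits each frame to a bounded left and right context, so the encoder never has to wait for the full utterance. The sketch below builds such an attention mask under assumed window sizes; it is an illustration, not the paper's configuration.
```python
import torch

def time_restricted_mask(seq_len, left_context=16, right_context=4):
    """Boolean mask for time-restricted self-attention (illustrative).

    Frame i may attend only to frames j with
    i - left_context <= j <= i + right_context, bounding the look-ahead
    so the encoder can run in a streaming fashion.
    """
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # rel[i, j] = j - i
    return (rel >= -left_context) & (rel <= right_context)

# Usage note: torch.nn.MultiheadAttention expects a bool attn_mask with
# True marking *blocked* positions, so pass the negation:
# attn_mask = ~time_restricted_mask(T)
```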
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.