Streaming end-to-end speech recognition with jointly trained neural feature enhancement
- URL: http://arxiv.org/abs/2105.01254v1
- Date: Tue, 4 May 2021 02:25:41 GMT
- Title: Streaming end-to-end speech recognition with jointly trained neural feature enhancement
- Authors: Chanwoo Kim, Abhinav Garg, Dhananjaya Gowda, Seongkyu Mun, and Changwoo Han
- Abstract summary: We present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers.
We introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL).
- Score: 20.86554979122057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a streaming end-to-end speech recognition model
based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement
layers. Even though the MoCha attention enables streaming speech recognition
with recognition accuracy comparable to a full attention-based approach,
training this model is sensitive to factors such as the difficulty of
training examples and the choice of hyper-parameters. Because of these issues, the
speech recognition accuracy of a MoCha-based model on clean speech drops
significantly when a multi-style training approach is applied. Inspired by
Curriculum Learning [1], we introduce two training strategies: Gradual
Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss
(GREL). With GAEF, the model is initially trained using clean features.
Subsequently, the portion of outputs from the enhancement layers gradually
increases. With GREL, the weight of the Mean Squared Error (MSE) loss on the
enhanced output is gradually reduced as training proceeds. In experimental results
on the LibriSpeech corpus and noisy far-field test sets, the proposed model
with GAEF-GREL training strategies shows significantly better results than the
conventional multi-style training approach.
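To make the two schedules concrete, here is a minimal PyTorch-style sketch of how GAEF and GREL could be wired into a joint training step. The function and argument names (gaef_alpha, grel_lambda, enhancer, asr) and the linear ramp/decay schedules are illustrative assumptions; the abstract only specifies that the enhanced-feature portion grows and the MSE-loss weight shrinks as training proceeds.

```python
# Hypothetical sketch of GAEF/GREL joint training, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaef_alpha(step: int, ramp_steps: int = 100_000) -> float:
    """GAEF: fraction of enhanced features fed to the ASR model (ramps 0 -> 1)."""
    return min(1.0, step / ramp_steps)

def grel_lambda(step: int, decay_steps: int = 100_000) -> float:
    """GREL: weight of the MSE enhancement loss (decays 1 -> 0)."""
    return max(0.0, 1.0 - step / decay_steps)

def joint_training_loss(enhancer: nn.Module, asr: nn.Module,
                        noisy_feats: torch.Tensor, clean_feats: torch.Tensor,
                        targets: torch.Tensor, step: int) -> torch.Tensor:
    enhanced = enhancer(noisy_feats)          # jointly trained enhancement layers
    alpha = gaef_alpha(step)
    # GAEF: training starts from clean features and gradually hands the
    # ASR input over to the output of the enhancement layers.
    asr_input = (1.0 - alpha) * clean_feats + alpha * enhanced
    asr_loss = asr(asr_input, targets)        # e.g. a MoCha-based ASR loss
    # GREL: the MSE term anchors the enhancer early in training, then fades
    # out so the recognition loss dominates.
    enh_loss = F.mse_loss(enhanced, clean_feats)
    return asr_loss + grel_lambda(step) * enh_loss
```

Both schedules here are linear purely for simplicity; the abstract does not commit to a particular schedule shape.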
Related papers
- Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models [5.576934300567641]
This paper introduces a novel training framework called Focused Discriminative Training (FDT) to improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models.
The proposed framework identifies challenging segments of the audio and improves the model's recognition on them.
arXiv Detail & Related papers (2024-08-23T11:54:25Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Diffusion-based speech enhancement with a weighted generative-supervised learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE).
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech (a minimal sketch of such a weighted loss appears after this list).
arXiv Detail & Related papers (2023-09-19T09:13:35Z)
- End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining [0.0]
We revisit the performance comparison between two-stage and end-to-end models.
We find that audio-based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models.
arXiv Detail & Related papers (2023-09-08T17:12:14Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial for improving the quality of the speech separated from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
- A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning [12.913738983870621]
We present a canonical correlation based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model.
We show that our CC-STOI based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions.
arXiv Detail & Related papers (2022-02-11T16:48:41Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Improving Music Performance Assessment with Contrastive Learning [78.8942067357231]
This study investigates contrastive learning as a potential method to improve existing music performance assessment (MPA) systems.
We introduce a weighted contrastive loss suitable for regression tasks applied to a convolutional neural network.
Our results show that contrastive-based methods are able to match and exceed SoTA performance for MPA regression tasks.
arXiv Detail & Related papers (2021-08-03T19:24:25Z)
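As referenced in the diffusion-based SE entry above, a weighted generative-supervised loss can be sketched as follows. This is an assumption-laden illustration only: the score-model signature, the one-step denoised estimate, and the mixing weight gamma are not taken from that paper.

```python
# Hypothetical weighted generative-supervised SE loss; the score-model
# signature and the weighting scheme are assumptions for illustration.
import torch

def weighted_se_loss(score_model, x_clean: torch.Tensor,
                     x_noisy: torch.Tensor, sigma: float,
                     gamma: float = 0.5) -> torch.Tensor:
    # Generative term: denoising score matching at noise level sigma.
    noise = torch.randn_like(x_clean)
    x_perturbed = x_clean + sigma * noise
    score = score_model(x_perturbed, x_noisy, sigma)
    diffusion_loss = ((sigma * score + noise) ** 2).mean()
    # Supervised term: MSE between a one-step denoised estimate (Tweedie's
    # formula for Gaussian noise) and the ground-truth clean speech.
    x_hat = x_perturbed + (sigma ** 2) * score
    mse_loss = ((x_hat - x_clean) ** 2).mean()
    return (1.0 - gamma) * diffusion_loss + gamma * mse_loss
```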