A Deliberation-based Joint Acoustic and Text Decoder
- URL: http://arxiv.org/abs/2303.15293v1
- Date: Thu, 23 Mar 2023 18:02:23 GMT
- Title: A Deliberation-based Joint Acoustic and Text Decoder
- Authors: Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu
- Abstract summary: We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data.
Our method, dubbed Deliberation-JATD, combines the spelling correcting abilities of deliberation with JATD's use of unpaired text data to further improve performance.
- Score: 25.37972380217875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new two-pass E2E speech recognition model that improves ASR
performance by training on a combination of paired data and unpaired text data.
Previously, the joint acoustic and text decoder (JATD) has shown promising
results through the use of text data during model training and the recently
introduced deliberation architecture has reduced recognition errors by
leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD,
combines the spelling correcting abilities of deliberation with JATD's use of
unpaired text data to further improve performance. The proposed model produces
substantial gains across multiple test sets, especially those focused on rare
words, where it reduces word error rate (WER) by between 12% and 22.5%
relative. This is done without increasing model size or requiring multi-stage
training, making Deliberation-JATD an efficient candidate for on-device
applications.
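As a rough illustration of the two-pass idea (not the paper's actual implementation): a first pass produces n-best hypotheses, and a second-pass deliberation step rescores them using acoustic information in addition to the first-pass text. A minimal toy sketch in Python; all hypotheses, scores, and the interpolation weight are invented for illustration:

```python
# Toy sketch of two-pass rescoring. All scores and weights are made up
# and do not reflect the Deliberation-JATD model itself.

def first_pass(nbest):
    """Return (hypothesis, first-pass score) pairs, best first."""
    return sorted(nbest.items(), key=lambda kv: kv[1], reverse=True)

def deliberation_rescore(nbest, acoustic_scores, lam=0.5):
    """Interpolate first-pass scores with acoustic-aware second-pass scores."""
    rescored = {
        hyp: (1 - lam) * fp_score + lam * acoustic_scores[hyp]
        for hyp, fp_score in nbest.items()
    }
    return max(rescored, key=rescored.get)

nbest = {"recognize speech": 0.6, "wreck a nice beach": 0.7}
acoustic = {"recognize speech": 0.9, "wreck a nice beach": 0.2}

print(first_pass(nbest)[0][0])                # first-pass winner
print(deliberation_rescore(nbest, acoustic))  # winner after rescoring
```

Here the acoustically informed second pass overturns the first-pass ranking, which is the spelling-correction behavior deliberation contributes in the abstract above.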
Related papers
- Text Injection for Capitalization and Turn-Taking Prediction in Speech Models [45.94388391693112]
This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model.
We show results demonstrating that our text injection method boosts capitalization performance for long-tail data.
arXiv Detail & Related papers (2023-08-14T18:28:04Z)
- TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization [54.41494515178297]
We reformulate speaker diarization as a single-label classification problem.
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlap and dependency can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rates.
arXiv Detail & Related papers (2023-03-08T05:05:26Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose a novel bidirectional attention mechanism (BiAM) to jointly learn an ASR encoder (bottom layers) and a text encoder with a multi-modal learning method.
BiAM facilitates feature sampling-rate exchange, so that the quality of features transformed from one modality can be measured in the other modality's space.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
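The relative WER reduction (WERR) figures quoted throughout these abstracts measure the fraction of baseline errors removed. A quick sketch with hypothetical WER values:

```python
def relative_werr(baseline_wer, new_wer):
    """Relative WER reduction: fraction of the baseline's errors eliminated."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a baseline at 10.0% WER improved to 9.0% WER
print(f"{relative_werr(10.0, 9.0):.2%}")  # 10.00% relative reduction
```

Note that a 1-point absolute drop from a 10% baseline is a 10% relative reduction, which is why relative figures can look large even for modest absolute gains.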
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
- JOIST: A Joint Speech and Text Streaming Model For ASR [63.15848310748753]
We present JOIST, an algorithm for training a streaming, cascaded-encoder end-to-end (E2E) model with both paired speech-text inputs and unpaired text-only inputs.
We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text.
arXiv Detail & Related papers (2022-10-13T20:59:22Z)
- An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one avenue for safeguarding user information used to train deep models, by imposing noisy distortion on private data.
In this work, we extend PATE learning to dynamic patterns, namely speech, and perform the first experimental study on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z)
- CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation [91.16551253297588]
COunterfactual Generation via Retrieval and Editing (CORE) is a retrieval-augmented generation framework for creating diverse counterfactual perturbations for training.
CORE first performs a dense retrieval over a task-related unlabeled text corpus using a learned bi-encoder.
CORE then incorporates these into prompts to a large language model with few-shot learning capabilities, for counterfactual editing.
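The retrieve-then-edit flow described above can be sketched roughly as follows; the toy embeddings, mini-corpus, and prompt template are invented stand-ins, not CORE's actual bi-encoder or prompts:

```python
# Toy retrieve-then-edit sketch: dense retrieval by dot product, then a
# prompt for an LLM editor. All data and the template are hypothetical.

def retrieve(query_vec, corpus):
    """Return the corpus text whose toy embedding best matches the query."""
    score = lambda item: sum(q * d for q, d in zip(query_vec, item["vec"]))
    return max(corpus, key=score)["text"]

def build_prompt(original, retrieved):
    """Assemble a few-shot-style prompt requesting a counterfactual edit."""
    return (f"Original: {original}\n"
            f"Related example: {retrieved}\n"
            f"Rewrite the original so its label flips:")

corpus = [
    {"text": "The food was bland.", "vec": [0.9, 0.1]},
    {"text": "The service was slow.", "vec": [0.1, 0.9]},
]
prompt = build_prompt("The food was delicious.", retrieve([1.0, 0.0], corpus))
print(prompt)
```

In the real framework the retrieval runs over a large unlabeled corpus with a learned bi-encoder, and the prompt is completed by a few-shot-capable large language model.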
arXiv Detail & Related papers (2022-10-10T17:45:38Z)
- Improving Deliberation by Text-Only and Semi-Supervised Training [42.942428288428836]
We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve 4%-12% WER reduction for various tasks compared to the baseline deliberation.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
arXiv Detail & Related papers (2022-06-29T15:30:44Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-pseudo-label pairs and synthesized-audio-text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.