Cross-Modal Transformer-Based Neural Correction Models for Automatic
Speech Recognition
- URL: http://arxiv.org/abs/2107.01569v1
- Date: Sun, 4 Jul 2021 07:58:31 GMT
- Title: Cross-Modal Transformer-Based Neural Correction Models for Automatic
Speech Recognition
- Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi
Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima
- Abstract summary: We propose cross-modal transformer-based neural correction models that refine the output of an automatic speech recognition system.
Experiments on Japanese natural language ASR tasks demonstrated that our proposed models achieve better ASR performance than conventional neural correction models.
- Score: 31.2558640840697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose cross-modal transformer-based neural correction models that
refine the output of an automatic speech recognition (ASR) system so as to
remove ASR errors. Generally, neural correction models are composed of
encoder-decoder networks, which can directly model sequence-to-sequence mapping
problems. The most successful method is to use both input speech and its ASR
output text as the input contexts for the encoder-decoder networks. However,
the conventional method cannot take into account the relationships between
these two modalities because the input contexts are encoded separately for
each modality. To effectively leverage the correlated information between the
two modalities, our proposed models encode the two different contexts jointly
on the basis of cross-modal self-attention using a
transformer. We expect that cross-modal self-attention can effectively capture
the relationships between the two modalities when refining ASR hypotheses. We
also introduce a shallow fusion technique to efficiently integrate the
first-pass ASR model and our proposed neural correction model. Experiments on
Japanese natural language ASR tasks demonstrated that our proposed models
achieve better ASR performance than conventional neural correction models.
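The joint encoding described above can be made concrete with a short sketch. The following PyTorch-style code is a minimal illustration under assumed shapes and hyperparameters, not the authors' implementation; the learned modality-embedding scheme is also an assumption, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Jointly encodes speech features and first-pass ASR tokens.

    Both modalities are projected to a shared width, tagged with a
    learned modality embedding, and concatenated along the time axis;
    a standard Transformer encoder then lets self-attention span both
    modalities ("cross-modal"). Positional encodings are omitted to
    keep the sketch short.
    """

    def __init__(self, speech_dim=80, vocab_size=8000, d_model=256,
                 nhead=4, num_layers=6):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.modality_embed = nn.Embedding(2, d_model)  # 0 = speech, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, speech_feats, hyp_tokens):
        # speech_feats: (batch, T_speech, speech_dim) acoustic features
        # hyp_tokens:   (batch, T_text) token ids of the ASR hypothesis
        s = self.speech_proj(speech_feats) + self.modality_embed.weight[0]
        t = self.token_embed(hyp_tokens) + self.modality_embed.weight[1]
        joint = torch.cat([s, t], dim=1)  # (batch, T_speech + T_text, d_model)
        return self.encoder(joint)        # attention mixes the two modalities
```

A decoder would then cross-attend to this joint memory to emit the corrected transcript. The shallow-fusion step can likewise be sketched as a log-linear interpolation of the two models' scores during beam search; the interpolation weight below is a hypothetical value that would be tuned on development data.

```python
def shallow_fusion_scores(asr_logprobs, corr_logprobs, lam=0.3):
    """Combine per-token log-probabilities from the first-pass ASR
    model and the correction model when scoring beam candidates.
    Both inputs are (batch, vocab) log-softmax tensors; lam is an
    assumed interpolation weight."""
    return asr_logprobs + lam * corr_logprobs
```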
Related papers
- Whispering LLaMA: A Cross-Modal Generative Error Correction Framework
for Speech Recognition [10.62060432965311]
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR).
Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts.
arXiv Detail & Related papers (2023-10-10T09:04:33Z)
- Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems [27.483603895258437]
We introduce a novel approach to contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system.
We propose filtering algorithms to handle large context lists, and performance-balancing mechanisms to control the biasing degree of the model; a toy sketch of the filtering idea follows this entry.
Experiments show that the proposed method achieves up to a 51% relative word error rate (WER) reduction over the baseline ASR system and outperforms traditional biasing methods.
arXiv Detail & Related papers (2022-03-02T06:00:48Z)
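The summary above does not specify the filtering algorithm, so the following toy Python sketch shows only one plausible shape of the idea: prune a large context list to the phrases that share character n-grams with the ASR hypothesis. The function name, the n-gram test, and both thresholds are hypothetical.

```python
def filter_context_list(hypothesis, context_phrases, n=3, min_overlap=0.5):
    """Keep only context phrases plausibly relevant to the hypothesis.

    A phrase survives if at least `min_overlap` of its character
    n-grams also occur in the ASR hypothesis. Real systems may use
    prefix trees or phonetic similarity instead; this is a toy test.
    """
    def ngrams(s, n):
        s = s.lower().replace(" ", "")
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

    hyp_grams = ngrams(hypothesis, n)
    kept = []
    for phrase in context_phrases:
        grams = ngrams(phrase, n)
        if len(grams & hyp_grams) / len(grams) >= min_overlap:
            kept.append(phrase)
    return kept
```

For example, filter_context_list("call jon smith", ["Jon Smith", "Acme Corp"]) keeps only "Jon Smith", leaving the correction model a small candidate set to bias toward.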
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost; a toy sketch of the token serialization follows this entry.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
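The token-level serialization at the heart of t-SOT can be illustrated with a toy reconstruction (not the paper's code): tokens from all speakers are merged in emission-time order, and a channel-change token is inserted whenever adjacent tokens switch speakers.

```python
CC = "<cc>"  # channel-change token

def serialize_tokens(token_streams):
    """Merge per-speaker (time, token) streams into one serialized
    sequence, t-SOT style: sort all tokens by emission time and mark
    every speaker switch with a channel-change token.

    token_streams: list over speakers of [(emit_time, token), ...]
    """
    tagged = [(t, spk, tok)
              for spk, stream in enumerate(token_streams)
              for (t, tok) in stream]
    tagged.sort(key=lambda x: x[0])  # order by emission time

    out, prev_spk = [], None
    for _, spk, tok in tagged:
        if prev_spk is not None and spk != prev_spk:
            out.append(CC)
        out.append(tok)
        prev_spk = spk
    return out
```

For two overlapping speakers emitting ("hello" at 0.0 s, "world" at 0.5 s) and ("hi" at 0.3 s), the serialized stream is ["hello", "<cc>", "hi", "<cc>", "world"], which a single-output streaming model can be trained to produce.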
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary; a loose sketch of the fusion idea follows this entry.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
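The multi-hypothesis idea in the summarization entry above can be sketched loosely: encode each of the N-best ASR hypotheses, then let a learned query attention-pool over them so the summarizer is not committed to a single, possibly erroneous transcript. Layer sizes and the pooling scheme are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HypothesisFusion(nn.Module):
    """Attention-pool fixed-size encodings of N-best ASR hypotheses
    into one representation for a downstream summarizer."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # learned query

    def forward(self, hyp_encodings):
        # hyp_encodings: (batch, n_best, d_model), one vector per hypothesis
        q = self.query.expand(hyp_encodings.size(0), -1, -1)
        fused, _ = self.attn(q, hyp_encodings, hyp_encodings)
        return fused.squeeze(1)  # (batch, d_model) fused representation
```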
- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model that incorporates a time-reduction layer inside the Transformer encoder layers (sketched after this entry).
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD), which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z)
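A time-reduction layer of the kind named in the entry above is commonly implemented by concatenating adjacent encoder frames and projecting back to the model width, shortening the sequence that subsequent layers must attend over. The sketch below assumes a stride of 2 and makes no claim about the paper's exact placement or sizes.

```python
import torch.nn as nn

class TimeReductionLayer(nn.Module):
    """Reduce time resolution by a factor of `stride`: concatenate
    `stride` adjacent frames, then project back to d_model."""

    def __init__(self, d_model=256, stride=2):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(d_model * stride, d_model)

    def forward(self, x):
        # x: (batch, T, d_model); drop trailing frames so stride divides T
        b, t, d = x.shape
        t = (t // self.stride) * self.stride
        x = x[:, :t, :].reshape(b, t // self.stride, d * self.stride)
        return self.proj(x)  # (batch, T // stride, d_model)
```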
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector-quantized variational autoencoder (VQ-VAE) and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced by a relative 14% with joint models trained on small amounts of in-domain data; a generic multi-task sketch follows this entry.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
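The joint correction-and-LU entry above is a classic multi-task setup. A generic sketch (all sizes, heads, and the pooling are assumptions, not the paper's architecture): a shared encoder over the ASR output text feeds a per-token correction head and a pooled intent-classification head, trained with a weighted sum of the two cross-entropy losses.

```python
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    """Shared text encoder with two task heads: token-level ASR
    correction and utterance-level intent classification (LU)."""

    def __init__(self, vocab_size=8000, n_intents=20, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.correction_head = nn.Linear(d_model, vocab_size)  # per token
        self.intent_head = nn.Linear(d_model, n_intents)       # per utterance

    def forward(self, asr_tokens):
        # asr_tokens: (batch, T) token ids of the (possibly noisy) ASR output
        h = self.encoder(self.embed(asr_tokens))          # (batch, T, d_model)
        token_logits = self.correction_head(h)            # corrected-token scores
        intent_logits = self.intent_head(h.mean(dim=1))   # mean-pooled LU scores
        return token_logits, intent_logits
```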
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism; a sketch of the time-restricted mask follows this entry.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test sets of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
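Time-restricted self-attention, used in the entry above, bounds each frame's attention to a fixed window so latency stays constant during streaming. The helper below builds such a mask in the boolean convention of torch.nn.MultiheadAttention (True blocks attention); the window sizes are illustrative, and triggered attention for the encoder-decoder connection is a separate mechanism not sketched here.

```python
import torch

def time_restricted_mask(seq_len, left_context=20, right_context=4):
    """Boolean mask for time-restricted self-attention: position i may
    attend only to positions in [i - left_context, i + right_context].
    True marks a blocked position, matching the attn_mask convention
    of torch.nn.MultiheadAttention."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # rel[i, j] = j - i
    return (rel < -left_context) | (rel > right_context)
```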