Attentional Speech Recognition Models Misbehave on Out-of-domain
Utterances
- URL: http://arxiv.org/abs/2002.05150v1
- Date: Wed, 12 Feb 2020 18:53:56 GMT
- Title: Attentional Speech Recognition Models Misbehave on Out-of-domain
Utterances
- Authors: Phillip Keung, Wei Niu, Yichao Lu, Julian Salazar, Vikas Bhardwaj
- Abstract summary: We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus.
We observe that there are many 5-second recordings that produce more than 500 characters of decoding output.
A frame-synchronous hybrid (DNN-HMM) model trained on the same data does not produce these unusually long transcripts.
- Score: 16.639133822656458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We discuss the problem of echographic transcription in autoregressive
sequence-to-sequence attentional architectures for automatic speech
recognition, where a model produces very long sequences of repetitive outputs
when presented with out-of-domain utterances. We decode audio from the British
National Corpus with an attentional encoder-decoder model trained solely on the
LibriSpeech corpus. We observe that there are many 5-second recordings that
produce more than 500 characters of decoding output (i.e. more than 100
characters per second). A frame-synchronous hybrid (DNN-HMM) model trained on
the same data does not produce these unusually long transcripts. These decoding
issues are reproducible in a speech transformer model from ESPnet, and to a
lesser extent in a self-attention CTC model, suggesting that these issues are
intrinsic to the use of the attention mechanism. We create a separate length
prediction model to predict the correct number of wordpieces in the output,
which allows us to identify and truncate problematic decoding results without
increasing word error rates on the LibriSpeech task.
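To make the failure mode and mitigation concrete, here is a minimal sketch (not the authors' code): a character-rate sanity check in the spirit of the paper's observation that 5-second recordings can yield more than 500 characters, plus truncation to a separately predicted wordpiece count as the abstract describes. The function names and the 25 chars/s threshold are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' code): flag hypotheses whose character
# rate is implausible for the audio duration. The paper reports 5-second
# recordings producing >500 characters (>100 chars/s); the 25 chars/s cap
# below is an illustrative threshold only.

def is_echographic(transcript: str, audio_seconds: float,
                   max_chars_per_second: float = 25.0) -> bool:
    """Return True if the hypothesis is implausibly long for the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return len(transcript) / audio_seconds > max_chars_per_second

def truncate_hypothesis(wordpieces: list[str], predicted_len: int) -> list[str]:
    """Truncate a runaway hypothesis to a separately predicted wordpiece count."""
    return wordpieces[:predicted_len]

# Example: 5 seconds of audio decoded into 512 characters (>100 chars/s).
print(is_echographic("a " * 256, audio_seconds=5.0))  # True
```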
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at rates as low as 5 Hz and 60 bps and achieves SotA segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
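As an illustration of the merging step described above (a hedged sketch, not SyllableLM's actual method): pool frame-level features into coarser unit features given boundary indices. SyllableLM learns such boundaries self-supervised; this stub simply takes them as given.

```python
# Illustrative sketch only: mean-pool frame features into coarser
# syllable-like units at given boundary indices. SyllableLM learns the
# boundaries without supervision; here they are provided by hand.
import numpy as np

def merge_units(frames: np.ndarray, boundaries: list[int]) -> np.ndarray:
    """Pool (T, d) frame features into (num_units, d) unit features."""
    segments = np.split(frames, boundaries)  # split at unit boundaries
    return np.stack([seg.mean(axis=0) for seg in segments])

frames = np.random.randn(50, 16)                      # 1 s of 50 Hz features
units = merge_units(frames, boundaries=[12, 25, 40])  # 4 syllable-like units
print(units.shape)                                    # (4, 16)
```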
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28x decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in the continuous space $\mathbb{R}^d$ and autoregressively generating these sequences.
High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
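A hedged sketch of the core idea, autoregressive generation over continuous vectors rather than discrete tokens; the `step` predictor below stands in for the paper's learned diffusion head and is purely illustrative.

```python
# Illustrative sketch: autoregressive generation of continuous vectors in
# R^d. `step` stands in for a learned next-vector predictor (the paper
# uses a diffusion head); here it is a toy decaying mean of the prefix.
import numpy as np

def step(prefix: np.ndarray) -> np.ndarray:
    return 0.9 * prefix.mean(axis=0)

def generate(d: int = 8, num_steps: int = 5) -> np.ndarray:
    rng = np.random.default_rng(0)
    seq = [rng.standard_normal(d)]       # start-of-sequence vector
    for _ in range(num_steps):
        seq.append(step(np.stack(seq)))  # feed the growing prefix back in
    return np.stack(seq)

print(generate().shape)  # (6, 8): start vector plus 5 generated vectors
```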
arXiv Detail & Related papers (2024-06-08T18:57:13Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating a target sequence from a given input source sequence.
Traditionally, most seq2seq tasks are solved with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder into a mel-spectrogram with the help of the VQ-VAE, and the vocoder then transforms the generated mel-spectrogram into a waveform.
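The stage order reads naturally as function composition; the following is a hedged sketch with hypothetical callables, not the actual Diffsound API.

```python
# Hedged sketch of the pipeline above with hypothetical callables; each
# argument stands in for a trained network, not the real Diffsound code.

def text_to_sound(text, text_encoder, token_decoder, vq_vae_decoder, vocoder):
    text_features = text_encoder(text)         # text -> text features
    mel_tokens = token_decoder(text_features)  # features -> discrete tokens
    mel = vq_vae_decoder(mel_tokens)           # tokens -> mel-spectrogram
    return vocoder(mel)                        # mel-spectrogram -> waveform
```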
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech [10.291482850329892]
We propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal.
We achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
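One common way to realize such conditioning (a sketch under assumptions, not the paper's architecture) is to broadcast the utterance-level speaker embedding across time and concatenate it with the mixture features.

```python
# Illustrative sketch: condition a separator on a speaker embedding by
# tiling it across time frames and concatenating it with the mixture
# features. The paper extracts the embedding from the mixed signal itself.
import numpy as np

def condition_on_speaker(mixture_feats: np.ndarray,
                         speaker_emb: np.ndarray) -> np.ndarray:
    """(T, d_mix) features + (d_spk,) embedding -> (T, d_mix + d_spk)."""
    tiled = np.tile(speaker_emb, (mixture_feats.shape[0], 1))
    return np.concatenate([mixture_feats, tiled], axis=1)

feats = np.random.randn(100, 40)               # 100 frames of mixture features
emb = np.random.randn(64)                      # utterance-level speaker embedding
print(condition_on_speaker(feats, emb).shape)  # (100, 104)
```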
arXiv Detail & Related papers (2021-12-10T23:07:48Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
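To illustrate what limiting the left context means in practice, here is a minimal sketch (not the paper's implementation) of a self-attention mask with a bounded look-back window.

```python
# Minimal sketch: a self-attention mask where each frame attends only to
# itself and at most `left_context` past frames, which bounds per-step
# compute and makes streaming decoding tractable.
import numpy as np

def limited_left_context_mask(seq_len: int, left_context: int) -> np.ndarray:
    """mask[i, j] is True iff position i may attend to position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j >= i - left_context)

print(limited_left_context_mask(5, 2).astype(int))
```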
arXiv Detail & Related papers (2020-02-07T00:04:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.