Structured State Space Decoder for Speech Recognition and Synthesis
- URL: http://arxiv.org/abs/2210.17098v1
- Date: Mon, 31 Oct 2022 06:54:23 GMT
- Title: Structured State Space Decoder for Speech Recognition and Synthesis
- Authors: Koichi Miyazaki, Masato Murata, Tomoki Koriyama
- Abstract summary: A structured state space model (S4) has recently been proposed, producing promising results for various long-sequence modeling tasks.
In this study, we applied S4 as a decoder for ASR and text-to-speech tasks and compared it with the Transformer decoder.
For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25% on the LibriSpeech test-clean/test-other sets.
- Score: 9.354721572095272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) systems developed in recent years have
shown promising results with self-attention models (e.g., Transformer and
Conformer), which are replacing conventional recurrent neural networks.
Meanwhile, a structured state space model (S4) has recently been proposed,
producing promising results for various long-sequence modeling tasks, including
raw speech classification. The S4 model can be trained in parallel, in the same
way as the Transformer model. In this study, we applied S4 as a decoder for ASR
and text-to-speech (TTS) tasks and compared it with the Transformer decoder.
For the ASR task, our experimental results demonstrate that the proposed model
achieves a competitive word error rate (WER) of 1.88%/4.25% on the LibriSpeech
test-clean/test-other sets and a character error rate (CER) of
3.80%/2.63%/2.98% on the CSJ eval1/eval2/eval3 sets. Furthermore, the proposed
model is more robust than the standard Transformer model, particularly for
long-form speech, on both datasets. For the TTS task, the proposed method
outperforms the Transformer baseline.
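The parallel-training property mentioned in the abstract comes from a standard identity for linear state space models: the same layer can be evaluated as a step-by-step recurrence (convenient for decoding) or unrolled into a single convolution over the whole sequence (convenient for training). The NumPy sketch below demonstrates only that equivalence; it is not the paper's S4 implementation, and the diagonal state matrix plus the naive kernel loop are illustrative stand-ins (computing this kernel efficiently via a structured parameterization is precisely S4's contribution).

```python
import numpy as np

def ssm_kernel(A, B, C, L):
    """Unroll the discrete SSM x_k = A x_{k-1} + B u_k, y_k = C x_k into a
    length-L convolution kernel K = (CB, CAB, CA^2B, ...)."""
    K = np.zeros(L)
    x = B.copy()                      # holds A^k B
    for k in range(L):
        K[k] = C @ x                  # K_k = C A^k B
        x = A @ x
    return K

def run_recurrent(A, B, C, u):
    """Step-by-step recurrence: constant state per step (decoding view)."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

def run_convolutional(A, B, C, u):
    """The same map computed as one convolution over the full sequence,
    which is the view that enables Transformer-style parallel training."""
    K = ssm_kernel(A, B, C, len(u))
    return np.convolve(u, K)[: len(u)]

rng = np.random.default_rng(0)
N, L = 4, 16
A = np.diag(rng.uniform(0.1, 0.9, size=N))   # stable diagonal A (illustrative)
B = rng.normal(size=N)
C = rng.normal(size=N)
u = rng.normal(size=L)
assert np.allclose(run_recurrent(A, B, C, u), run_convolutional(A, B, C, u))
print("recurrent and convolutional outputs match")
```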
Related papers
- A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR [0.31077024712075796]
Punctuation and word casing prediction are necessary for automatic speech recognition (ASR).
We propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time.
arXiv Detail & Related papers (2024-07-18T04:01:12Z)
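As a rough illustration of the joint prediction in the entry above: one shared per-token encoder state can feed two classification heads, one for punctuation and one for casing, so a single forward pass produces both labels. Everything below (label sets, dimensions, the random stand-in encoder states) is hypothetical, not the paper's architecture.

```python
import numpy as np

PUNCT_LABELS = ["", ",", ".", "?"]             # hypothetical label set
CASE_LABELS = ["lower", "Capitalized", "UPPER"]

rng = np.random.default_rng(0)
D = 32                                          # hidden size (illustrative)
W_punct = rng.normal(size=(D, len(PUNCT_LABELS)))
W_case = rng.normal(size=(D, len(CASE_LABELS)))

def predict_joint(hidden):
    """Two linear heads read the same per-token encoder state, so one
    forward pass yields both labels: the 'joint' part of the model."""
    punct = PUNCT_LABELS[int(np.argmax(hidden @ W_punct))]
    case = CASE_LABELS[int(np.argmax(hidden @ W_case))]
    return punct, case

# Random vectors stand in for real per-token encoder states.
for token in ["hello", "world"]:
    print(token, predict_joint(rng.normal(size=D)))
```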
- Augmenting conformers with structured state-space sequence models for online speech recognition [41.444671189679994]
Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems.
In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4).
We performed systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions.
Our best model achieves WERs of 4.01%/8.53% on test sets from LibriSpeech, outperforming Conformers with extensively tuned convolution.
arXiv Detail & Related papers (2023-09-15T17:14:17Z)
- Transformer-based approaches to Sentiment Detection [55.41644538483948]
We examined the performance of four different types of state-of-the-art transformer models for text classification.
The RoBERTa transformer model performs best on the test dataset with a score of 82.6% and is recommended when high-quality predictions are required.
arXiv Detail & Related papers (2023-03-13T17:12:03Z)
- Non-autoregressive sequence-to-sequence voice conversion [47.521186595305984]
This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.
We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence.
arXiv Detail & Related papers (2021-04-14T11:53:51Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
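For intuition about the relative positional encoding in the entry above: one common variant (in the spirit of Shaw et al.) adds a learned bias to each attention logit that depends only on the clipped distance between query and key positions, so the same parameters transfer to sequence lengths and timing variations unseen during training. The sketch below uses that bias-only variant with made-up sizes; the paper's actual adaptation to the Speech Transformer may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rel_attention(Q, K, V, rel_bias, max_dist):
    """Self-attention whose logits receive a bias indexed by the clipped
    relative distance j - i rather than by absolute position."""
    L, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)
    dist = np.arange(L)[None, :] - np.arange(L)[:, None]   # j - i
    idx = np.clip(dist, -max_dist, max_dist) + max_dist    # to [0, 2*max_dist]
    logits = logits + rel_bias[idx]                        # (L, L) bias
    return softmax(logits) @ V

rng = np.random.default_rng(0)
L, d, max_dist = 6, 8, 3
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
rel_bias = rng.normal(size=2 * max_dist + 1)   # one learned scalar per distance
print(rel_attention(Q, K, V, rel_bias, max_dist).shape)   # (6, 8)
```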
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in automatic speech recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on the test-clean/test-other sets.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
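The "convolution-augmented" part of the Conformer above pairs self-attention (global context) with a convolution module (local context). Below is a heavily simplified NumPy sketch of that module's core pipeline: pointwise expansion, GLU gating, depthwise convolution, pointwise projection, residual. Normalization, Swish, dropout, and all dimensions are illustrative simplifications, not the published module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def depthwise_conv1d(x, kernels):
    """Per-channel 1-D convolution over time: cheap local context mixing,
    complementing the global context that self-attention provides."""
    L, D = x.shape
    k = kernels.shape[0]
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)))
    # Kernels reversed so np.convolve computes cross-correlation,
    # matching the usual neural-network convention.
    return np.stack(
        [np.convolve(pad[:, c], kernels[::-1, c], mode="valid") for c in range(D)],
        axis=1,
    )

def conv_module(x, W_in, dw_kernels, W_out):
    """Simplified Conformer convolution module: pointwise expansion,
    GLU gating, depthwise convolution, pointwise projection, residual."""
    h = x @ W_in                           # pointwise conv to 2x channels
    a, b = np.split(h, 2, axis=-1)
    h = a * sigmoid(b)                     # GLU gate
    h = depthwise_conv1d(h, dw_kernels)    # local mixing along time
    return x + h @ W_out                   # residual connection

rng = np.random.default_rng(0)
L, D, k = 10, 16, 5
x = rng.normal(size=(L, D))
W_in = 0.1 * rng.normal(size=(D, 2 * D))
dw_kernels = 0.1 * rng.normal(size=(k, D))
W_out = 0.1 * rng.normal(size=(D, D))
print(conv_module(x, W_in, dw_kernels, W_out).shape)   # (10, 16)
```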
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector-quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
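To make the two-component decomposition above concrete: a VQ codebook turns continuous speech frames into discrete tokens, so TTS reduces to "translating" text tokens into speech tokens, which the codebook then decodes back into frames for a vocoder. The quantizer and the token translator below are toy stand-ins with invented sizes, not the paper's trained VQ-VAE or Transformer-NMT.

```python
import numpy as np

class ToyVectorQuantizer:
    """Stand-in for a trained VQ-VAE codebook: encodes each speech frame to
    its nearest codebook index and decodes indices back to frames."""
    def __init__(self, codebook):
        self.codebook = codebook                       # (num_codes, frame_dim)

    def encode(self, frames):
        dists = ((frames[:, None, :] - self.codebook[None]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)                    # one token per frame

    def decode(self, tokens):
        return self.codebook[tokens]                   # nearest-code frames

def toy_translate(text_tokens, num_codes):
    """Stub for the autoregressive Transformer-NMT component: a real system
    is a trained seq2seq model emitting one speech-token index per step."""
    return np.array([ord(t) % num_codes for t in text_tokens])

rng = np.random.default_rng(0)
vq = ToyVectorQuantizer(rng.normal(size=(8, 4)))       # 8 codes, 4-dim frames
speech_tokens = toy_translate("hello", num_codes=8)    # text -> speech tokens
frames = vq.decode(speech_tokens)                      # tokens -> frames
assert (vq.encode(frames) == speech_tokens).all()      # round trip is exact
print(speech_tokens, frames.shape)                     # vocoder input shape
```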