A Comparison of Label-Synchronous and Frame-Synchronous End-to-End
Models for Speech Recognition
- URL: http://arxiv.org/abs/2005.10113v2
- Date: Mon, 25 May 2020 08:56:51 GMT
- Title: A Comparison of Label-Synchronous and Frame-Synchronous End-to-End
Models for Speech Recognition
- Authors: Linhao Dong, Cheng Yi, Jianzong Wang, Shiyu Zhou, Shuang Xu, Xueli
Jia, Bo Xu
- Abstract summary: We make a detailed comparison between a representative label-synchronous model (transformer) and a soft frame-synchronous model (based on continuous integrate-and-fire, CIF).
The results on three public datasets and a large-scale dataset with 12000 hours of training data show that the two types of models have respective advantages that are consistent with their synchronous modes.
- Score: 35.14176176739817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end models are gaining wider attention in the field of automatic
speech recognition (ASR). One of their advantages is the simplicity of
construction: neural networks directly transcribe the speech frame sequence
into the text label sequence. According to the driving end in the recognition
process, end-to-end ASR models can be categorized into two types:
label-synchronous and frame-synchronous, each of which has unique model
behaviour and characteristics. In this work, we make a detailed comparison
between a representative label-synchronous model (the transformer) and a soft
frame-synchronous model (the continuous integrate-and-fire (CIF) based model).
The results on three public datasets and a large-scale dataset with 12000
hours of training data show that the two types of models have respective
advantages that are consistent with their synchronous modes.
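To make the frame-synchronous side of this comparison concrete, below is a minimal sketch of the CIF firing rule in Python. It is a toy reconstruction under simplifying assumptions (a fixed threshold of 1.0 and no rescaling of the weights to the target label count), not the paper's exact formulation.

```python
import numpy as np

def cif_fire(frame_embeddings, alphas, threshold=1.0):
    """Continuous integrate-and-fire: accumulate per-frame weights and
    fire one integrated label embedding each time the accumulated
    weight crosses the threshold (toy version; assumes each alpha < 1)."""
    accum = 0.0                                       # accumulated weight
    integrated = np.zeros_like(frame_embeddings[0])   # weighted frame sum
    fired = []
    for h, a in zip(frame_embeddings, alphas):
        if accum + a < threshold:
            # Not enough weight yet: keep integrating this frame.
            accum += a
            integrated += a * h
        else:
            # Spend just enough of this frame's weight to complete the
            # current label, fire it, and carry the remainder forward.
            used = threshold - accum
            fired.append(integrated + used * h)
            accum = a - used
            integrated = accum * h
    return np.stack(fired) if fired else np.empty((0, frame_embeddings.shape[1]))

# Toy usage: 6 frames of 4-dim features whose weights sum to 2.2,
# so the mechanism fires twice, i.e. emits two label embeddings.
frames = np.random.randn(6, 4)
alphas = np.array([0.3, 0.4, 0.5, 0.2, 0.3, 0.5])
print(cif_fire(frames, alphas).shape)  # (2, 4)
```

The loop advances frame by frame and label boundaries emerge from the accumulated weights, which is what makes the model frame-synchronous; a transformer decoder instead advances label by label, attending over all frames at each decoding step.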
Related papers
- On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures [19.823015917720284]
We evaluate the utility of synthetic data for training automatic speech recognition.
We reproduce the original training data with synthesized speech, training ASR systems solely on synthetic data.
We show that the TTS models generalize well, even when training scores indicate overfitting.
arXiv Detail & Related papers (2024-07-25T12:44:45Z)
- ALERT-Transformer: Bridging Asynchronous and Synchronous Machine Learning for Real-Time Event-based Spatio-Temporal Data [8.660721666999718]
We propose a hybrid pipeline composed of asynchronous sensing and synchronous processing.
We achieve state-of-the-art performance with lower latency than competitors.
arXiv Detail & Related papers (2024-02-02T13:17:19Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency [66.8685113725007]
BiCro aims to estimate soft labels for noisy data pairs to reflect their true correspondence degree.
Experiments on three popular cross-modal matching datasets demonstrate that BiCro significantly improves the noise-robustness of various matching models.
arXiv Detail & Related papers (2023-03-22T09:33:50Z)
- Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition? [36.67937514793215]
Cross-modal attention is seen as an effective mechanism for multi-modal fusion.
We implement and compare a cross-attention and a self-attention model.
We compare the models using different modality combinations for a 7-class emotion classification task (a minimal sketch of the two attention patterns follows this entry).
arXiv Detail & Related papers (2022-02-18T15:44:14Z)
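As a minimal sketch of the two fusion patterns this entry compares, using PyTorch; the dimensions, modality names, and the single shared attention module are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

d = 64                           # shared feature dimension (illustrative)
audio = torch.randn(8, 50, d)    # batch of 8, 50 audio frames
text = torch.randn(8, 20, d)     # batch of 8, 20 text tokens
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

# Self-attention fusion: concatenate the modalities and let every
# position attend to every other, within and across modalities.
fused = torch.cat([audio, text], dim=1)    # (8, 70, d)
self_out, _ = attn(fused, fused, fused)

# Cross-attention fusion: audio positions act as queries over the text
# sequence (keys/values), so one modality is summarized via the other.
cross_out, _ = attn(audio, text, text)     # (8, 50, d)
```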
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture (a toy example of the token serialization follows this entry).
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
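t-SOT serializes tokens from overlapping speakers into a single stream ordered by emission time, inserting a special channel-change token whenever adjacent tokens switch speakers. The sketch below is a toy reconstruction of that serialization; the token name `<cc>` and the timestamped input format are assumptions for illustration.

```python
CC = "<cc>"  # channel-change token (illustrative name)

def serialize_tokens(streams):
    """Merge per-speaker (time, token) streams into one token sequence,
    sorted by emission time, inserting CC whenever the speaker changes."""
    events = sorted(
        (t, spk, tok) for spk, stream in enumerate(streams) for t, tok in stream
    )
    out, prev_spk = [], None
    for _, spk, tok in events:
        if prev_spk is not None and spk != prev_spk:
            out.append(CC)
        out.append(tok)
        prev_spk = spk
    return out

# Two overlapping speakers whose tokens interleave in time.
spk0 = [(0.2, "hello"), (0.9, "world")]
spk1 = [(0.5, "good"), (1.1, "morning")]
print(serialize_tokens([spk0, spk1]))
# ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```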
- Improving Label Quality by Jointly Modeling Items and Annotators [68.8204255655161]
We propose a fully Bayesian framework for learning ground truth labels from noisy annotators.
Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic Dawid and Skene joint annotator-data model (a minimal EM sketch of that classic model follows this entry).
arXiv Detail & Related papers (2021-06-20T02:15:20Z)
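For context, here is a minimal EM sketch of the classic Dawid and Skene model that the framework factors into; this is the textbook algorithm with illustrative variable names, not the paper's full Bayesian treatment.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=20):
    """Minimal EM for the classic Dawid-Skene annotator model.
    labels[i, a] = class annotator a gave item i, or -1 if missing."""
    n_items, n_annot = labels.shape
    # Initialize per-item class posteriors from vote counts.
    post = np.ones((n_items, n_classes)) / n_classes
    for i in range(n_items):
        for a in range(n_annot):
            if labels[i, a] >= 0:
                post[i, labels[i, a]] += 1
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices.
        prior = post.mean(axis=0)
        conf = np.full((n_annot, n_classes, n_classes), 1e-6)  # smoothed
        for i in range(n_items):
            for a in range(n_annot):
                if labels[i, a] >= 0:
                    conf[a, :, labels[i, a]] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: recompute posteriors given priors and confusions.
        log_post = np.log(prior)[None, :].repeat(n_items, axis=0)
        for i in range(n_items):
            for a in range(n_annot):
                if labels[i, a] >= 0:
                    log_post[i] += np.log(conf[a, :, labels[i, a]])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post  # posterior over the true label of each item

# Toy usage: 3 items, 3 annotators, binary labels.
votes = np.array([[0, 0, 1], [1, 1, 1], [0, 1, 0]])
print(dawid_skene(votes, n_classes=2).round(2))
```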
- Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept [56.46135010588918]
We prove that the widely used class of RNN-Transducer models and segmental models (direct HMMs) are equivalent.
It is shown that blank probabilities translate into segment length probabilities and vice versa (a toy numerical reading follows this entry).
arXiv Detail & Related papers (2021-04-13T11:20:48Z)
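A toy numerical reading of that translation, assuming a single segment starting at frame 0 with per-frame blank probabilities b_t; this illustrates the direction of the mapping, not the paper's proof.

```python
# Per-frame blank probabilities from a hypothetical transducer.
blanks = [0.9, 0.8, 0.6, 0.3]

# Reading "blank at frames 0..k-2, non-blank at frame k-1" as
# "the segment has length k" gives a segment-length distribution.
seg_len = []
stay = 1.0
for b in blanks:
    seg_len.append(stay * (1.0 - b))  # end the segment at this frame
    stay *= b                         # or emit blank and extend it
print([round(p, 4) for p in seg_len])  # [0.1, 0.18, 0.288, 0.3024]
# Residual mass (all frames blank) covers segments longer than the window.
```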
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop (a toy conditioning sketch follows this entry).
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
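A toy sketch of the central idea of putting an invertible transform inside the autoregressive decoder loop: at each step, a block of waveform samples is produced by pushing noise through a flow conditioned on the decoder state. The single affine layer and all names and shapes here are illustrative assumptions, not the Wave-Tacotron architecture.

```python
import torch
import torch.nn as nn

class ToyFlowDecoderStep(nn.Module):
    """One autoregressive step: condition an invertible affine 'flow'
    on the decoder state and push Gaussian noise through it."""
    def __init__(self, state_dim, block_len):
        super().__init__()
        # Predict per-sample log-scale and shift from the decoder state.
        self.to_log_scale = nn.Linear(state_dim, block_len)
        self.to_shift = nn.Linear(state_dim, block_len)

    def forward(self, state):
        z = torch.randn(state.shape[0], self.to_shift.out_features)
        # Invertible affine map: x = z * exp(log_scale) + shift.
        x = z * torch.exp(self.to_log_scale(state)) + self.to_shift(state)
        return x  # one block of waveform samples per decoder step

step = ToyFlowDecoderStep(state_dim=128, block_len=40)
state = torch.randn(2, 128)   # decoder state for a batch of 2
wave_block = step(state)      # (2, 40) waveform samples for this step
```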
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.