A systematic comparison of grapheme-based vs. phoneme-based label units
for encoder-decoder-attention models
- URL: http://arxiv.org/abs/2005.09336v3
- Date: Thu, 15 Apr 2021 16:59:10 GMT
- Title: A systematic comparison of grapheme-based vs. phoneme-based label units
for encoder-decoder-attention models
- Authors: Mohammad Zeineldeen, Albert Zeyer, Wei Zhou, Thomas Ng, Ralf Schlüter, Hermann Ney
- Abstract summary: We do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model.
Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive with grapheme-based encoder-decoder-attention modeling.
- Score: 42.761409598613845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the rationale of end-to-end modeling, CTC, RNN-T, or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units derived from, e.g., byte-pair encoding (BPE). The mapping from pronunciation to spelling is then learned entirely from data. In contrast, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we perform a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simplified and efficient decoder design, we also extend the phoneme set with auxiliary units that make homophones distinguishable. Experiments on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive with grapheme-based encoder-decoder-attention modeling.
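To make the homophone point concrete, here is a minimal Python sketch (hypothetical, not the authors' implementation; the toy lexicon and the "#n" auxiliary-unit convention are assumptions) of how a phoneme label inventory can be extended so that words sharing a pronunciation still map to unique output-label sequences:

    # Minimal sketch (hypothetical, not the paper's code) of a phoneme
    # output-label inventory extended with auxiliary units so that
    # homophones remain distinguishable in the decoder output.
    from collections import defaultdict

    # Toy pronunciation lexicon: word -> phoneme sequence (ARPAbet-style).
    LEXICON = {
        "two":  ["T", "UW"],
        "too":  ["T", "UW"],
        "to":   ["T", "UW"],
        "red":  ["R", "EH", "D"],
        "read": ["R", "EH", "D"],  # past-tense pronunciation
    }

    def build_disambiguated_lexicon(lexicon):
        """Append auxiliary units (#1, #2, ...) to the phoneme sequences
        of homophones so each word gets a unique label sequence."""
        by_pron = defaultdict(list)
        for word, phones in lexicon.items():
            by_pron[tuple(phones)].append(word)
        out = {}
        for phones, words in by_pron.items():
            if len(words) == 1:
                out[words[0]] = list(phones)
            else:
                for i, word in enumerate(sorted(words), start=1):
                    out[word] = list(phones) + [f"#{i}"]
        return out

    for word, labels in sorted(build_disambiguated_lexicon(LEXICON).items()):
        print(f"{word:5s} -> {' '.join(labels)}")  # e.g. to -> T UW #1

Because every word then has a unique label sequence, the decoder output stays invertible to words without an external disambiguation pass, which is what keeps the decoder design simple.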
Related papers
- Phoneme-aware Encoding for Prefix-tree-based Contextual ASR [45.161909551392085]
Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose.
We propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations.
arXiv Detail & Related papers (2023-12-15T07:37:09Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding [5.697227044927832]
We propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder.
Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors (a toy sketch of this idea appears after this list).
Experimental results show that our scheme outperforms the state-of-the-art results on Libriphrase hard dataset.
arXiv Detail & Related papers (2023-08-12T05:41:15Z)
- IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining [8.129944388402839]
This paper inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP).
IPA-CLIP comprises a pronunciation encoder together with the original CLIP encoders (image and text).
arXiv Detail & Related papers (2023-03-06T13:59:37Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transform the text features extracted by the text encoder into a mel-spectrogram with the help of the VQ-VAE, and the vocoder then transforms the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- Label Semantics for Few Shot Named Entity Recognition [68.01364012546402]
We study the problem of few-shot learning for named entity recognition.
We leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors.
Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder.
arXiv Detail & Related papers (2022-03-16T23:21:05Z)
- Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network.
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
- A Dual-Decoder Conformer for Multilingual Speech Recognition [4.594159253008448]
This work proposes a dual-decoder transformer model for low-resource multilingual speech recognition for Indian languages.
We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict the grapheme sequence along with language information.
Our experiments show that we can obtain a significant reduction in WER over the baseline approaches.
arXiv Detail & Related papers (2021-08-22T09:22:28Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- LSTM Acoustic Models Learn to Align and Pronounce with Graphemes [22.453756228457017]
We propose a grapheme-based speech recognizer that can be trained in a purely data-driven fashion.
We show that the grapheme models are competitive in WER with their phoneme-output counterparts when trained on large datasets.
arXiv Detail & Related papers (2020-08-13T21:38:36Z)
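As referenced in the keyword-spotting entry above, the following toy Python sketch illustrates the general text-to-phoneme-to-embedding idea. It is an illustration under stated assumptions (a hard-coded G2P table, random per-phoneme vectors, and mean pooling), not the paper's actual encoder:

    # Toy sketch (assumptions throughout, not the paper's model): encode
    # text as the mean of per-phoneme vectors after a G2P lookup.
    import numpy as np

    # Assumed toy G2P table standing in for a real grapheme-to-phoneme model.
    G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

    DIM = 8  # assumed embedding size
    rng = np.random.default_rng(0)
    # Assumed "representative phoneme vectors": one fixed random vector per phoneme.
    PHONEME_VECS = {p: rng.standard_normal(DIM)
                    for phones in G2P.values() for p in phones}

    def embed_text(text):
        """Map text -> phonemes (via the toy G2P table) -> mean phoneme vector."""
        phones = [p for w in text.lower().split() for p in G2P[w]]
        return np.mean([PHONEME_VECS[p] for p in phones], axis=0)

    print(embed_text("hello world").shape)  # -> (8,)

In a real system the G2P table would be a learned model and the mean pooling would be replaced by a trained encoder; the point here is only the order of the mapping: graphemes to phonemes to a fixed-dimensional embedding.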
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.