Phoneme-aware Encoding for Prefix-tree-based Contextual ASR
- URL: http://arxiv.org/abs/2312.09582v1
- Date: Fri, 15 Dec 2023 07:37:09 GMT
- Title: Phoneme-aware Encoding for Prefix-tree-based Contextual ASR
- Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant
Arora, Shinji Watanabe
- Abstract summary: Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose.
We propose extending it with phoneme-aware encoding to better recognize words with unusual pronunciations.
- Score: 45.161909551392085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In speech recognition applications, it is important to recognize
context-specific rare words, such as proper nouns. Tree-constrained Pointer
Generator (TCPGen) has shown promise for this purpose, which efficiently biases
such words with a prefix tree. While the original TCPGen relies on
grapheme-based encoding, we propose extending it with phoneme-aware encoding to
better recognize words with unusual pronunciations. As TCPGen handles biasing
words as subword units, we propose obtaining subword-level phoneme-aware
encoding by using alignment between phonemes and subwords. Furthermore, we
propose injecting phoneme-level predictions from CTC into queries of TCPGen so
that the model better interprets the phoneme-aware encodings. We conducted ASR
experiments with TCPGen for an RNN transducer. We observed that the proposed
phoneme-aware encoding outperformed ordinary grapheme-based encoding on both
the English LibriSpeech and Japanese CSJ datasets, demonstrating the robustness
of our approach across linguistically diverse languages.
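The subword-level phoneme-aware encoding described in the abstract can be sketched roughly as follows. This is an illustrative assumption, not the authors' method: the paper derives phoneme-subword alignments from CTC, whereas the toy function below simply splits the phoneme sequence in proportion to subword length, and the phoneme inventory in the example is hand-made.

```python
# Hypothetical sketch: assign each subword a contiguous span of the word's
# phoneme sequence, so phoneme embeddings could later be pooled per subword.
# The length-proportional split is an illustrative stand-in for the
# CTC-based alignment used in the paper.

def align_phonemes_to_subwords(phonemes, subwords):
    """Split `phonemes` into one contiguous span per subword,
    roughly in proportion to each subword's character length."""
    total = sum(len(s) for s in subwords)
    spans, start, consumed = [], 0, 0
    for i, sw in enumerate(subwords):
        consumed += len(sw)
        # last subword absorbs any rounding remainder
        end = len(phonemes) if i == len(subwords) - 1 \
            else round(consumed / total * len(phonemes))
        spans.append(phonemes[start:end])
        start = end
    return spans

# Toy example: a word split into subwords with a rough phoneme sequence.
subwords = ["wa", "tana", "be"]
phonemes = ["W", "AA", "T", "AA", "N", "AA", "B", "EY"]
print(align_phonemes_to_subwords(phonemes, subwords))
# → [['W', 'AA'], ['T', 'AA', 'N', 'AA'], ['B', 'EY']]
```

Each span can then be pooled (e.g. averaged) into a single phoneme-aware vector per subword, matching the granularity at which TCPGen handles biasing words.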
Related papers
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining [8.129944388402839]
This paper inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP).
The resulting IPA-CLIP comprises a pronunciation encoder and the original CLIP encoders (image and text).
arXiv Detail & Related papers (2023-03-06T13:59:37Z)
- Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition [19.372248692745167]
This paper proposes the use of graph neural network (GNN) encodings in a tree-constrained pointer generator (TCPGen) component for end-to-end contextual ASR.
TCPGen with GNN encodings achieved about a further 15% relative WER reduction on the biasing words compared to the original TCPGen.
arXiv Detail & Related papers (2022-07-02T15:12:18Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
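The pseudo-language idea above can be illustrated with a toy sketch. This is not the actual Wav2Seq recipe; the clustering, feature vectors, and function names below are illustrative assumptions, showing only the general pattern of mapping frame-level features to discrete units and compacting them into pseudo tokens.

```python
# Illustrative sketch: induce a "pseudo language" from audio features by
# (1) assigning each frame to its nearest cluster centroid, and
# (2) collapsing runs of identical units into a compact token sequence.

def kmeans_assign(frames, centroids):
    """Assign each feature frame to the index of its nearest centroid
    (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist(f, centroids[k]))
            for f in frames]

def to_pseudo_tokens(unit_ids):
    """Collapse consecutive repeats, yielding a compact discrete
    representation usable as a self-supervised target sequence."""
    tokens = []
    for u in unit_ids:
        if not tokens or tokens[-1] != u:
            tokens.append(u)
    return tokens

# Toy 2-D "features" and two centroids standing in for a learned codebook.
frames = [(0.1, 0.2), (0.1, 0.1), (0.9, 0.8), (0.9, 0.9), (0.1, 0.2)]
centroids = [(0.0, 0.0), (1.0, 1.0)]
units = kmeans_assign(frames, centroids)
print(to_pseudo_tokens(units))  # → [0, 1, 0]
```

The resulting pseudo-token sequence plays the role of the transcript in a pseudo speech recognition task, which is what lets the decoder be pre-trained without labels.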
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition [16.160767678589895]
TCPGen is proposed, which incorporates contextual knowledge, given as a list of biasing words, into both attention-based encoder-decoder and transducer end-to-end ASR models.
TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut to facilitate recognising biasing words during decoding.
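The prefix-tree structure at the heart of TCPGen can be sketched minimally as follows. The class and function names, and the toy subword tokenizations, are hypothetical and for illustration only; they show how a trie over subword units restricts which biasing words remain reachable during decoding.

```python
# Minimal sketch of a prefix tree (trie) over subword units, the symbolic
# input TCPGen uses to bias decoding toward a list of rare words.

class TrieNode:
    def __init__(self):
        self.children = {}      # subword -> TrieNode
        self.is_word_end = False

def build_prefix_tree(tokenized_words):
    """Build a trie from biasing words, each given as a subword sequence."""
    root = TrieNode()
    for subwords in tokenized_words:
        node = root
        for sw in subwords:
            node = node.children.setdefault(sw, TrieNode())
        node.is_word_end = True
    return root

def valid_next_subwords(root, prefix):
    """Subwords that extend `prefix` toward some biasing word,
    or [] once the prefix has left the tree."""
    node = root
    for sw in prefix:
        node = node.children.get(sw)
        if node is None:
            return []
    return sorted(node.children)

# Biasing list with a shared prefix: "tokyo" and "tokushima" (toy splits).
tree = build_prefix_tree([["to", "kyo"], ["to", "ku", "shima"]])
print(valid_next_subwords(tree, ["to"]))  # → ['ku', 'kyo']
```

Restricting attention to the valid continuations at each step is what makes the biasing efficient: the pointer distribution only needs to score subwords that can still complete a biasing word.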
arXiv Detail & Related papers (2021-09-01T21:41:59Z)
- A Dual-Decoder Conformer for Multilingual Speech Recognition [4.594159253008448]
This work proposes a dual-decoder transformer model for low-resource multilingual speech recognition for Indian languages.
We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict grapheme sequence along with language information.
Our experiments show that we can obtain a significant reduction in WER over the baseline approaches.
arXiv Detail & Related papers (2021-08-22T09:22:28Z)
- Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation [127.54315184545796]
Speech translation (ST) aims to learn transformations from speech in the source language to the text in the target language.
We propose to improve the multitask ST model by utilizing word embedding as the intermediate.
arXiv Detail & Related papers (2020-05-21T14:22:35Z)
- A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models [42.761409598613845]
We do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model.
Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive to grapheme-based encoder-decoder-attention modeling.
arXiv Detail & Related papers (2020-05-19T09:54:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.