Transformer based Grapheme-to-Phoneme Conversion
- URL: http://arxiv.org/abs/2004.06338v2
- Date: Fri, 26 Jun 2020 21:09:53 GMT
- Title: Transformer based Grapheme-to-Phoneme Conversion
- Authors: Sevinj Yolchuyeva, Géza Németh, Bálint Gyires-Tóth
- Abstract summary: In this paper, we investigate the application of transformer architecture to G2P conversion.
We compare its performance with recurrent and convolutional neural network based approaches.
The results show that transformer-based G2P outperforms the convolutional-based approach in terms of word error rate.
- Score: 0.9023847175654603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention mechanism is one of the most successful techniques in deep
learning based Natural Language Processing (NLP). The transformer network
architecture is based entirely on attention mechanisms, and it outperforms
sequence-to-sequence models in neural machine translation without using
recurrent or convolutional layers. Grapheme-to-phoneme (G2P) conversion is the
task of converting letters (a grapheme sequence) to their pronunciations (a
phoneme sequence). It plays a significant role in text-to-speech (TTS) and
automatic speech recognition (ASR) systems. In this paper, we investigate the
application of the transformer architecture to G2P conversion and compare its
performance with recurrent and convolutional neural network based approaches.
Phoneme and word error rates are evaluated on the CMUDict dataset for US
English and the NetTalk dataset. The results show that transformer-based G2P
outperforms the convolutional-based approach in terms of word error rate, and
our results significantly exceed those of previous recurrent approaches
(without attention) on both word and phoneme error rates on both datasets.
Furthermore, the proposed model is much smaller than previous approaches.
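As a rough illustration of the setup the abstract describes, here is a minimal PyTorch sketch of a transformer encoder-decoder for G2P. The vocabulary sizes, model dimensions, and toy batch are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TransformerG2P(nn.Module):
    def __init__(self, n_graphemes=30, n_phonemes=42, d_model=128,
                 nhead=4, num_layers=3, max_len=64):
        super().__init__()
        self.src_emb = nn.Embedding(n_graphemes, d_model)   # letter ids
        self.tgt_emb = nn.Embedding(n_phonemes, d_model)    # phoneme ids
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned positions
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, n_phonemes)

    def forward(self, graphemes, phonemes):
        # graphemes: (B, S) letter ids; phonemes: (B, T) shifted phoneme ids
        src = self.src_emb(graphemes) + self.pos_emb(
            torch.arange(graphemes.size(1), device=graphemes.device))
        tgt = self.tgt_emb(phonemes) + self.pos_emb(
            torch.arange(phonemes.size(1), device=phonemes.device))
        # Causal mask: each output phoneme attends only to earlier ones.
        mask = self.transformer.generate_square_subsequent_mask(
            phonemes.size(1)).to(graphemes.device)
        return self.out(self.transformer(src, tgt, tgt_mask=mask))

model = TransformerG2P()
letters = torch.randint(0, 30, (2, 8))    # toy batch of grapheme sequences
phons = torch.randint(0, 42, (2, 9))      # toy shifted phoneme targets
print(model(letters, phons).shape)        # torch.Size([2, 9, 42]) logits
```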
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
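A minimal sketch of the differential-attention idea as summarized here: two softmax attention maps are computed and subtracted, so attention mass that both maps assign to irrelevant context cancels. The single-head layout and the learnable lam initialization are simplifying assumptions, not the paper's exact recipe:

```python
import math
import torch
import torch.nn as nn

class DiffAttention(nn.Module):
    """Single-head differential attention: a1 - lam * a2 over shared values."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q1, self.q2 = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.k1, self.k2 = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.lam = nn.Parameter(torch.tensor(0.5))   # learnable weight (assumed init)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x):  # x: (B, T, d_model)
        a1 = torch.softmax(self.q1(x) @ self.k1(x).transpose(-2, -1) * self.scale, -1)
        a2 = torch.softmax(self.q2(x) @ self.k2(x).transpose(-2, -1) * self.scale, -1)
        # Noise attention common to both maps cancels in the difference.
        return (a1 - self.lam * a2) @ self.v(x)

attn = DiffAttention()
print(attn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```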
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Transformers meet Neural Algorithmic Reasoners [16.5785372289558]
We propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs).
We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning.
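A hedged sketch of the hybrid pattern this summary suggests: transformer token states cross-attend to node embeddings produced by a GNN-style reasoner. The toy message-passing step and all dimensions are assumptions for illustration, not the TransNAR architecture:

```python
import torch
import torch.nn as nn

class TinyNAR(nn.Module):
    """Stand-in reasoner: one round of mean-aggregated message passing."""
    def __init__(self, d=64):
        super().__init__()
        self.msg = nn.Linear(d, d)

    def forward(self, nodes, adj):            # nodes: (B, N, d), adj: (B, N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        return nodes + torch.relu(adj @ self.msg(nodes) / deg)

class TransNARBlock(nn.Module):
    def __init__(self, d=64, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, nhead, batch_first=True)

    def forward(self, tokens, node_states):
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]
        # Text tokens query the reasoner's node states for algorithmic hints.
        return tokens + self.cross_attn(tokens, node_states, node_states)[0]

nar, block = TinyNAR(), TransNARBlock()
nodes = nar(torch.randn(2, 8, 64), torch.ones(2, 8, 8))  # toy graph states
print(block(torch.randn(2, 16, 64), nodes).shape)        # torch.Size([2, 16, 64])
```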
arXiv Detail & Related papers (2024-06-13T16:42:06Z)
- LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion [18.83348872103488]
Grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations.
Existing methods are either slow or perform poorly, and their application scenarios are limited.
We propose a novel method named LiteG2P which is fast, light and theoretically parallel.
arXiv Detail & Related papers (2023-03-02T09:16:21Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
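A minimal sketch of such a hybrid design: a convolutional backbone extracts local features, a transformer encoder adds global context, and a regression head emits one continuous depth value per pixel. Layer counts and sizes are illustrative assumptions, not the TransDepth architecture:

```python
import torch
import torch.nn as nn

class HybridDepthNet(nn.Module):
    def __init__(self, d=64, nhead=4, layers=2):
        super().__init__()
        self.backbone = nn.Sequential(            # CNN: local features, 4x downsample
            nn.Conv2d(3, d, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d, d, 3, stride=2, padding=1), nn.ReLU())
        enc = nn.TransformerEncoderLayer(d, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, layers)  # global context
        self.head = nn.Sequential(                # back to input resolution
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(d, 1, 3, padding=1))        # one continuous value per pixel

    def forward(self, img):                       # img: (B, 3, H, W)
        f = self.backbone(img)                    # (B, d, H/4, W/4)
        b, d, h, w = f.shape
        seq = self.transformer(f.flatten(2).transpose(1, 2))  # tokens = pixels
        f = seq.transpose(1, 2).reshape(b, d, h, w)
        return self.head(f)                       # (B, 1, H, W) depth map

net = HybridDepthNet()
print(net(torch.randn(1, 3, 64, 64)).shape)       # torch.Size([1, 1, 64, 64])
```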
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Transformer Based Deliberation for Two-Pass Speech Recognition [46.86118010771703]
Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
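A schematic sketch of the two-pass pattern described above: a cheap first pass emits tokens immediately, then a deliberation pass attends to both the audio encoding and the first-pass hypothesis. Every module and size here is an illustrative assumption, not the paper's architecture:

```python
import torch
import torch.nn as nn

d, vocab = 64, 100
first_pass = nn.Linear(d, vocab)                   # cheap frame-synchronous decoder
hyp_emb = nn.Embedding(vocab, d)
deliberate = nn.MultiheadAttention(d, 4, batch_first=True)
second_pass = nn.Linear(d, vocab)

audio_enc = torch.randn(2, 50, d)                  # (B, frames, d) encoder output

# Pass 1: emit a token per frame immediately (low latency, lower accuracy).
hyp = first_pass(audio_enc).argmax(-1)             # (B, 50) token ids

# Pass 2: deliberation attends over the audio AND the first-pass hypothesis.
context = torch.cat([audio_enc, hyp_emb(hyp)], 1)  # (B, 100, d)
refined, _ = deliberate(audio_enc, context, context)
final = second_pass(refined).argmax(-1)            # more context, higher accuracy
print(hyp.shape, final.shape)                      # both torch.Size([2, 50])
```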
arXiv Detail & Related papers (2021-01-27T18:05:22Z)
- Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition [66.47000813920617]
We propose a decoupled transformer model to use monolingual paired data and unpaired text data.
The model is decoupled into two parts: audio-to-phoneme (A2P) network and phoneme-to-text (P2T) network.
By using monolingual data and unpaired text data, the decoupled transformer model reduces the high dependency on code-switching paired training data of E2E model.
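A minimal sketch of this decoupling: an A2P stage maps acoustic features to phonemes, and a separately trainable P2T stage maps phonemes to text, so the P2T part can also learn from text-only data. Modules and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

n_phonemes, n_tokens, d = 42, 500, 64

a2p = nn.Sequential(nn.Linear(80, d), nn.ReLU(), nn.Linear(d, n_phonemes))
p2t_emb = nn.Embedding(n_phonemes, d)
p2t_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, 4, batch_first=True), num_layers=2)
p2t_out = nn.Linear(d, n_tokens)

feats = torch.randn(2, 120, 80)                       # (B, frames, mel bins)
phoneme_ids = a2p(feats).argmax(-1)                   # A2P: acoustics -> phonemes
text_logits = p2t_out(p2t_enc(p2t_emb(phoneme_ids)))  # P2T: phonemes -> text

# The P2T stage can also train on text-only data: convert text to phonemes
# with a lexicon/G2P, so no paired code-switching audio is required for it.
print(text_logits.shape)                              # torch.Size([2, 120, 500])
```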
arXiv Detail & Related papers (2020-10-28T07:46:15Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
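A minimal sketch of this kind of parameter transfer: when the VC modules share shapes with pretrained ASR/TTS counterparts, their weights can be copied in directly before fine-tuning. The randomly initialized stand-ins below take the place of real pretrained checkpoints:

```python
import torch
import torch.nn as nn

d = 64
# Stand-ins for separately pretrained models; in practice these weights would
# be loaded from real ASR/TTS checkpoints rather than random initialization.
asr_encoder = nn.GRU(80, d, batch_first=True)
tts_decoder = nn.GRU(d, 80, batch_first=True)

# The VC model's parts share shapes, so pretrained weights drop in directly.
vc_encoder = nn.GRU(80, d, batch_first=True)
vc_decoder = nn.GRU(d, 80, batch_first=True)
vc_encoder.load_state_dict(asr_encoder.state_dict())   # ASR -> VC encoder
vc_decoder.load_state_dict(tts_decoder.state_dict())   # TTS -> VC decoder
print(torch.equal(vc_encoder.weight_ih_l0, asr_encoder.weight_ih_l0))  # True
```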
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR [0.0]
Deep Transformer models have proven to be particularly powerful in language modeling tasks for ASR.
Recent studies showed that a considerable part of the knowledge of neural network Language Models (LM) can be transferred to traditional n-grams by using neural text generation based data augmentation.
We show that although data augmentation with Transformer-generated text works well for isolating languages, it causes a vocabulary explosion in a morphologically rich language.
We propose a new method called subword-based neural text augmentation, where we retokenize the generated text into statistically derived subwords.
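A toy sketch of the retokenization step: text generated by the LM is segmented into a closed inventory of statistically derived subwords, keeping the n-gram vocabulary bounded. The greedy longest-match segmenter and the tiny subword set are illustrative assumptions (a real inventory would come from BPE or unigram training):

```python
def retokenize(word, subwords):
    """Greedy longest-match segmentation of one word into known subwords."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest piece first
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])         # fall back to single chars
                i = j
                break
    return pieces

# Tiny hand-picked inventory, just to show the effect on rich morphology.
inventory = {"ház", "am", "ak", "ban", "nak"}    # Hungarian-ish subwords
generated = "házamban házaknak".split()          # e.g. LM-generated word forms
corpus = [" ".join(retokenize(w, inventory)) for w in generated]
print(corpus)  # ['ház am ban', 'ház ak nak'] -- closed subword vocabulary
```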
arXiv Detail & Related papers (2020-07-14T10:22:05Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
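A sketch of the general relative-position scheme (in the style of Shaw et al.): a learned bias indexed by clipped query-key distance is added to the attention logits, so the model conditions on offsets rather than absolute positions. This illustrates the family of methods, not the paper's exact Speech Transformer adaptation:

```python
import math
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, d=64, max_dist=16):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.max_dist = max_dist
        # One learned scalar bias per clipped relative distance.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_dist + 1))
        self.scale = 1.0 / math.sqrt(d)

    def forward(self, x):                          # x: (B, T, d)
        T = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale        # (B, T, T)
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
        logits = logits + self.rel_bias[rel + self.max_dist] # offset-indexed bias
        return torch.softmax(logits, dim=-1) @ v

attn = RelPosSelfAttention()
print(attn(torch.randn(2, 40, 64)).shape)          # torch.Size([2, 40, 64])
```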
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features like Mel-frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
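A minimal sketch of supervised feature-to-feature training as described: a recurrent network regresses natural-speech features from aligned whisper features under an MSE loss. The feature size, model, and toy batch are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Whisper2Natural(nn.Module):
    def __init__(self, n_feats=13, d=64):           # 13 MFCCs per frame (assumed)
        super().__init__()
        self.rnn = nn.GRU(n_feats, d, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d, n_feats)

    def forward(self, x):                           # x: (B, frames, n_feats)
        h, _ = self.rnn(x)
        return self.out(h)                          # predicted natural features

model = Whisper2Natural()
whisper = torch.randn(4, 200, 13)                   # toy whisper-feature batch
natural = torch.randn(4, 200, 13)                   # aligned natural targets
loss = nn.functional.mse_loss(model(whisper), natural)
loss.backward()                                     # one supervised training step
print(float(loss))
```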
arXiv Detail & Related papers (2020-04-20T14:47:46Z)