LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme
conversion
- URL: http://arxiv.org/abs/2303.01086v1
- Date: Thu, 2 Mar 2023 09:16:21 GMT
- Title: LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme
conversion
- Authors: Chunfeng Wang, Peisong Huang, Yuxiang Zou, Haoyu Zhang, Shichao Liu,
Xiang Yin, Zejun Ma
- Abstract summary: Grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations.
Existing methods are either slow or inaccurate, which limits their application scenarios.
We propose a novel method named LiteG2P which is fast, light and theoretically parallel.
- Score: 18.83348872103488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a key component of automated speech recognition (ASR) and the front-end in
text-to-speech (TTS), grapheme-to-phoneme (G2P) plays the role of converting
letters to their corresponding pronunciations. Existing methods are either slow
or inaccurate, which limits their application scenarios, particularly for
on-device inference. In this paper, we integrate the
advantages of both expert knowledge and connectionist temporal classification
(CTC) based neural network and propose a novel method named LiteG2P which is
fast, light and theoretically parallel. With its carefully crafted design,
LiteG2P can be deployed both in the cloud and on device. Experimental results on the
CMU dataset show that the performance of the proposed method is superior to the
state-of-the-art CTC based method with 10 times fewer parameters, and even
comparable to the state-of-the-art Transformer-based sequence-to-sequence model
with fewer parameters and 33 times less computation.
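The CTC decoding step that such a G2P model relies on can be illustrated with a minimal sketch. The collapse rule below (merge consecutive repeats, then drop blanks) is the standard greedy CTC decode; the phoneme labels and blank symbol are illustrative assumptions, not taken from the paper, and the per-letter network outputs are presumed to have been computed already.

```python
# Greedy CTC decoding sketch for grapheme-to-phoneme conversion.
# Assumes a network has already emitted one label per input position;
# here we hard-code an example framewise output for the word "cat".

BLANK = "<b>"  # illustrative blank symbol


def ctc_greedy_decode(frame_labels):
    """Collapse a framewise label sequence into a phoneme sequence:
    merge consecutive duplicates, then remove blank symbols."""
    phonemes = []
    prev = None
    for lab in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the CTC blank.
        if lab != prev and lab != BLANK:
            phonemes.append(lab)
        prev = lab
    return phonemes


# Example framewise argmax output for "cat" (ARPAbet-style labels)
print(ctc_greedy_decode(["K", "K", BLANK, "AE", "AE", BLANK, "T"]))
# -> ['K', 'AE', 'T']
```

Because each position is decoded independently before the collapse, this step is trivially parallel across timesteps, which is the property the abstract alludes to.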
Related papers
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Bit Cipher -- A Simple yet Powerful Word Representation System that Integrates Efficiently with Language Models [4.807347156077897]
Bit-cipher is a word representation system that eliminates the need for backpropagation by relying on hyper-efficient dimensionality reduction techniques.
We perform probing experiments on part-of-speech (POS) tagging and named entity recognition (NER) to assess bit-cipher's competitiveness with classic embeddings.
By replacing embedding layers with cipher embeddings, our experiments illustrate the notable efficiency of cipher in accelerating the training process and attaining better optima.
arXiv Detail & Related papers (2023-11-18T08:47:35Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
Through gradient-based optimization, DecT can be trained within seconds and requires only one PTM query per sample.
Extensive natural language understanding experiments show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
- Streaming on-device detection of device directed speech from voice and touch-based invocation [12.42440115067583]
We propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection.
To facilitate model deployment on-device, we introduce a new streaming decision layer derived using the notion of temporal convolutional networks (TCN).
To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a streaming fashion.
arXiv Detail & Related papers (2021-10-09T22:33:42Z)
- Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach.
We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z)
- Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
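The Taylor-expansion and quadratic-form idea mentioned in that summary can be sketched as a second-order feature map. The actual operational layer in the paper differs; this pure-Python expansion is only an assumed illustration of how augmenting linear features with pairwise products lets a subsequent linear layer model quadratic interactions.

```python
# Illustrative second-order (Taylor-style) feature expansion.
# Not the paper's layer: a hypothetical sketch of the general idea.

def quadratic_expand(x):
    """Return the original features plus all pairwise products
    x_i * x_j for i <= j, i.e. a degree-2 polynomial feature map."""
    n = len(x)
    feats = list(x)  # first-order terms
    for i in range(n):
        for j in range(i, n):
            feats.append(x[i] * x[j])  # second-order terms
    return feats


print(quadratic_expand([1.0, 2.0]))
# -> [1.0, 2.0, 1.0, 2.0, 4.0]
```

The expanded dimensionality grows as n + n(n+1)/2, which is why such layers are typically applied to compact hidden representations in lightweight networks.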
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
- Transformer based Grapheme-to-Phoneme Conversion [0.9023847175654603]
In this paper, we investigate the application of transformer architecture to G2P conversion.
We compare its performance with recurrent and convolutional neural network based approaches.
The results show that transformer-based G2P outperforms the convolutional approach in terms of word error rate.
arXiv Detail & Related papers (2020-04-14T07:48:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.