Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models
- URL: http://arxiv.org/abs/2411.07563v1
- Date: Tue, 12 Nov 2024 05:38:43 GMT
- Title: Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models
- Authors: Dongrui Han, Mingyu Cui, Jiawen Kang, Xixin Wu, Xunying Liu, Helen Meng,
- Abstract summary: Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems.
Inspired by the success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems are proposed.
The efficacy of incorporating ICKR into G2P conversion systems is demonstrated thoroughly on the Librig2p dataset.
- Score: 74.71484979138161
- License:
- Abstract: Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping grapheme to corresponding phonetic representations. However, it faces ambiguities problems where the same grapheme can represent multiple phonemes depending on contexts, posing a challenge for G2P conversion. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems with LLMs' in-context knowledge retrieval (ICKR) capabilities are proposed to promote disambiguation capability. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated thoroughly on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with weighted average phoneme error rate (PER) reductions of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system can increase of 3.5% absolute (3.8% relative) on the Librig2p dataset.
Related papers
- It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment [72.75844404617959]
This paper proposes a novel cross-granularity alignment gait recognition method, named XGait.
To achieve this goal, the XGait first contains two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces.
Comprehensive experiments on two large-scale gait datasets show XGait with the Rank-1 accuracy of 80.5% on Gait3D and 88.3% CCPG.
arXiv Detail & Related papers (2024-11-16T08:54:27Z) - LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study [2.8948274245812327]
Grapheme-to-phoneme (G2P) conversion is critical in speech processing.
Large language models (LLMs) have recently demonstrated significant potential in various language tasks.
We present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language.
arXiv Detail & Related papers (2024-09-13T06:13:55Z) - Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation [14.448198170932226]
Think-on-Graph 2.0 (ToG-2) is a hybrid RAG framework that iteratively retrieves information from both unstructured and structured knowledge sources.
ToG-2 alternates between graph retrieval and context retrieval to search for in-depth clues relevant to the question.
Extensive experiments show that ToG-2 achieves state-of-the-art (SOTA) performance on 6 out of 7 knowledge-intensive datasets with GPT-3.5.
arXiv Detail & Related papers (2024-07-15T15:20:40Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Stay on topic with Classifier-Free Guidance [57.28934343207042]
We show that CFG can be used broadly as an inference-time technique in pure language modeling.
We show that CFG improves the performance of Pythia, GPT-2 and LLaMA-family models across an array of tasks.
arXiv Detail & Related papers (2023-06-30T17:07:02Z) - LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme
conversion [18.83348872103488]
Grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations.
Existing methods are either slow or poor in performance, and are limited in application scenarios.
We propose a novel method named LiteG2P which is fast, light and theoretically parallel.
arXiv Detail & Related papers (2023-03-02T09:16:21Z) - r-G2P: Evaluating and Enhancing Robustness of Grapheme to Phoneme
Conversion by Controlled noise introducing and Contextual information
incorporation [32.75866643254402]
We show that neural G2P models are extremely sensitive to orthographical variations in graphemes like spelling mistakes.
We propose three controlled noise introducing methods to synthesize noisy training data.
We incorporate the contextual information with the baseline and propose a robust training strategy to stabilize the training process.
arXiv Detail & Related papers (2022-02-21T13:29:30Z) - MixSpeech: Data Augmentation for Low-resource Automatic Speech
Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR)
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z) - ContextNet: Improving Convolutional Neural Networks for Automatic Speech
Recognition with Global Context [58.40112382877868]
We propose a novel CNN-RNN-transducer architecture, which we call ContextNet.
ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules.
We demonstrate that ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets.
arXiv Detail & Related papers (2020-05-07T01:03:18Z) - Transformer based Grapheme-to-Phoneme Conversion [0.9023847175654603]
In this paper, we investigate the application of transformer architecture to G2P conversion.
We compare its performance with recurrent and convolutional neural network based approaches.
The results show that transformer based G2P outperforms the convolutional-based approach in terms of word error rate.
arXiv Detail & Related papers (2020-04-14T07:48:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.