Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised
Learning
- URL: http://arxiv.org/abs/2102.00621v1
- Date: Mon, 1 Feb 2021 03:47:59 GMT
- Title: Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised
Learning
- Authors: Yi Shi and Congyi Wang and Yu Chen and Bin Wang
- Abstract summary: We propose a novel semi-supervised learning framework for Mandarin Chinese polyphone disambiguation.
We explore the effect of various proxy labeling strategies including entropy-thresholding and lexicon-based labeling.
- Score: 9.595035978417322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The majority of Chinese characters are monophonic, i.e., their pronunciations
are unique and can thus be retrieved easily from a lookup table. Their
counterparts, polyphonic characters have more than one pronunciation. To
perform linguistic computation tasks related to spoken Mandarin Chinese, the
correct pronunciation for each polyphone must be identified among several
candidates according to its context. This process is called Polyphone
Disambiguation, a key procedure in the Grapheme-to-phoneme (G2P) conversion
step of a Chinese text-to-speech (TTS) system. The problem is well explored
with both knowledge-based and learning-based approaches, yet it remains
challenging due to the lack of publicly available datasets and the complex
linguistic phenomena associated with polyphones. In this paper, we propose a novel
semi-supervised learning (SSL) framework for Mandarin Chinese polyphone
disambiguation that can potentially leverage unlimited unlabeled text data. We
explore the effect of various proxy labeling strategies including
entropy-thresholding and lexicon-based labeling. As for the architecture, a
pre-trained ELECTRA model is combined with convolutional BLSTM layers and
fine-tuned on our task. Qualitative and quantitative experiments demonstrate
that our method achieves state-of-the-art performance in Mandarin Chinese
polyphone disambiguation. In addition, we publish a novel dataset specifically
for the polyphone disambiguation task to promote further research.
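
Since the abstract only names the architecture at a high level, the PyTorch sketch below illustrates one plausible reading of "a pre-trained ELECTRA model combined with convolutional BLSTM layers": an ELECTRA encoder, a 1-D convolution over the contextual embeddings, a bidirectional LSTM, and a per-token classifier over candidate pinyins. The checkpoint name, layer sizes, and class name are assumptions for illustration, not the authors' published code.

```python
# Illustrative sketch only: ELECTRA encoder + Conv + BLSTM + per-token classifier.
# Checkpoint name and hyperparameters are assumptions, not taken from the paper.
import torch.nn as nn
from transformers import AutoModel

class ElectraConvBLSTMClassifier(nn.Module):
    def __init__(self, num_pinyin_classes: int,
                 encoder_name: str = "hfl/chinese-electra-base-discriminator"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size            # e.g. 768
        self.conv = nn.Conv1d(hidden, 256, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(256, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(256, num_pinyin_classes)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) contextual character representations
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)    # local n-gram features
        h, _ = self.blstm(h)                                 # sentence-level context
        return self.classifier(h)                            # per-token pinyin logits
```

At inference time, only the logits at the polyphone's position would be read out, with the argmax restricted to that character's candidate pronunciations.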
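
Entropy-thresholding proxy labeling, as named in the abstract, can be read as: run the current model over unlabeled text and keep only those polyphone predictions whose output distribution has low entropy as pseudo-labels for the next round of fine-tuning. The snippet below is a minimal sketch of that selection step; the threshold value, batch layout, and function name are hypothetical.

```python
# Minimal sketch of entropy-thresholded proxy labeling (semi-supervised step).
# Threshold and data layout are hypothetical, not values from the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_proxy_labels(model, unlabeled_batches, entropy_threshold=0.5):
    """Yield (input_ids, position, pseudo_label) triples for confident predictions."""
    model.eval()
    for input_ids, attention_mask, polyphone_pos in unlabeled_batches:
        logits = model(input_ids, attention_mask)                 # (B, T, C)
        probs = F.softmax(logits, dim=-1)
        # Entropy of the predictive distribution at each token position.
        entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)   # (B, T)
        for i, pos in enumerate(polyphone_pos):
            if entropy[i, pos] < entropy_threshold:               # confident -> keep
                yield input_ids[i], pos, probs[i, pos].argmax().item()
```

The accepted pseudo-labeled sentences would then be mixed with the human-annotated data when the classifier is fine-tuned again; lexicon-based labeling plays the analogous role, accepting an unlabeled polyphone only when a pronunciation lexicon resolves it unambiguously.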
Related papers
- Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
- Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT [81.99600765234285]
We propose an end-to-end framework to predict the pronunciation of a polyphonic character.
The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier.
arXiv Detail & Related papers (2025-01-02T06:51:52Z)
- MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and
Phonetic Domains for Speech Representation Learning [0.76146285961466]
We present a methodology for linguistic feature extraction, focusing on automatically syllabifying words in multiple languages.
In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification.
The system was built with open-source components and resources.
arXiv Detail & Related papers (2023-10-17T19:27:23Z)
- Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are essential components of a text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation related task, typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language
Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input
Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone
Disambiguation [35.35236347070773]
We build a Grapheme-to-Phoneme (G2P) model to predict the pronunciation of polyphonic characters, and a Phoneme-to-Grapheme (P2G) model to predict text from pronunciation.
We design a data balance strategy to improve the accuracy of some typical polyphonic characters in the training set with imbalanced distribution or data scarcity.
arXiv Detail & Related papers (2022-11-17T12:37:41Z)
- A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese [2.380039717474099]
Grapheme-to-phoneme (G2P) conversion is an indispensable part of a Mandarin Chinese text-to-speech (TTS) system.
In this paper, we propose a Chinese polyphone BERT model to predict the pronunciations of Chinese polyphonic characters.
arXiv Detail & Related papers (2022-07-01T09:16:29Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for
Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
- AlloVera: A Multilingual Allophone Database [137.3686036294502]
AlloVera provides mappings from 218 allophones to phonemes for 14 languages.
We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task.
arXiv Detail & Related papers (2020-04-17T02:02:18Z)
- g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin
Chinese Based on a New Open Benchmark Dataset [14.323478990713477]
We introduce a new benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation.
We train a simple neural network model on it, and find that it outperforms other preexisting G2P systems.
arXiv Detail & Related papers (2020-04-07T05:44:58Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)