Learning to Pronounce Chinese Without a Pronunciation Dictionary
- URL: http://arxiv.org/abs/2010.04744v1
- Date: Fri, 9 Oct 2020 18:03:49 GMT
- Title: Learning to Pronounce Chinese Without a Pronunciation Dictionary
- Authors: Christopher Chu, Scot Fang and Kevin Knight
- Abstract summary: We demonstrate a program that learns to pronounce Chinese text in Mandarin, without a pronunciation dictionary.
From non-parallel streams of Chinese characters and Chinese pinyin syllables, it establishes a many-to-many mapping between characters and pronunciations.
Its token-level character-to-syllable accuracy is 89%, which significantly exceeds the 22% accuracy of prior work.
- Score: 10.622817647136667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We demonstrate a program that learns to pronounce Chinese text in Mandarin,
without a pronunciation dictionary. From non-parallel streams of Chinese
characters and Chinese pinyin syllables, it establishes a many-to-many mapping
between characters and pronunciations. Using unsupervised methods, the program
effectively deciphers writing into speech. Its token-level
character-to-syllable accuracy is 89%, which significantly exceeds the 22%
accuracy of prior work.
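As a concrete illustration of the decipherment setting, here is a minimal sketch in the spirit of the paper's setup, not the authors' actual system: pinyin syllables are hidden states, a syllable bigram language model estimated from the non-parallel pinyin stream is held fixed, and EM (via forward-backward) re-estimates the channel model P(character | syllable). All data and hyperparameters below are toy stand-ins.

```python
# A minimal decipherment sketch in the spirit of the paper's setup -- not the
# authors' actual system. Hidden states are pinyin syllables; a syllable
# bigram LM estimated from the (non-parallel) pinyin stream is held fixed,
# and EM via forward-backward re-estimates the channel P(char | syllable).
from collections import Counter, defaultdict

pinyin_stream = "ni hao ni hao ma wo hao".split()  # toy stand-in corpus
char_stream = list("你好你好吗我好")                 # toy stand-in corpus

# 1) Fixed syllable bigram LM from the pinyin stream (add-one smoothing).
syllables = sorted(set(pinyin_stream))
bigrams = Counter(zip(pinyin_stream, pinyin_stream[1:]))
unigrams = Counter(pinyin_stream)

def lm(prev, cur):
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(syllables))

# 2) Start the channel model P(char | syllable) uniform.
chars = sorted(set(char_stream))
emit = {s: {c: 1.0 / len(chars) for c in chars} for s in syllables}

# 3) EM: E-step = forward-backward over the character stream;
#    M-step re-estimates only the channel (the LM stays fixed).
n = len(char_stream)
for _ in range(20):
    fwd = [dict() for _ in range(n)]
    bwd = [dict() for _ in range(n)]
    for s in syllables:
        fwd[0][s] = emit[s][char_stream[0]] / len(syllables)
        bwd[n - 1][s] = 1.0
    for t in range(1, n):
        for s in syllables:
            fwd[t][s] = emit[s][char_stream[t]] * sum(
                fwd[t - 1][p] * lm(p, s) for p in syllables)
    for t in range(n - 2, -1, -1):
        for s in syllables:
            bwd[t][s] = sum(lm(s, q) * emit[q][char_stream[t + 1]] * bwd[t + 1][q]
                            for q in syllables)
    z = sum(fwd[n - 1][s] for s in syllables)
    counts = defaultdict(Counter)
    for t, c in enumerate(char_stream):
        for s in syllables:
            counts[s][c] += fwd[t][s] * bwd[t][s] / z  # posterior P(s at t)
    for s in syllables:
        total = sum(counts[s].values()) + 1e-9 * len(chars)
        emit[s] = {c: (counts[s][c] + 1e-9) / total for c in chars}

# The learned channel is the character-pronunciation mapping.
for s in syllables:
    best = max(emit[s], key=emit[s].get)
    print(s, "->", best, round(emit[s][best], 2))
```

Reading the channel off by argmax gives a one-best table; keeping the full distribution preserves the many-to-many mapping. A realistic run replaces the toy streams with large corpora and a much larger state space.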
Related papers
- Exploring the Usage of Chinese Pinyin in Pretraining [28.875174965608554]
Pinyin is essential in many scenarios, such as error correction and fault tolerance for ASR-introduced errors.
In this work, we explore various ways of using pinyin in pretraining models and propose a new pretraining method called PmBERT.
arXiv Detail & Related papers (2023-10-08T01:26:44Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
Experimenting with a series of strong pretrained language models as well as robust training methods, we find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- "Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction [58.40808660657153]
We investigate whether whole word masking (WWM) leads to better context understanding ability for Chinese BERT.
We construct a dataset including labels for 19,075 tokens in 10,448 sentences.
We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively.
arXiv Detail & Related papers (2022-03-01T08:24:56Z)
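To make the CLM/WWM contrast in the entry above concrete, here is a minimal sketch of the two masking strategies, assuming word boundaries come from an external segmenter; it shows only the selection step, not the paper's training setup.

```python
# Contrast of character-level masking (CLM) and whole word masking (WWM) for
# Chinese. Word boundaries are assumed to come from an external segmenter;
# the sentence, ratio, and [MASK] convention here are illustrative.
import random

def clm_mask(chars, ratio=0.15):
    """Mask individual characters, ignoring word boundaries."""
    picked = set(random.sample(range(len(chars)), max(1, int(len(chars) * ratio))))
    return ["[MASK]" if i in picked else c for i, c in enumerate(chars)]

def wwm_mask(words, ratio=0.15):
    """Mask every character of each sampled word together."""
    picked = set(random.sample(range(len(words)), max(1, int(len(words) * ratio))))
    out = []
    for i, w in enumerate(words):
        out.extend(["[MASK]"] * len(w) if i in picked else list(w))
    return out

words = ["语言", "模型", "很", "强大"]          # pre-segmented toy sentence
print(clm_mask([c for w in words for c in w]))  # may split a word
print(wwm_mask(words))                          # masks whole words only
```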
- Exploring and Adapting Chinese GPT to Pinyin Input Method [48.15790080309427]
We make the first exploration to leverage Chinese GPT for the pinyin input method.
A frozen GPT achieves state-of-the-art performance on perfect pinyin.
However, the performance drops dramatically when the input includes abbreviated pinyin.
arXiv Detail & Related papers (2022-03-01T06:05:07Z)
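The gap between perfect and abbreviated pinyin in the entry above is easiest to see from how pinyin constrains decoding. Below is a toy sketch of pinyin-constrained character selection, with an invented homophone table and a stand-in unigram scorer where a real input method would use GPT logits.

```python
# Toy pinyin-constrained decoding for an input method: each typed syllable
# restricts candidates to its homophones, and a language model picks among
# them. The table and scores are invented; a real system would score with
# GPT. With abbreviated pinyin (e.g. "n" for "ni"), candidate sets grow
# sharply, which is where the paper reports performance dropping.
homophones = {"ni": ["你", "泥"], "hao": ["好", "号"], "ma": ["吗", "妈", "马"]}
lm_score = {"你": 0.9, "泥": 0.1, "好": 0.8, "号": 0.2,
            "吗": 0.6, "妈": 0.3, "马": 0.1}

def decode(pinyin_seq):
    """Greedy per-syllable choice; a real decoder searches over sequences."""
    return "".join(max(homophones[s], key=lambda c: lm_score.get(c, 0.0))
                   for s in pinyin_seq)

print(decode(["ni", "hao", "ma"]))  # -> 你好吗
```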
- Decoupling recognition and transcription in Mandarin ASR [21.36547395115413]
We propose factoring audio -> Hanzi into two sub-tasks: (1) audio -> Pinyin and (2) Pinyin -> Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese.
Factoring the audio -> Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.
arXiv Detail & Related papers (2021-08-02T19:09:41Z)
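A minimal sketch of the factorization's interface from the entry above, with both stages stubbed out; in the paper each stage is a separately trained model, and one benefit of the split is that the Pinyin -> Hanzi stage can exploit text-only corpora.

```python
# Sketch of the factored pipeline: audio -> Pinyin -> Hanzi. Both stages are
# stubs; in the paper each is a separately trained model, evaluated end to
# end with character error rate (CER) on Aishell-1.
from typing import List

def audio_to_pinyin(audio: bytes) -> List[str]:
    """Stage 1 stub: an acoustic model would emit pinyin syllables here."""
    return ["ni3", "hao3"]  # placeholder output

def pinyin_to_hanzi(syllables: List[str]) -> str:
    """Stage 2 stub: a sequence model would resolve homophones here."""
    table = {"ni3": "你", "hao3": "好"}  # toy one-to-one table
    return "".join(table[s] for s in syllables)

def transcribe(audio: bytes) -> str:
    # The factorization decouples acoustics from orthography, so the two
    # stages can be trained and improved independently.
    return pinyin_to_hanzi(audio_to_pinyin(audio))

print(transcribe(b""))  # -> 你好
```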
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
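One plausible reading of such an input layer, sketched below with random matrices: per-position character, glyph, and pinyin embeddings are concatenated and projected back to the hidden size. The real model's glyph CNN, pinyin encoder, and dimensions are not reproduced here, so treat every shape and name as an assumption.

```python
# Illustrative fusion of character, glyph, and pinyin views of each token,
# loosely in the spirit of ChineseBERT's input layer; all shapes, names, and
# random matrices below are assumptions, not the released model.
import numpy as np

rng = np.random.default_rng(0)
V, P, D = 6000, 450, 128                     # char vocab, pinyin vocab, hidden size
char_emb = rng.normal(size=(V, D)) * 0.02
glyph_emb = rng.normal(size=(V, D)) * 0.02   # stand-in for CNN-over-glyph features
pinyin_emb = rng.normal(size=(P, D)) * 0.02  # stand-in for a pinyin-sequence encoder
W_fuse = rng.normal(size=(3 * D, D)) * 0.02  # fusion projection

def fused_input(char_ids, pinyin_ids):
    """Concatenate the three views per position and project back to size D."""
    c = char_emb[np.array(char_ids)]
    g = glyph_emb[np.array(char_ids)]
    p = pinyin_emb[np.array(pinyin_ids)]
    return np.concatenate([c, g, p], axis=-1) @ W_fuse

print(fused_input([10, 20, 30], [1, 2, 3]).shape)  # (3, 128)
```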
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose two kinds of tokenizers: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
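A toy sketch of what a pronunciation-based (SHUOWEN-style) tokenizer changes: if token ids are assigned over pinyin forms rather than raw characters, homophones share ids. The paper's tokenizers are learned and more elaborate; the mapping table below is invented, and tones are dropped purely for simplicity.

```python
# Toy pronunciation-based tokenizer: ids are assigned over (toneless) pinyin
# forms, so homophonic characters share a token id. The character-to-pinyin
# table is an invented stand-in for a real pronunciation lexicon.
char2pinyin = {"你": "ni", "泥": "ni", "好": "hao", "号": "hao"}
vocab = {}

def encode(text):
    """Map each character to its pinyin form, then to a growing id table."""
    return [vocab.setdefault(char2pinyin.get(ch, ch), len(vocab)) for ch in text]

print(encode("你好"), encode("泥号"))  # both -> [0, 1]: homophones share ids
```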
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts (Traditional and Simplified Chinese).
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
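The core difficulty here is the one-to-many character mapping in Simplified-to-Traditional conversion: 发 can be 發 (emit) or 髮 (hair), and context has to decide. Below is a toy sketch with invented context scores, standing in for the paper's subword-sequence disambiguation.

```python
# Toy disambiguation of one-to-many Simplified -> Traditional mappings using
# invented left-context scores; the paper instead ties subword sequences and
# scores candidates with trained models.
mapping = {"发": ["發", "髮"], "头": ["頭"], "出": ["出"]}
pair_score = {("頭", "髮"): 0.9, ("頭", "發"): 0.1,
              ("出", "發"): 0.8, ("出", "髮"): 0.2}

def convert(simplified):
    out = []
    for ch in simplified:
        prev = out[-1] if out else None
        cands = mapping.get(ch, [ch])
        out.append(max(cands, key=lambda c: pair_score.get((prev, c), 0.5)))
    return "".join(out)

print(convert("头发"), convert("出发"))  # -> 頭髮 出發
```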
This list is automatically generated from the titles and abstracts of the papers on this site.