Exploring the Usage of Chinese Pinyin in Pretraining
- URL: http://arxiv.org/abs/2310.04960v1
- Date: Sun, 8 Oct 2023 01:26:44 GMT
- Title: Exploring the Usage of Chinese Pinyin in Pretraining
- Authors: Baojun Wang, Kun Xu, Lifeng Shang
- Abstract summary: Pinyin is essential in many scenarios, such as error correction and fault tolerance for ASR-introduced errors.
In this work, we explore various ways of using pinyin in pretraining models and propose a new pretraining method called PmBERT.
- Score: 28.875174965608554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike alphabetic languages, Chinese spelling and pronunciation are
different, and both characters and pinyin play an important role in Chinese
language understanding. In Chinese NLP tasks, characters or words are almost
always adopted as model input, and few works study how to use pinyin. However,
pinyin is essential in many scenarios, such as error correction and fault
tolerance for ASR-introduced errors. Most of these errors are caused by words
with the same or similar pronunciation, which we refer to as SSP (same or
similar pronunciation) errors for short. In this work, we explore various ways
of using pinyin in pretraining models and propose a new pretraining method
called PmBERT. Our method uses characters and pinyin in parallel for
pretraining. Through delicate pretraining tasks, the character and pinyin
representations are fused, which enhances tolerance for SSP errors. We conduct
comprehensive experiments and ablation tests to explore what makes a robust
phonetically enhanced Chinese language model. The experimental results on both
a constructed noise-added dataset and a public error-correction dataset
demonstrate that our model is more robust than SOTA models.
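The abstract describes feeding characters and pinyin to the model in parallel and fusing the two representations. Below is a minimal, hypothetical Python sketch of that input pairing, assuming the `pypinyin` library for character-to-pinyin conversion and a toy fusion by element-wise addition; the vocabularies, dimensions, and fusion strategy are illustrative assumptions, not the actual PmBERT implementation.

```python
# Minimal sketch (not the authors' code): pairing each Chinese character with
# its pinyin so both views can be fed to a model in parallel, as PmBERT's
# abstract describes.  Vocabulary sizes, fusion-by-sum, and names are assumed.
import torch
import torch.nn as nn
from pypinyin import lazy_pinyin, Style

text = "汉语拼音"
chars = list(text)
# Tone-numbered pinyin, one syllable per character, e.g. ['han4', 'yu3', ...].
pinyin = lazy_pinyin(text, style=Style.TONE3)
assert len(chars) == len(pinyin)

# Toy vocabularies built from this single example; a real model would use a
# full character vocabulary and a closed set of toned pinyin syllables.
char_vocab = {c: i for i, c in enumerate(sorted(set(chars)))}
pinyin_vocab = {p: i for i, p in enumerate(sorted(set(pinyin)))}

char_emb = nn.Embedding(len(char_vocab), 32)
pinyin_emb = nn.Embedding(len(pinyin_vocab), 32)

char_ids = torch.tensor([char_vocab[c] for c in chars])
pinyin_ids = torch.tensor([pinyin_vocab[p] for p in pinyin])

# One simple way to fuse the two views is element-wise addition before a
# Transformer encoder; the paper's actual fusion mechanism may differ.
fused = char_emb(char_ids) + pinyin_emb(pinyin_ids)
print(fused.shape)  # torch.Size([4, 32])
```

In a full setup, the fused embeddings would presumably feed a BERT-style encoder trained with masked-prediction objectives over both the character and pinyin views.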
Related papers
- Large Language Model Should Understand Pinyin for Chinese ASR Error Correction [31.13523648668466]
We propose Pinyin-enhanced GEC to improve Chinese ASR error correction.
Our approach only utilizes synthetic errors for training and employs the one-best hypothesis during inference.
Experiments on the Aishell-1 and the Common Voice datasets demonstrate that our approach consistently outperforms GEC with text-only input.
arXiv Detail & Related papers (2024-09-20T06:50:56Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and we find that these models often suffer significant performance drops on READIN; a toy sketch of this kind of pinyin-driven homophone noise appears after this list.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that even humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- Exploring and Adapting Chinese GPT to Pinyin Input Method [48.15790080309427]
We make the first exploration of leveraging Chinese GPT for the pinyin input method.
A frozen GPT achieves state-of-the-art performance on perfect pinyin.
However, the performance drops dramatically when the input includes abbreviated pinyin.
arXiv Detail & Related papers (2022-03-01T06:05:07Z)
- Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech Recognition with Pinyin and Character [15.999657143705045]
Pinyin and characters, as the spelling and writing systems of Mandarin Chinese respectively, are mutually reinforcing.
We propose a novel Mandarin Chinese ASR model with a dual-decoder Transformer designed around the characteristics of pinyin transcripts and character transcripts.
On the AISHELL-1 test set, the proposed Speech-Pinyin-Character-Interaction (SPCI) model achieves a 9.85% character error rate (CER) without a language model.
arXiv Detail & Related papers (2022-01-26T07:59:03Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking [20.74049189959078]
We propose a Chinese spell checker called ReaLiSe that directly leverages the multimodal information of Chinese characters.
The ReaLiSe model tackles the CSC task by (1) capturing the semantic, phonetic, and graphic information of the input characters, and (2) mixing the information in these modalities to predict the correct output.
Experiments on the SIGHAN benchmarks show that the proposed model outperforms strong baselines by a large margin.
arXiv Detail & Related papers (2021-05-26T02:38:11Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
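Both the noise-added evaluation data mentioned in the abstract above and the pinyin-input noise studied in READIN revolve around SSP confusions, i.e. characters being replaced by homophones. The following is a toy, hypothetical sketch of injecting such noise with `pypinyin`; the candidate character pool, substitution rate, and function names are made up for illustration and do not come from either paper.

```python
# Hypothetical sketch (not from the papers): injecting SSP (same or similar
# pronunciation) noise by swapping characters for homophones, roughly the kind
# of corruption a noise-added evaluation set or pinyin-typing errors produce.
import random
from collections import defaultdict
from pypinyin import lazy_pinyin, Style

# Tiny candidate pool for the demo; a real setup would use a large confusion set.
candidates = "他她它再在是事时十识世市式视"
by_pinyin = defaultdict(list)
for ch in candidates:
    syllable = lazy_pinyin(ch, style=Style.TONE3)[0]
    by_pinyin[syllable].append(ch)

def add_ssp_noise(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace some characters with a homophone from the candidate pool."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        syllable = lazy_pinyin(ch, style=Style.TONE3)[0]
        others = [c for c in by_pinyin.get(syllable, []) if c != ch]
        if others and rng.random() < rate:
            out.append(rng.choice(others))
        else:
            out.append(ch)
    return "".join(out)

print(add_ssp_noise("他在市场"))  # e.g. '她再市场', depending on the seed
```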