SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining
- URL: http://arxiv.org/abs/2106.00400v1
- Date: Tue, 1 Jun 2021 11:20:02 GMT
- Title: SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining
- Authors: Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang,
Zhiyuan Liu, Maosong Sun
- Abstract summary: We study the influence of three main factors on Chinese tokenization for pretrained language models: pronunciation, glyph, and word boundary.
We propose three kinds of tokenizers: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers; and 3) word segmented tokenizers, which apply Chinese word segmentation.
We find that SHUOWEN and JIEZI tokenizers generally outperform conventional single-character tokenizers.
- Score: 48.880840711568425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional tokenization methods for Chinese pretrained language models
(PLMs) treat each character as an indivisible token (Devlin et al., 2019),
which ignores the characteristics of the Chinese writing system. In this work,
we comprehensively study the influence of three main factors on Chinese
tokenization for PLMs: pronunciation, glyph (i.e., shape), and word boundary.
Correspondingly, we propose three kinds of tokenizers: 1) SHUOWEN (meaning Talk
Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character),
the glyph-based tokenizers; 3) Word segmented tokenizers, the tokenizers with
Chinese word segmentation. To empirically compare the effectiveness of the studied
tokenizers, we pretrain BERT-style language models with them and evaluate the
models on various downstream NLU tasks. We find that SHUOWEN and JIEZI
tokenizers can generally outperform conventional single-character tokenizers,
while Chinese word segmentation shows no benefit as a preprocessing step.
Moreover, the proposed SHUOWEN and JIEZI tokenizers exhibit significantly
better robustness in handling noisy texts. The code and pretrained models will
be publicly released to facilitate linguistically informed Chinese NLP.
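To make the proposed designs concrete, the following is a minimal sketch, assuming the pypinyin package as the grapheme-to-pinyin tool (not the authors' released code), of a SHUOWEN-style pronunciation-based tokenization step: each character is mapped to a toned pinyin syllable, producing the stream over which a subword learner such as BPE would then be trained.

    # Minimal sketch of a SHUOWEN-style pronunciation-based tokenization step.
    # Assumption: pypinyin (pip install pypinyin); the paper's actual pipeline
    # may differ.
    from pypinyin import lazy_pinyin, Style

    def pinyin_tokenize(text: str) -> list[str]:
        # Style.TONE3 appends the tone number (e.g. "zhong1"), keeping
        # same-syllable characters with different tones distinct.
        return lazy_pinyin(text, style=Style.TONE3)

    print(pinyin_tokenize("说文解字"))  # ['shuo1', 'wen2', 'jie3', 'zi4']

A glyph-based JIEZI tokenizer would be analogous, replacing the pinyin stream with stroke or component sequences derived from each character's glyph, and a word segmented tokenizer would first group characters into words before building the vocabulary.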
Related papers
- Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili [29.252250069388687]
Tokenization splits words into characters or subwords, creating embeddings that better represent the structure of the language.
We propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer on the Swahili language (a naive syllable-splitting sketch follows this entry).
arXiv Detail & Related papers (2024-03-26T17:26:50Z) - Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through
Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images and Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
arXiv Detail & Related papers (2023-09-03T05:33:16Z) - Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language
Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z) - READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input
Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN (a toy noise-simulation sketch follows this entry).
arXiv Detail & Related papers (2023-02-14T20:14:39Z) - Exploiting Word Semantics to Enrich Character Representations of Chinese
- Exploiting Word Semantics to Enrich Character Representations of Chinese
Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models.
We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin
Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps (a minimal fusion sketch follows this entry).
arXiv Detail & Related papers (2021-06-30T13:06:00Z) - LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short
Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z) - MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab
Pretraining [5.503321733964237]
We first propose a novel method, seg_tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization.
Experiments show that seg_tok not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency (a rough sketch of the idea follows this entry).
arXiv Detail & Related papers (2020-11-17T10:15:36Z) - 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)