Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training
- URL: http://arxiv.org/abs/2305.18760v1
- Date: Tue, 30 May 2023 05:48:36 GMT
- Title: Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training
- Authors: Yuxuan Wang, Jianghui Wang, Dongyan Zhao, and Zilong Zheng
- Abstract summary: We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to enhancing characters' glyph representations with structure understanding.
Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks.
- Score: 50.100992353488174
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce CDBERT, a new learning paradigm that enhances the semantic
understanding ability of Chinese PLMs with dictionary knowledge and the
structure of Chinese characters. We name the two core modules of CDBERT
Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most
appropriate meaning from Chinese dictionaries and Jiezi refers to the process
of enhancing characters' glyph representations with structure understanding. To
facilitate dictionary understanding, we propose three pre-training tasks, i.e.,
Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and
Example Learning. We evaluate our method on both the modern Chinese understanding
benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new
polysemy discrimination task, PolyMRC, based on the collected dictionary of
ancient Chinese. Our paradigm demonstrates consistent improvements over previous
Chinese PLMs across all tasks. Moreover, our approach yields significant
gains in the few-shot setting of ancient Chinese understanding.
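As a rough illustration of the Shuowen step, the sketch below retrieves the dictionary sense whose definition embedding is closest to the context of a target character. The model choice (bert-base-chinese), mean pooling, and cosine scoring are illustrative assumptions, not the paper's actual retrieval module.
```python
# Hypothetical sketch of "Shuowen"-style sense retrieval: pick the dictionary
# definition whose embedding is closest to the context embedding. The model,
# pooling, and similarity metric are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled hidden states as a crude sentence embedding."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)

def retrieve_sense(context: str, senses: list[str]) -> str:
    """Return the dictionary definition most similar to the context."""
    ctx = embed(context)
    scores = [torch.cosine_similarity(ctx, embed(s), dim=0).item() for s in senses]
    return senses[max(range(len(senses)), key=scores.__getitem__)]

# Toy usage: two senses of a polysemous character, chosen by context.
senses = ["行: to walk, to travel", "行: a row or line; a profession"]
print(retrieve_sense("三人行必有我师", senses))
```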
Related papers
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions (OBI) are one of the oldest existing forms of writing in the world.
A large number of OBI remain undeciphered, making their decipherment one of the global challenges in paleography today.
This paper introduces a novel approach, Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
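P$^3$ itself is a deep model; the toy sketch below only conveys the underlying idea of matching recovered radicals against known characters' decompositions. The decomposition table and overlap score are hypothetical.
```python
# Illustrative radical-level matching: rank known characters by overlap between
# their (toy) radical decompositions and radicals recovered from a glyph.
# Not the P^3 model; the table and scoring are assumptions for illustration.
from collections import Counter

DECOMPOSITIONS = {  # toy radical decompositions of known characters
    "明": ["日", "月"],
    "林": ["木", "木"],
    "休": ["亻", "木"],
}

def rank_candidates(predicted_radicals: list[str]) -> list[tuple[str, float]]:
    """Score each known character by radical overlap (higher is better)."""
    pred = Counter(predicted_radicals)
    scored = []
    for char, radicals in DECOMPOSITIONS.items():
        ref = Counter(radicals)
        overlap = sum((pred & ref).values())
        scored.append((char, overlap / max(len(radicals), 1)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(rank_candidates(["日", "月"]))  # "明" should rank first
```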
arXiv Detail & Related papers (2024-06-05T07:34:39Z) - Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through
Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images with Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
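As a minimal sketch of the image-IDS alignment objective, the snippet below computes a symmetric CLIP-style contrastive loss over a batch of paired features. Encoder architectures, feature dimensions, and the temperature are assumptions for illustration.
```python
# CLIP-style alignment between character-image features and IDS-sequence
# features: symmetric InfoNCE over a batch of matched pairs.
import torch
import torch.nn.functional as F

def clip_alignment_loss(img_feats: torch.Tensor, ids_feats: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """img_feats, ids_feats: (batch, dim); matched pairs share a row index."""
    img = F.normalize(img_feats, dim=-1)
    ids = F.normalize(ids_feats, dim=-1)
    logits = img @ ids.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(img.size(0))           # i-th image matches i-th IDS
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random features standing in for encoder outputs.
loss = clip_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```
In the paper's framework the representations learned this way then supervise the CTR model; the sketch stops at the alignment loss.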
arXiv Detail & Related papers (2023-09-03T05:33:16Z) - Translate to Disambiguate: Zero-shot Multilingual Word Sense
Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
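A rough sketch of the C-WLT idea is to prompt a PLM for the translation of a single word given its sentence context, then map the translation back to a sense inventory. The prompt wording below is a guess, and gpt2 is only a runnable stand-in for the larger multilingual PLMs such a study would query.
```python
# Contextual word-level translation as a prompt; the prompt template and the
# placeholder model are illustrative assumptions, not the paper's setup.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder English-only LM

def c_wlt_prompt(sentence: str, word: str, target_lang: str = "Chinese") -> str:
    return (f'In the sentence "{sentence}", the word "{word}" '
            f"translates into {target_lang} as:")

prompt = c_wlt_prompt("The bank raised interest rates.", "bank")
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```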
arXiv Detail & Related papers (2023-04-26T19:55:52Z) - Character, Word, or Both? Revisiting the Segmentation Granularity for
Chinese Pre-trained Language Models [42.75756994523378]
We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words.
We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT.
MigBERT achieves new SOTA performance on all these tasks.
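A toy sketch of mixed-granularity segmentation follows: keep a word as one token when it appears in a word vocabulary, otherwise fall back to single characters. The vocabulary, the jieba segmenter, and the fallback rule are illustrative assumptions, not MigBERT's actual tokenizer.
```python
# Mixed-granularity segmentation: word-level tokens when known, character
# fallback otherwise. Toy vocabulary and rules for illustration only.
import jieba

WORD_VOCAB = {"中文", "自然语言", "处理"}  # toy word-level vocabulary

def mixed_granularity_tokenize(text: str) -> list[str]:
    tokens = []
    for word in jieba.cut(text):
        if word in WORD_VOCAB:
            tokens.append(word)          # word-level token
        else:
            tokens.extend(list(word))    # character-level fallback
    return tokens

print(mixed_granularity_tokenize("中文自然语言处理很有趣"))
```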
arXiv Detail & Related papers (2023-03-20T06:20:03Z) - Exploring and Adapting Chinese GPT to Pinyin Input Method [48.15790080309427]
We make the first exploration of leveraging Chinese GPT for the pinyin input method.
A frozen GPT achieves state-of-the-art performance on perfect pinyin.
However, the performance drops dramatically when the input includes abbreviated pinyin.
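One simple way to see how a frozen GPT can serve a pinyin input method is to expand each syllable into candidate characters and let the language model score the candidate sequences. The candidate table below and the Chinese GPT-2 checkpoint name are assumptions for illustration; the paper's approach differs in detail.
```python
# Pinyin-to-character conversion as LM scoring over toy candidates; the
# candidate table and model checkpoint are illustrative assumptions.
from itertools import product
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PINYIN_TABLE = {"ni": ["你", "尼"], "hao": ["好", "号"]}  # toy pinyin -> candidates

tokenizer = AutoTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model.eval()

def lm_nll(text: str) -> float:
    """Negative log-likelihood under the frozen LM (lower is better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def pinyin_to_chars(syllables: list[str]) -> str:
    candidates = ["".join(c) for c in product(*(PINYIN_TABLE[s] for s in syllables))]
    return min(candidates, key=lm_nll)

print(pinyin_to_chars(["ni", "hao"]))  # expected to prefer "你好"
```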
arXiv Detail & Related papers (2022-03-01T06:05:07Z) - ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin
Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
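A minimal sketch of fusing character, glyph, and pinyin embeddings into one input embedding is given below. The fusion scheme (concatenate plus a linear map) and all dimensions are assumptions for illustration, not the released ChineseBERT architecture.
```python
# Fuse character, pinyin, and glyph features into a single input embedding.
# Dimensions and the concat+linear fusion are illustrative assumptions.
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, pinyin_vocab=1500, hidden=768, glyph_dim=1728):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden)
        self.pinyin_emb = nn.Embedding(pinyin_vocab, hidden)
        self.glyph_proj = nn.Linear(glyph_dim, hidden)  # flattened glyph image -> hidden
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, char_ids, pinyin_ids, glyph_pixels):
        parts = [self.char_emb(char_ids),
                 self.pinyin_emb(pinyin_ids),
                 self.glyph_proj(glyph_pixels)]
        return self.fuse(torch.cat(parts, dim=-1))

emb = FusionEmbedding()
out = emb(torch.randint(0, 21128, (2, 8)),
          torch.randint(0, 1500, (2, 8)),
          torch.randn(2, 8, 1728))
print(out.shape)  # (2, 8, 768)
```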
arXiv Detail & Related papers (2021-06-30T13:06:00Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influences of three main factors on Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
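To give a flavor of a pronunciation-based tokenization step, the toy sketch below maps characters to pinyin strings so that homophones share surface forms. The pypinyin romanization and the omission of any subword step are simplifying assumptions, not the paper's tokenizers.
```python
# "SHUOWEN"-flavored step: characters -> pinyin; a real tokenizer would then
# run subword segmentation (e.g., BPE) over the romanized text.
from pypinyin import lazy_pinyin

def pronunciation_tokens(text: str) -> list[str]:
    """One pinyin token per character, illustrative only."""
    return lazy_pinyin(text)

print(pronunciation_tokens("说文解字"))  # ['shuo', 'wen', 'jie', 'zi']
```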
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - An In-depth Study on Internal Structure of Chinese Words [34.864343591706984]
This work proposes to model the deep internal structures of Chinese words as dependency trees with 11 labels for distinguishing syntactic relationships.
We manually annotate a word-internal structure treebank (WIST) consisting of over 30K multi-character words from the Chinese Penn Treebank.
We present detailed and interesting analysis on WIST to reveal insights on Chinese word formation.
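A small sketch of how a word-internal dependency structure might be represented follows: one head index and one relation label per character. The example word, head indices, and label names are illustrative assumptions, not WIST's actual annotation scheme.
```python
# A per-character head/label representation of word-internal structure.
from dataclasses import dataclass

@dataclass
class WordInternalTree:
    chars: list[str]    # characters of the multi-character word
    heads: list[int]    # head position per character (1-based; 0 = root)
    labels: list[str]   # relation label per character

# Toy example: in "火车站" (train station), "站" is the head modified by "火车".
tree = WordInternalTree(chars=["火", "车", "站"],
                        heads=[2, 3, 0],
                        labels=["modifier", "modifier", "root"])
print(tree)
```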
arXiv Detail & Related papers (2021-06-01T09:09:51Z) - Chinese Lexical Simplification [29.464388721085548]
There is no prior research on the Chinese lexical simplification (CLS) task.
To circumvent difficulties in acquiring annotations, we manually create the first benchmark dataset for CLS.
We present five different types of methods as baselines to generate substitute candidates for the complex word.
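One common baseline style for substitute generation is to mask the complex word and let a Chinese masked LM propose replacements, roughly as sketched below. The model choice and the single-[MASK] simplification (real substitutes may be multi-character) are assumptions for illustration, not the paper's exact baselines.
```python
# Generate substitute candidates with a masked LM; illustrative only.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")

def substitute_candidates(sentence: str, complex_word: str, top_k: int = 5) -> list[str]:
    masked = sentence.replace(complex_word, fill.tokenizer.mask_token, 1)
    return [pred["token_str"] for pred in fill(masked, top_k=top_k)]

print(substitute_candidates("他十分吝啬", "吝啬"))
```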
arXiv Detail & Related papers (2020-10-14T12:55:36Z)