ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
- URL: http://arxiv.org/abs/2106.16038v1
- Date: Wed, 30 Jun 2021 13:06:00 GMT
- Title: ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
- Authors: Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu, Jiwei Li
- Abstract summary: We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
- Score: 32.70080326854314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent pretraining models for Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntactic and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained from different fonts of a Chinese character and captures character semantics from visual features, while the pinyin embedding characterizes the pronunciation of Chinese characters, handling the highly prevalent heteronym phenomenon in Chinese (the same character having different pronunciations with different meanings). Pretrained on a large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps. The proposed model achieves new SOTA performance on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, and sentence pair matching, and competitive performance in named entity recognition. Code and pretrained models are publicly available at https://github.com/ShannonAI/ChineseBert.
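For intuition, the sketch below shows one way the character, glyph, and pinyin signals described in the abstract could be combined into a single fused token embedding. It is a minimal PyTorch sketch under assumed dimensions and layer choices, not the released ChineseBERT implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """Hypothetical sketch: combine character, glyph, and pinyin embeddings
    into one fused embedding per token (all dimensions are illustrative)."""

    def __init__(self, vocab_size=21128, hidden=768,
                 glyph_feat_dim=24 * 24 * 3, pinyin_vocab=32, pinyin_len=8):
        super().__init__()
        # Ordinary learnable character-ID embedding.
        self.char_emb = nn.Embedding(vocab_size, hidden)
        # Glyph branch: flattened character bitmaps (e.g. from several fonts) -> hidden.
        self.glyph_proj = nn.Linear(glyph_feat_dim, hidden)
        # Pinyin branch: small CNN over the romanized pinyin sequence of each character.
        self.pinyin_emb = nn.Embedding(pinyin_vocab, hidden)
        self.pinyin_cnn = nn.Conv1d(hidden, hidden, kernel_size=2)
        # Fusion layer maps the concatenation back to the model width.
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, char_ids, glyph_feats, pinyin_ids):
        # char_ids:    (batch, seq) character ids
        # glyph_feats: (batch, seq, glyph_feat_dim) flattened rendered bitmaps
        # pinyin_ids:  (batch, seq, pinyin_len) letter/tone ids per character
        c = self.char_emb(char_ids)                      # (B, S, H)
        g = self.glyph_proj(glyph_feats)                 # (B, S, H)
        B, S, L = pinyin_ids.shape
        p = self.pinyin_emb(pinyin_ids.view(B * S, L))   # (B*S, L, H)
        p = self.pinyin_cnn(p.transpose(1, 2))           # (B*S, H, L-1)
        p = p.max(dim=-1).values.view(B, S, -1)          # (B, S, H)
        return self.fuse(torch.cat([c, g, p], dim=-1))   # (B, S, H)
```

The fused embedding would then take the place of the usual token embedding at the input of a BERT-style encoder.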
Related papers
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images and Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
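As an illustration of the alignment idea summarized above, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss between printed-character image embeddings and IDS embeddings. The encoders, batching, and temperature are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(image_emb, ids_emb, temperature=0.07):
    """Hypothetical sketch: pull each printed-character image embedding toward
    its Ideographic Description Sequence (IDS) embedding and push apart
    mismatched pairs within the batch (the encoders are omitted)."""
    # Normalize both modalities so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)   # (batch, dim)
    ids_emb = F.normalize(ids_emb, dim=-1)       # (batch, dim)
    logits = image_emb @ ids_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image -> IDS and IDS -> image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```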
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots [80.32906566894171]
We propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese.
IAP efficiently establishes connections among Chinese, English, and visual semantics in CLIP's embedding space.
Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%-10% of the training data.
arXiv Detail & Related papers (2023-05-19T09:20:27Z)
- Stroke-Based Autoencoders: Self-Supervised Learners for Efficient Zero-Shot Chinese Character Recognition [4.64065792373245]
We develop a stroke-based autoencoder (SAE) to model the sophisticated morphology of Chinese characters.
Our SAE architecture outperforms other existing methods in zero-shot recognition.
arXiv Detail & Related papers (2022-07-17T14:39:10Z)
- Exploring and Adapting Chinese GPT to Pinyin Input Method [48.15790080309427]
We make the first exploration to leverage Chinese GPT for pinyin input method.
A frozen GPT achieves state-of-the-art performance on perfect pinyin.
However, the performance drops dramatically when the input includes abbreviated pinyin.
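To illustrate how a language model can drive a pinyin input method, here is a hypothetical sketch that masks a Chinese GPT's next-token logits so that only characters matching the typed syllable remain. The pypinyin lookup and the constrain_to_pinyin helper are illustrative assumptions and not the paper's method; it covers the full-pinyin case, not abbreviated pinyin.

```python
import torch
from pypinyin import lazy_pinyin  # real library; used here as a simple syllable lookup

def constrain_to_pinyin(next_token_logits, vocab, syllable):
    """Hypothetical sketch: given a causal LM's next-token logits (1-D tensor over
    the vocabulary), keep only single-character tokens whose toneless pinyin
    matches the typed syllable, so the LM ranks candidate characters."""
    mask = torch.full_like(next_token_logits, float("-inf"))
    for idx, tok in enumerate(vocab):
        if len(tok) == 1 and lazy_pinyin(tok) == [syllable]:
            mask[idx] = 0.0  # keep this candidate character
    return next_token_logits + mask  # argmax now picks the best matching character

# Usage idea: feed a context such as "我想吃" to a Chinese GPT, take its
# next-token logits, then call constrain_to_pinyin(logits, vocab, "fan")
# to rank candidates like 饭 / 犯 / 范 by the language model's preference.
```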
arXiv Detail & Related papers (2022-03-01T06:05:07Z)
- StyleBERT: Chinese pretraining by font style information [14.585511561131078]
Unlike English, Chinese characters carry special information such as glyph structure.
The experiments show that the model achieves good performance on a wide range of Chinese NLP tasks.
arXiv Detail & Related papers (2022-02-21T02:45:12Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
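As a toy illustration of the two views, the sketch below represents text either by pinyin syllables (pronunciation) or by hand-listed character components (glyph). The TOY_COMPONENTS table and both helpers are made up for illustration only; the paper's tokenizers are built from large-scale pronunciation and glyph resources rather than a tiny dictionary.

```python
from pypinyin import lazy_pinyin  # real library for pronunciation lookup

# Toy component table standing in for a full glyph-decomposition resource
# (a real glyph-based tokenizer would use stroke or radical data, not this dict).
TOY_COMPONENTS = {"好": ["女", "子"], "明": ["日", "月"], "林": ["木", "木"]}

def pronunciation_tokenize(text):
    """SHUOWEN-style idea: represent each character by its pinyin syllable,
    so homophones share sub-units."""
    return list(lazy_pinyin(text))

def glyph_tokenize(text):
    """JIEZI-style idea: split each character into visual components,
    so characters sharing radicals share sub-units."""
    return [part for ch in text for part in TOY_COMPONENTS.get(ch, [ch])]

print(pronunciation_tokenize("明天好"))  # ['ming', 'tian', 'hao']
print(glyph_tokenize("明天好"))          # ['日', '月', '天', '女', '子']
```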
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
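For intuition about the lattice input, here is a toy sketch that enumerates character units plus lexicon-matched word units together with their spans. The build_lattice helper is hypothetical; the actual Lattice-BERT model additionally uses lattice-aware position encodings and attention, which are omitted here.

```python
def build_lattice(sentence, lexicon):
    """Hypothetical sketch: collect every character plus every lexicon word that
    occurs in the sentence, each tagged with its (start, end) span, so that
    multi-granularity text units can be fed to a transformer together."""
    units = [(ch, i, i + 1) for i, ch in enumerate(sentence)]  # character units
    for start in range(len(sentence)):
        for end in range(start + 2, len(sentence) + 1):
            word = sentence[start:end]
            if word in lexicon:                                # word units
                units.append((word, start, end))
    return units

# Toy lexicon; a real system would use a large word vocabulary.
print(build_lattice("研究生命", {"研究", "研究生", "生命"}))
# [('研', 0, 1), ('究', 1, 2), ('生', 2, 3), ('命', 3, 4),
#  ('研究', 0, 2), ('研究生', 0, 3), ('生命', 2, 4)]
```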
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z)
- CalliGAN: Style and Structure-aware Chinese Calligraphy Character Generator [6.440233787863018]
Chinese calligraphy is the writing of Chinese characters as an art form performed with brushes.
Recent studies show that Chinese characters can be generated through image-to-image translation for multiple styles using a single model.
We propose a novel method that follows this approach and incorporates Chinese characters' component information into the model.
arXiv Detail & Related papers (2020-05-26T03:15:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.