StyleBERT: Chinese pretraining by font style information
- URL: http://arxiv.org/abs/2202.09955v2
- Date: Wed, 23 Feb 2022 01:30:45 GMT
- Title: StyleBERT: Chinese pretraining by font style information
- Authors: Chao Lv, Han Zhang, XinKai Du, Yunhao Zhang, Ying Huang, Wenhao Li,
Jia Han, Shanshan Gu
- Abstract summary: Experiments show that the model performs well on a wide range of Chinese NLP tasks.
Unlike English, Chinese characters carry additional signals such as glyph information.
- Score: 14.585511561131078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the success of downstream tasks built on English pre-trained language
models, pre-trained Chinese language models are likewise needed to achieve better
performance on Chinese NLP tasks. Unlike English, Chinese characters carry additional
signals such as glyph information. In this article we therefore propose StyleBERT, a
Chinese pre-trained language model that enriches the language model with several kinds
of embedding information: word, pinyin, five-stroke (wubi), and chaizi (character
decomposition). Experiments show that the model performs well on a wide range of
Chinese NLP tasks.
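The paper's implementation is not reproduced here; as a rough, hedged sketch of the multi-embedding idea, the snippet below sums word, pinyin, five-stroke, and chaizi embeddings into a single character representation. The vocabulary sizes, dimensions, and the sum-then-normalize fusion are illustrative assumptions, not StyleBERT's actual design.

```python
import torch
import torch.nn as nn

class MultiSourceEmbedding(nn.Module):
    """Toy fusion of several character-level signals (not StyleBERT's code).

    Vocabulary sizes and the hidden dimension are placeholders.
    """

    def __init__(self, hidden=768, n_word=21128, n_pinyin=1500,
                 n_wubi=30000, n_chaizi=30000):
        super().__init__()
        self.word = nn.Embedding(n_word, hidden)      # word/character ids
        self.pinyin = nn.Embedding(n_pinyin, hidden)  # pronunciation ids
        self.wubi = nn.Embedding(n_wubi, hidden)      # five-stroke (wubi) codes
        self.chaizi = nn.Embedding(n_chaizi, hidden)  # character-decomposition codes
        self.norm = nn.LayerNorm(hidden)

    def forward(self, word_ids, pinyin_ids, wubi_ids, chaizi_ids):
        # One simple fusion choice: sum the four channels, then normalize.
        fused = (self.word(word_ids) + self.pinyin(pinyin_ids)
                 + self.wubi(wubi_ids) + self.chaizi(chaizi_ids))
        return self.norm(fused)
```

The fused vectors would then feed a standard BERT-style encoder; concatenating the channels and projecting back to the hidden size is an equally plausible fusion choice.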
Related papers
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language
Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z) - LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z) - TiBERT: Tibetan Pre-trained Language Model [2.9554549423413303]
This paper collects large-scale training data from Tibetan websites and constructs a vocabulary covering 99.95% of the words in the corpus using SentencePiece.
We apply TiBERT to the downstream tasks of text classification and question generation, and compare it with classic models and multilingual pre-trained models.
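TiBERT's own preprocessing is not shown in this summary; the hedged sketch below illustrates the SentencePiece vocabulary-construction step described above. The corpus path, vocabulary size, and model type are placeholder assumptions, not the paper's settings.

```python
import sentencepiece as spm

# Train a SentencePiece model on a raw corpus file.
# "tibetan_corpus.txt", vocab_size and model_type are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",
    model_prefix="tibert_sp",
    vocab_size=32000,
    character_coverage=0.9995,  # standard SentencePiece coverage knob
    model_type="unigram",
)

# Load the trained model and tokenize a sentence.
sp = spm.SentencePieceProcessor(model_file="tibert_sp.model")
print(sp.encode("your sentence here", out_type=str))
```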
arXiv Detail & Related papers (2022-05-15T14:45:08Z) - ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin
Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
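ChineseBERT's actual pipeline is not reproduced here; as a small hedged illustration of deriving per-character pinyin features, one could use the third-party pypinyin package (the tone-number style is just one possible encoding):

```python
from pypinyin import Style, pinyin

text = "中文预训练"
# One pronunciation per character, with tone digits (e.g. "zhong1").
# pypinyin is a stand-in; it is not ChineseBERT's own preprocessing code.
py = pinyin(text, style=Style.TONE3, heteronym=False)
print(list(zip(text, (p[0] for p in py))))
# e.g. [('中', 'zhong1'), ('文', 'wen2'), ('预', 'yu4'), ('训', 'xun4'), ('练', 'lian4')]
```

Glyph features would additionally require rendering each character in one or more fonts, which is omitted here.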
arXiv Detail & Related papers (2021-06-30T13:06:00Z) - Investigating Transfer Learning in Multilingual Pre-trained Language
Models through Chinese Natural Language Inference [11.096793445651313]
We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI).
To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks for Chinese.
We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks.
arXiv Detail & Related papers (2021-06-07T22:00:18Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers, including 1) SHUOWEN (meaning Talk Word), pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese
Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
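Lattice-BERT's construction is not reproduced in this summary; the toy sketch below builds a character-plus-word lattice from a sentence using a hypothetical word list, recording each unit's start and end positions, which is the span information a lattice position encoding would consume.

```python
def build_lattice(sentence, word_vocab, max_word_len=4):
    """Toy lattice construction: every character is a unit, and every
    dictionary word found in the sentence is added as an extra unit
    covering the same character span. Not Lattice-BERT's actual code."""
    units = [(ch, i, i + 1) for i, ch in enumerate(sentence)]  # character units
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            candidate = sentence[i:j]
            if candidate in word_vocab:
                units.append((candidate, i, j))  # word unit spanning [i, j)
    return units

# Hypothetical word list, purely for illustration.
vocab = {"北京", "大学", "北京大学"}
for unit, start, end in build_lattice("北京大学生", vocab):
    print(unit, start, end)
```

All character and word units would then be fed jointly to the transformer, with the spans informing the position encoding.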
arXiv Detail & Related papers (2021-04-15T02:36:49Z) - CPM: A Large-scale Generative Chinese Pre-trained Language Model [76.65305358932393]
We release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data.
CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning.
arXiv Detail & Related papers (2020-12-01T11:32:56Z) - MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab
Pretraining [5.503321733964237]
We first propose a novel method, seg_tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization.
Experiments show that seg_tok not only improves the performance of Chinese PLMs on sentence-level tasks but also improves efficiency.
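The seg_tok vocabulary construction itself is not reproduced here; as a minimal hedged illustration of combining Chinese word segmentation with a second tokenization level, the sketch below segments with the third-party jieba package and falls back to characters for out-of-vocabulary words. The word list and the fallback rule are illustrative assumptions, not the paper's method.

```python
import jieba

def seg_then_tokenize(text, word_vocab):
    """Toy two-level tokenization: CWS first, then split words that are
    outside the (hypothetical) word vocabulary into single characters."""
    tokens = []
    for word in jieba.lcut(text):    # Chinese word segmentation
        if word in word_vocab:
            tokens.append(word)      # keep in-vocabulary words whole
        else:
            tokens.extend(word)      # fall back to single characters
    return tokens

# Hypothetical word-level vocabulary.
vocab = {"自然语言", "处理"}
print(seg_then_tokenize("自然语言处理很有趣", vocab))
```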
arXiv Detail & Related papers (2020-11-17T10:15:36Z) - CalliGAN: Style and Structure-aware Chinese Calligraphy Character
Generator [6.440233787863018]
Chinese calligraphy is the writing of Chinese characters as an art form performed with brushes.
Recent studies show that Chinese characters can be generated through image-to-image translation for multiple styles using a single model.
We propose a novel method within this approach that incorporates Chinese characters' component information into the model.
arXiv Detail & Related papers (2020-05-26T03:15:03Z) - Revisiting Pre-Trained Models for Chinese Natural Language Processing [73.65780892128389]
We revisit Chinese pre-trained language models to examine their effectiveness in a non-English language.
We also propose a model called MacBERT, which improves upon RoBERTa in several ways.
arXiv Detail & Related papers (2020-04-29T02:08:30Z)