AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation
- URL: http://arxiv.org/abs/2009.11473v2
- Date: Wed, 21 Apr 2021 03:42:41 GMT
- Title: AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation
- Authors: Huishuang Tian, Kexin Yang, Dayiheng Liu, Jiancheng Lv
- Abstract summary: AnchiBERT is a pre-trained language model based on the architecture of BERT.
We evaluate AnchiBERT on both language understanding and generation tasks, including poem classification, ancient-modern Chinese translation, poem generation, and couplet generation.
- Score: 22.08457469951396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ancient Chinese is the essence of Chinese culture. There are several natural
language processing tasks of ancient Chinese domain, such as ancient-modern
Chinese translation, poem generation, and couplet generation. Previous studies
usually use the supervised models which deeply rely on parallel data. However,
it is difficult to obtain large-scale parallel data of ancient Chinese. In
order to make full use of the more easily available monolingual ancient Chinese
corpora, we release AnchiBERT, a pre-trained language model based on the
architecture of BERT, which is trained on large-scale ancient Chinese corpora.
We evaluate AnchiBERT on both language understanding and generation tasks,
including poem classification, ancient-modern Chinese translation, poem
generation, and couplet generation. The experimental results show that
AnchiBERT outperforms BERT as well as the non-pretrained models and achieves
state-of-the-art results in all cases.
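The abstract describes pretraining a BERT-architecture model on monolingual ancient Chinese corpora. The core of that objective is BERT's masked-language-model corruption scheme; the sketch below is a minimal, generic implementation of that masking rule, not AnchiBERT's actual code (the toy replacement vocabulary and the probabilities shown are the standard BERT defaults, assumed here for illustration):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["天", "地", "山", "水", "月"]  # hypothetical replacement pool

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style corruption: select each position with probability
    mask_prob; of the selected positions, ~80% become [MASK], ~10% a
    random token, ~10% stay unchanged. Labels hold the original token
    at selected positions and None elsewhere (ignored by the loss)."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # model must recover this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)
            elif r < 0.9:
                masked.append(rng.choice(TOY_VOCAB))
            else:
                masked.append(tok)
        else:
            labels.append(None)           # no prediction at this position
            masked.append(tok)
    return masked, labels
```

Because the objective needs only raw text, it can be trained on the monolingual ancient Chinese corpora the abstract mentions, with no parallel data.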
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions are among the oldest existing forms of writing in the world.
A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today.
This paper introduces a novel approach, namely Puzzle Pieces Picker (P$3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
- Towards Effective Ancient Chinese Translation: Dataset, Model, and Evaluation [28.930640246972516]
In this paper, we propose Erya for ancient Chinese translation.
From a dataset perspective, we collect, clean, and classify ancient Chinese materials from various sources.
From a model perspective, we devise an Erya training method oriented towards ancient Chinese.
arXiv Detail & Related papers (2023-08-01T02:43:27Z)
- GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts [11.289265479095956]
GujiBERT and GujiGPT language models are foundational models specifically designed for intelligent information processing of ancient texts.
These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters.
These models have exhibited exceptional performance across a range of validation tasks using publicly available datasets.
arXiv Detail & Related papers (2023-07-11T15:44:01Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- TransCouplet: Transformer based Chinese Couplet Generation [1.084959821967413]
The Chinese couplet is a form of poetry with complex syntax, composed in the ancient Chinese language.
This paper presents a transformer-based sequence-to-sequence couplet generation model.
We also evaluate glyph, pinyin, and part-of-speech tagging features against couplet grammatical rules.
arXiv Detail & Related papers (2021-12-03T04:34:48Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
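One simple way to realize the fusion described above is to concatenate a character embedding with glyph and pinyin embeddings and project the concatenation back to the model dimension. The sketch below is a toy pure-Python illustration of that idea; the random lookup tables and the single linear projection are assumptions for demonstration, not ChineseBERT's exact fusion layer:

```python
import random

D = 4                      # toy embedding size
rng = random.Random(0)

def table(n):
    # toy lookup table: id -> D-dimensional embedding (hypothetical data)
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(n)]

# one table per information source: character identity, glyph, pinyin
char_emb, glyph_emb, pinyin_emb = table(100), table(100), table(100)
# fusion matrix projecting the concatenated 3*D vector back to D dims
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(3 * D)]

def fuse(char_id):
    """Concatenate the character, glyph, and pinyin views, then project
    back to D dimensions; the fused vector would feed the transformer."""
    x = char_emb[char_id] + glyph_emb[char_id] + pinyin_emb[char_id]
    return [sum(x[i] * W[i][j] for i in range(3 * D)) for j in range(D)]
```

The design lets the model see a character's shape and pronunciation alongside its identity, which is what the abstract credits for the performance boost.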
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
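The lattice construction described above can be sketched generically: every character is a text unit, and every dictionary word matching a multi-character span is added as a further unit over that span. This is a minimal illustration assuming a simple set-based vocabulary lookup, not the paper's actual pipeline:

```python
def lattice_units(sentence, vocab, max_word_len=4):
    """Enumerate the lattice's text units: every character, plus every
    in-vocabulary word covering a multi-character span. Each unit is
    (text, start, end) over character positions."""
    units = [(ch, i, i + 1) for i, ch in enumerate(sentence)]
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            if sentence[i:j] in vocab:
                units.append((sentence[i:j], i, j))
    return units
```

For example, with vocabulary {"北京", "大学", "北京大学"}, the sentence "北京大学" yields four character units plus three word units, all of which would be fed into the transformer together.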
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- Generating Major Types of Chinese Classical Poetry in a Uniformed Framework [88.57587722069239]
We propose a GPT-2 based framework for generating major types of Chinese classical poems.
Preliminary results show this enhanced model can generate Chinese classical poems of major types with high quality in both form and content.
arXiv Detail & Related papers (2020-03-13T14:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.