WenyanGPT: A Large Language Model for Classical Chinese Tasks
- URL: http://arxiv.org/abs/2504.20609v1
- Date: Tue, 29 Apr 2025 10:19:05 GMT
- Title: WenyanGPT: A Large Language Model for Classical Chinese Tasks
- Authors: Xinyu Yao, Mengdi Wang, Bo Chen, Xiaobing Zhao
- Abstract summary: Existing natural language processing models are primarily optimized for Modern Chinese, resulting in inadequate performance on Classical Chinese. Through continued pre-training and instruction fine-tuning of the LLaMA3-8B-Chinese model, we construct WenyanGPT, a large language model specifically designed for Classical Chinese tasks.
- Score: 36.380841559581945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models are primarily optimized for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. Through continued pre-training and instruction fine-tuning of the LLaMA3-8B-Chinese model, we construct WenyanGPT, a large language model specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs on various Classical Chinese tasks. We make the model's training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.
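The recipe the abstract describes is a standard two-stage pipeline: continued causal-LM pre-training on raw Classical Chinese text, followed by instruction fine-tuning. The sketch below shows what that pipeline might look like with Hugging Face transformers; the checkpoint name, corpus file, and hyperparameters are illustrative assumptions, not the authors' released configuration.

```python
# A minimal sketch of the two-stage recipe, assuming a Hugging Face
# LLaMA3-8B-Chinese checkpoint and a local JSONL corpus. Names and
# hyperparameters are placeholders, not the paper's actual setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

BASE = "hfl/llama-3-chinese-8b"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

def tokenize(batch):
    # Plain causal-LM tokenization; the collator below derives labels.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Stage 1: continued pre-training on a raw Classical Chinese corpus
# ("wenyan_corpus.jsonl" is a stand-in for the paper's training data).
corpus = load_dataset("json", data_files="wenyan_corpus.jsonl")["train"]
corpus = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wenyan-cpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Stage 2 (instruction fine-tuning) repeats the same loop on text built
# from (instruction, response) pairs, typically masking the instruction
# tokens out of the loss.
```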
Related papers
- FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs.
We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol (sketched below) to mitigate potential biases in model responses.
Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
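The summary does not spell out CircularEval, but protocols under this name typically re-ask each multiple-choice question under every circular shift of its options and credit the model only if it answers correctly every time, which suppresses option-position bias. A minimal sketch, where `ask_model` is a hypothetical callable returning the letter the model picked:

```python
# A minimal sketch of a CircularEval-style protocol. `ask_model` is a
# hypothetical callable (question, options) -> chosen letter; this is an
# illustration of the idea, not FoundaBench's actual implementation.
from typing import Callable, List

def circular_eval(question: str, options: List[str], answer: str,
                  ask_model: Callable[[str, List[str]], str]) -> bool:
    labels = [chr(ord("A") + i) for i in range(len(options))]
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        # The gold option moves with the rotation; recompute its letter.
        gold = labels[rotated.index(answer)]
        if ask_model(question, rotated) != gold:
            return False  # one inconsistent answer fails the question
    return True  # correct under every option ordering
```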
arXiv Detail & Related papers (2024-04-29T01:49:07Z)
- Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model [36.01840141194335]
We introduce CT-LLM, a 2B-parameter large language model (LLM).
Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data.
CT-LLM excels at Chinese language tasks and demonstrates adeptness in English through SFT.
arXiv Detail & Related papers (2024-04-05T15:20:02Z)
- Code-Based English Models Surprising Performance on Chinese QA Pair Extraction Task [17.117337927315315]
Code-based models consistently perform better than text-based models in reasoning-intensive scenarios.
Code-based models containing a certain amount of Chinese data achieve even better performance.
The capabilities of code-based English models in specified Chinese tasks offer a distinct perspective for discussion on the philosophical "Chinese Room" thought experiment.
arXiv Detail & Related papers (2024-01-16T02:11:35Z)
- GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts [11.289265479095956]
GujiBERT and GujiGPT are foundational language models specifically designed for intelligent information processing of ancient texts.
These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters.
These models have exhibited exceptional performance across a range of validation tasks using publicly available datasets.
arXiv Detail & Related papers (2023-07-11T15:44:01Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries (sketched below).
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
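As a rough illustration of the retrieval idea behind Shuowen, one can score each dictionary sense against the character's in-context representation and keep the best match. The sketch below uses a plain cosine-similarity lookup; CDBERT's actual module learns this retrieval jointly with the PLM, so treat the code as an assumption-laden paraphrase.

```python
# A cosine-similarity sketch of dictionary-sense retrieval. The embeddings
# are assumed to come from some encoder; this is not CDBERT's real module.
from typing import List
import numpy as np

def retrieve_sense(context_vec: np.ndarray,
                   sense_vecs: List[np.ndarray]) -> int:
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Score every dictionary sense against the in-context representation.
    scores = [cos(context_vec, s) for s in sense_vecs]
    return int(np.argmax(scores))  # index of the most appropriate sense
```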
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- WYWEB: A NLP Evaluation Benchmark For Classical Chinese [10.138128038929237]
We introduce the WYWEB evaluation benchmark, which consists of nine NLP tasks in classical Chinese.
We evaluate existing pre-trained language models, all of which struggle with this benchmark.
arXiv Detail & Related papers (2023-05-23T15:15:11Z)
- Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results [12.00277814051069]
BLOOM-zh has its origins in the open-source BLOOM models presented by BigScience in 2022.
We extended the pre-training of BLOOM by an additional 7.4 billion tokens in Traditional Chinese and English covering a variety of domains.
BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability.
arXiv Detail & Related papers (2023-03-08T16:53:19Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP achieves state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN (the contrastive objective it builds on is sketched below).
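CLIP-style pretraining pulls matched image-text pairs together and pushes mismatched pairs apart with a symmetric cross-entropy over pairwise similarities. A minimal sketch of that objective in PyTorch follows; batch shapes and the temperature are illustrative, not the paper's settings.

```python
# A minimal CLIP-style contrastive loss. Shapes and temperature are
# illustrative assumptions, not Chinese CLIP's actual configuration.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products become cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature           # (batch, batch)
    targets = torch.arange(len(logits))            # matches on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2

# Example: a batch of 4 pairs with 512-dim embeddings.
print(clip_loss(torch.randn(4, 512), torch.randn(4, 512)))
```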
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
- StyleBERT: Chinese pretraining by font style information [14.585511561131078]
Unlike English, Chinese has special characteristics such as glyph information.
The experiments show that the model performs well on a wide range of Chinese NLP tasks.
arXiv Detail & Related papers (2022-02-21T02:45:12Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
- Revisiting Pre-Trained Models for Chinese Natural Language Processing [73.65780892128389]
We revisit Chinese pre-trained language models to examine their effectiveness in a non-English language.
We also propose a model called MacBERT, which improves upon RoBERTa in several ways.
arXiv Detail & Related papers (2020-04-29T02:08:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.