Related papers: AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora

URL: http://arxiv.org/abs/2512.17756v1
Date: Fri, 19 Dec 2025 16:28:57 GMT
Title: AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora
Authors: Zhihan Zhou, Daqian Shi, Rui Song, Lida Shi, Xiaolei Diao, Hao Xu
Abstract summary: The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. The AncientBench aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more.
Score: 20.655514486215196
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Comprehension of ancient texts plays an important role in archaeology and understanding of Chinese history and civilization. The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks are mostly targeted at modern Chinese and transmitted documents in ancient Chinese, but the part of excavated documents in ancient Chinese is not covered. To meet this need, we propose the AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.

Related papers

Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning [37.68293827920165]
We present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess Vision-Language Models (VLMs). AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.
arXiv Detail & Related papers (2025-09-10T13:02:29Z)
EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing [47.704427451419456]
EssayBench is a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types.
arXiv Detail & Related papers (2025-06-03T08:14:46Z)
Shared Heritage, Distinct Writing: Rethinking Resource Selection for East Asian Historical Documents [60.348103523743276]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun. Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models [15.490610582567543]
AC-EVAL is a benchmark designed to assess the advanced knowledge and reasoning capabilities of Large Language Models (LLMs). The benchmark comprises 13 tasks, spanning historical facts, geography, social customs, art, philosophy, classical poetry and prose. Our evaluation of top-performing LLMs, tailored for both English and Chinese, reveals a substantial potential for enhancing ancient text comprehension.
arXiv Detail & Related papers (2024-03-11T10:24:37Z)
Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE [23.598825660594926]
ACLUE is an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese. We observed a noticeable disparity in their performance between modern Chinese and ancient Chinese. ChatGLM2 demonstrates the most remarkable performance, achieving an average score of 37.4%.
arXiv Detail & Related papers (2023-10-14T10:06:39Z)
Towards Effective Ancient Chinese Translation: Dataset, Model, and Evaluation [28.930640246972516]
In this paper, we propose Erya for ancient Chinese translation. From a dataset perspective, we collect, clean, and classify ancient Chinese materials from various sources. From a model perspective, we devise an Erya training method oriented towards ancient Chinese.
arXiv Detail & Related papers (2023-08-01T02:43:27Z)
Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters. We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries. Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
SenteCon: Leveraging Lexicons to Learn Human-Interpretable Language Representations [51.08119762844217]
SenteCon is a method for introducing human interpretability in deep language representations. We show that SenteCon provides high-level interpretability at little to no cost to predictive performance on downstream tasks.
arXiv Detail & Related papers (2023-05-24T05:06:28Z)
AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation [22.08457469951396]
AnchiBERT is a pre-trained language model based on the architecture of BERT. We evaluate AnchiBERT on both language understanding and generation tasks, including poem classification.
arXiv Detail & Related papers (2020-09-24T03:41:13Z)
Generating Major Types of Chinese Classical Poetry in a Uniformed Framework [88.57587722069239]
We propose a GPT-2 based framework for generating major types of Chinese classical poems. Preliminary results show this enhanced model can generate Chinese classical poems of major types with high quality in both form and content.
arXiv Detail & Related papers (2020-03-13T14:16:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.