Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
- URL: http://arxiv.org/abs/2509.09731v1
- Date: Wed, 10 Sep 2025 13:02:29 GMT
- Title: Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
- Authors: Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
- Abstract summary: We present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess Vision-Language Models (VLMs). AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.
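The abstract mentions evaluating VLMs "using multiple metrics" but does not name them; for page-level OCR, a standard choice is character error rate (CER), i.e., edit distance normalized by reference length. The following is a minimal illustrative sketch of such a metric, not the benchmark's actual scoring code:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two strings via the classic single-row DP."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal value dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

Character-level (rather than word-level) scoring is the natural granularity for Chinese text, where there are no whitespace word boundaries; e.g., `cer("abcd", "abce")` yields `0.25`.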
Related papers
- NeuCLIRTech: Chinese Monolingual and Cross-Language Information Retrieval Evaluation in a Challenging Domain [49.3943974580576]
This paper presents NeuCLIRTech, an evaluation collection for cross-language retrieval over technical information. The collection consists of technical documents written in Chinese and those same documents machine translated into English. The collection supports two retrieval scenarios: monolingual retrieval in Chinese, and cross-language retrieval with English as the query language.
arXiv Detail & Related papers (2026-02-05T05:57:55Z)
- AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora [20.655514486215196]
The rapid development of large language models calls for benchmarks that can evaluate their comprehension of ancient characters. AncientBench aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The benchmark contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more.
arXiv Detail & Related papers (2025-12-19T16:28:57Z)
- VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
- Enhancement of text recognition for hanja handwritten documents of Ancient Korea [0.769672852567215]
We implement a high-performance optical character recognition model for classical handwritten documents. The recognition of hanja handwritten documents is a meaningful and distinctive challenge.
arXiv Detail & Related papers (2024-12-14T02:29:07Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs [43.1380542830147]
We introduce CKnowEdit, the first-ever Chinese knowledge editing dataset designed to correct linguistic, factual, and logical errors in Large Language Models (LLMs). We collect seven types of knowledge from a wide range of sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba. By analyzing this dataset, we highlight the challenges current LLMs face in mastering Chinese.
arXiv Detail & Related papers (2024-09-09T17:11:51Z)
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions (OBI) are one of the oldest existing forms of writing in the world. A large number of them remain undeciphered, making OBI one of the great challenges in paleography today.
This paper introduces a novel approach, namely Puzzle Pieces Picker (P$3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
- AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models [15.490610582567543]
AC-EVAL is a benchmark designed to assess the advanced knowledge and reasoning capabilities of Large Language Models (LLMs).
The benchmark comprises 13 tasks, spanning historical facts, geography, social customs, art, philosophy, classical poetry and prose.
Our evaluation of top-performing LLMs, tailored for both English and Chinese, reveals a substantial potential for enhancing ancient text comprehension.
arXiv Detail & Related papers (2024-03-11T10:24:37Z)
- Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE [23.598825660594926]
ACLUE is an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese.
We observed a noticeable disparity in their performance between modern Chinese and ancient Chinese.
ChatGLM2 demonstrates the most remarkable performance, achieving an average score of 37.4%.
arXiv Detail & Related papers (2023-10-14T10:06:39Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension [9.66226932673554]
Native Chinese Reader (NCR) is a new machine reading comprehension dataset with particularly long articles in both modern and classical Chinese. NCR is collected from exam questions for Chinese language courses in China's high schools, which are designed to evaluate the language proficiency of native Chinese youth.
arXiv Detail & Related papers (2021-12-13T09:11:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.