Related papers: InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling

URL: http://arxiv.org/abs/2508.15791v1
Date: Tue, 12 Aug 2025 11:53:57 GMT
Title: InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling
Authors: Xiaolei Diao, Zhihan Zhou, Lida Shi, Ting Wang, Ruihua Qi, Hao Xu, Daqian Shi
Abstract summary: InteChar is a character list that integrates unencoded oracle bone characters with traditional and modern Chinese. We construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation. Experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks.
Score: 19.419729615830466
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Constructing historical language models (LMs) plays a crucial role in aiding archaeological provenance studies and understanding ancient cultures. However, existing resources present major challenges for training effective LMs on historical texts. First, the scarcity of historical language samples renders unsupervised learning approaches based on large text corpora highly inefficient, hindering effective pre-training. Moreover, due to the considerable temporal gap and complex evolution of ancient scripts, the absence of comprehensive character encoding schemes limits the digitization and computational processing of ancient texts, particularly in early Chinese writing. To address these challenges, we introduce InteChar, a unified and extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese. InteChar enables consistent digitization and representation of historical texts, providing a foundation for robust modeling of ancient scripts. To evaluate the effectiveness of InteChar, we construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation, centered on Chinese oracle bone inscriptions. Extensive experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks, confirming the effectiveness of our approach and establishing a solid foundation for future research in ancient Chinese NLP.

Related papers

AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora [20.655514486215196]
The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. AncientBench aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more.
arXiv Detail & Related papers (2025-12-19T16:28:57Z)
OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography [58.790901822971094]
Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. This paper proposes a novel two-stage semantic framework, named OracleFusion.
arXiv Detail & Related papers (2025-06-26T08:56:07Z)
Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages [0.18846515534317265]
Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts.
arXiv Detail & Related papers (2025-06-21T13:33:07Z)
ParsiPy: NLP Toolkit for Historical Persian Texts in Python [1.637832760977605]
This work introduces ParsiPy, an NLP toolkit to handle phonetic transcriptions and analyze ancient texts. ParsiPy offers modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding.
arXiv Detail & Related papers (2025-03-22T16:21:29Z)
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fuxi, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. We reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z)
Skeleton and Font Generation Network for Zero-shot Chinese Character Generation [53.08596064763731]
We propose a novel Skeleton and Font Generation Network (SFGN) to achieve a more robust Chinese character font generation. We conduct experiments on misspelled characters, a substantial portion of which slightly differs from the common ones. Our approach visually demonstrates the efficacy of generated images and outperforms current state-of-the-art font generation methods.
arXiv Detail & Related papers (2025-01-14T12:15:49Z)
Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions (OBI) are among the oldest existing forms of writing in the world. A large number of them remain undeciphered, making their decipherment one of the global challenges in paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
Deciphering Oracle Bone Language with Diffusion Models [70.69739681961558]
Oracle Bone Script (OBS) originated from China's Shang Dynasty approximately 3,000 years ago. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages.
arXiv Detail & Related papers (2024-06-02T09:42:23Z)
An open dataset for the evolution of oracle bone characters: EVOBC [72.91231825135665]
The earliest extant Chinese characters originate from oracle bone inscriptions, which are closely related to other East Asian languages. In this study, we systematically collected ancient characters from authoritative texts and websites spanning six historical stages. We constructed an extensive dataset, consisting of 229,170 images representing 13,714 distinct character categories.
arXiv Detail & Related papers (2024-01-23T03:30:47Z)
GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts [11.289265479095956]
GujiBERT and GujiGPT language models are foundational models specifically designed for intelligent information processing of ancient texts. These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters. These models have exhibited exceptional performance across a range of validation tasks using publicly available datasets.
arXiv Detail & Related papers (2023-07-11T15:44:01Z)
Interactive Fiction Game Playing as Multi-Paragraph Reading Comprehension with Reinforcement Learning [94.50608198582636]
Interactive Fiction (IF) games with real human-written natural language texts provide a new natural evaluation for language understanding techniques. We take a novel perspective of IF game solving and re-formulate it as Multi-Passage Reading Comprehension (MPRC) tasks.
arXiv Detail & Related papers (2020-10-05T23:09:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.