Related papers: An open dataset for the evolution of oracle bone characters: EVOBC

URL: http://arxiv.org/abs/2401.12467v2
Date: Tue, 13 Feb 2024 08:21:50 GMT
Title: An open dataset for the evolution of oracle bone characters: EVOBC
Authors: Haisu Guan, Jinpeng Wan, Yuliang Liu, Pengjie Wang, Kaile Zhang, Zhebin Kuang, Xinyu Wang, Xiang Bai, Lianwen Jin
Abstract summary: The earliest extant Chinese characters originate from oracle bone inscriptions, which are closely related to other East Asian languages. In this study, we systematically collected ancient characters from authoritative texts and websites spanning six historical stages. We constructed an extensive dataset, consisting of 229,170 images representing 13,714 distinct character categories.
Score: 72.91231825135665
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The earliest extant Chinese characters originate from oracle bone inscriptions, which are closely related to other East Asian languages. These inscriptions hold immense value for anthropology and archaeology. However, deciphering oracle bone script remains a formidable challenge, with only approximately 1,600 of the over 4,500 extant characters elucidated to date. Further scholarly investigation is required to comprehensively understand this ancient writing system. Artificial Intelligence technology is a promising avenue for deciphering oracle bone characters, particularly concerning their evolution. However, one of the challenges is the lack of datasets mapping the evolution of these characters over time. In this study, we systematically collected ancient characters from authoritative texts and websites spanning six historical stages: Oracle Bone Characters - OBC (15th century B.C.), Bronze Inscriptions - BI (13th to 221 B.C.), Seal Script - SS (11th to 8th centuries B.C.), Spring and Autumn period Characters - SAC (770 to 476 B.C.), Warring States period Characters - WSC (475 B.C. to 221 B.C.), and Clerical Script - CS (221 B.C. to 220 A.D.). Subsequently, we constructed an extensive dataset, namely EVolution Oracle Bone Characters (EVOBC), consisting of 229,170 images representing 13,714 distinct character categories. We conducted validation and simulated deciphering on the constructed dataset, and the results demonstrate its high efficacy in aiding the study of oracle bone script. This openly accessible dataset aims to digitalize ancient Chinese scripts across multiple eras, facilitating the decipherment of oracle bone script by examining the evolution of glyph forms.
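As a rough illustration of the figures quoted in the abstract, the six stages and the headline counts can be written down as a small Python structure. This is only a sketch: the class and constant names (`Stage`, `STAGES`, `TOTAL_IMAGES`, and so on) are hypothetical and are not taken from the EVOBC release, the date ranges are copied verbatim from the abstract, and the deciphered and extant counts are the abstract's approximate values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One historical stage as listed in the abstract (names are illustrative)."""
    abbrev: str
    name: str
    period: str  # date range quoted verbatim from the abstract


# The six stages, in the order the abstract lists them.
STAGES = [
    Stage("OBC", "Oracle Bone Characters", "15th century B.C."),
    Stage("BI", "Bronze Inscriptions", "13th to 221 B.C."),
    Stage("SS", "Seal Script", "11th to 8th centuries B.C."),
    Stage("SAC", "Spring and Autumn period Characters", "770 to 476 B.C."),
    Stage("WSC", "Warring States period Characters", "475 B.C. to 221 B.C."),
    Stage("CS", "Clerical Script", "221 B.C. to 220 A.D."),
]

# Headline counts from the abstract.
TOTAL_IMAGES = 229_170
TOTAL_CATEGORIES = 13_714
DECIPHERED_APPROX = 1_600  # "approximately 1,600"
EXTANT_APPROX = 4_500      # "over 4,500", so the true share is at most this ratio


def images_per_category() -> float:
    """Mean number of images per character category across the whole dataset."""
    return TOTAL_IMAGES / TOTAL_CATEGORIES


def deciphered_fraction() -> float:
    """Approximate upper bound on the share of oracle bone characters deciphered."""
    return DECIPHERED_APPROX / EXTANT_APPROX


if __name__ == "__main__":
    print(f"{len(STAGES)} stages")
    print(f"{images_per_category():.2f} images per category on average")
    print(f"about {deciphered_fraction():.1%} of extant characters deciphered")
```

The average is taken over the whole dataset; the abstract does not give per-stage counts, so nothing here says how the images are distributed across the six stages.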

Related papers

InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling [19.419729615830466]
InteChar is a character list that integrates unencoded oracle bone characters with traditional and modern Chinese. We construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation. Experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks.
arXiv Detail & Related papers (2025-08-12T11:53:57Z)
OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography [58.790901822971094]
Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. This paper proposes a novel two-stage semantic framework, named OracleFusion.
arXiv Detail & Related papers (2025-06-26T08:56:07Z)
Oracle Bone Inscriptions Multi-modal Dataset [58.20314888996118]
Oracle bone inscriptions (OBI) are the earliest developed writing system in China, bearing invaluable written exemplifications of early Shang history and paleography. This paper proposes an Oracle Bone Inscriptions Multi-modal dataset, which includes annotation information for 10,077 pieces of oracle bones. This dataset can be used for a variety of AI-related research tasks relevant to the field of OBI, such as OBI Character Detection and Recognition, Rubbing Denoising, Character Matching, Character Generation, Reading Sequence Prediction, and Missing Characters Completion.
arXiv Detail & Related papers (2024-07-04T12:47:32Z)
Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
Deciphering Oracle Bone Language with Diffusion Models [70.69739681961558]
Oracle Bone Script (OBS) originated from China's Shang Dynasty approximately 3,000 years ago. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages.
arXiv Detail & Related papers (2024-06-02T09:42:23Z)
An open dataset for oracle bone script recognition and decipherment [66.35957530824872]
Oracle bone script, one of the earliest known forms of ancient Chinese writing, presents invaluable research materials for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years. The passage of time has obscured much of their meaning, presenting a significant challenge in deciphering these ancient texts. With the advent of Artificial Intelligence (AI), employing AI to assist in deciphering Oracle Bone Characters (OBCs) has become a feasible option. This dataset encompasses 77,064 images of 1,588 individual deciphered characters and 62,989 images of 9,411 undeciphered characters, with a total of 140,
arXiv Detail & Related papers (2024-01-27T09:54:16Z)
Diff-Oracle: Deciphering Oracle Bone Scripts with Controllable Diffusion Model [48.956844881630886]
Deciphering oracle bone scripts plays an important role in Chinese archaeology and philology. Diff-Oracle is a novel approach based on diffusion models to generate controllable oracle characters. Diff-Oracle substantially benefits downstream oracle character recognition, outperforming all existing SOTAs by a large margin.
arXiv Detail & Related papers (2023-12-21T07:48:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
