Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction
- URL: http://arxiv.org/abs/2406.03019v1
- Date: Wed, 5 Jun 2024 07:34:39 GMT
- Title: Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction
- Authors: Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu
- Abstract summary: Oracle Bone Inscriptions are among the oldest existing forms of writing in the world.
A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today.
This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
- Score: 73.26364649572237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Oracle Bone Inscriptions are among the oldest existing forms of writing in the world. However, due to their great antiquity, a large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making their decipherment one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction. We deconstruct OBI into foundational strokes and radicals, then employ a Transformer model to reconstruct them into their modern counterparts, offering a groundbreaking solution to ancient script analysis. To further this endeavor, a new Ancient Chinese Character Puzzles (ACCP) dataset was developed, comprising an extensive collection of character images from seven key historical stages, annotated with detailed radical sequences. The experiments yield promising insights, underscoring the potential and effectiveness of our approach in deciphering the intricacies of ancient Chinese scripts. Through this novel dataset and methodology, we aim to bridge the gap between traditional philology and modern document analysis techniques, offering new insights into the rich history of Chinese linguistic heritage.
Related papers
- Semi-supervised Chinese Poem-to-Painting Generation via Cycle-consistent Adversarial Networks [2.250406890348191]
We propose a semi-supervised approach using cycle-consistent adversarial networks to leverage the limited paired data.
We introduce novel evaluation metrics to assess the quality, diversity, and consistency of the generated poems and paintings.
The proposed model outperforms previous methods, showing promise in capturing the symbolic essence of artistic expression.
arXiv Detail & Related papers (2024-10-25T04:57:44Z)
- A Cross-Font Image Retrieval Network for Recognizing Undeciphered Oracle Bone Inscriptions [12.664292922995532]
Oracle Bone Inscription (OBI) is the earliest mature writing system known in China to date.
We propose a cross-font image retrieval network (CFIRN) to decipher OBI characters.
arXiv Detail & Related papers (2024-09-10T10:04:58Z)
- Deciphering Oracle Bone Language with Diffusion Models [70.69739681961558]
Oracle Bone Script (OBS) originated from China's Shang Dynasty approximately 3,000 years ago.
This paper introduces a novel approach that adopts image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD).
OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages.
arXiv Detail & Related papers (2024-06-02T09:42:23Z)
- An open dataset for oracle bone script recognition and decipherment [66.35957530824872]
Oracle bone script, one of the earliest known forms of ancient Chinese writing, presents invaluable research materials for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years.
The passage of time has obscured much of their meaning, presenting a significant challenge in deciphering these ancient texts.
With the advent of Artificial Intelligence (AI), employing AI to assist in deciphering Oracle Bone Characters (OBCs) has become a feasible option.
This dataset encompasses 77,064 images of 1,588 individual deciphered characters and 62,989 images of 9,411 undeciphered characters, with a total of 140,
arXiv Detail & Related papers (2024-01-27T09:54:16Z)
- An open dataset for the evolution of oracle bone characters: EVOBC [72.91231825135665]
The earliest extant Chinese characters originate from oracle bone inscriptions, which are closely related to other East Asian languages.
In this study, we systematically collected ancient characters from authoritative texts and websites spanning six historical stages.
We constructed an extensive dataset, consisting of 229,170 images representing 13,714 distinct character categories.
arXiv Detail & Related papers (2024-01-23T03:30:47Z)
- The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS [3.9227136203353865]
We propose a framework for ancient Chinese Word Segmentation and Part-of-Speech Tagging.
On the one hand, we try to capture the wordhood semantics; on the other hand, we re-predict the uncertain samples of the baseline model.
Our architecture outperforms pre-trained BERT with CRF and existing tools such as Jiayan.
arXiv Detail & Related papers (2023-10-12T16:55:44Z)
- GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts [11.289265479095956]
GujiBERT and GujiGPT language models are foundational models specifically designed for intelligent information processing of ancient texts.
These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters.
These models have exhibited exceptional performance across a range of validation tasks using publicly available datasets.
arXiv Detail & Related papers (2023-07-11T15:44:01Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.