When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun
- URL: http://arxiv.org/abs/2411.04822v1
- Date: Thu, 07 Nov 2024 15:59:54 GMT
- Title: When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun
- Authors: Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh
- Abstract summary: We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
- Abstract: Historical and linguistic connections within the Sinosphere have led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm{}0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These mixed results emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.
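As a rough illustration of how such transfer effects can be quantified, the sketch below compares sequence-labeling F1 (via seqeval) and translation BLEU (via sacrebleu) between a model trained with and without Classical Chinese data. All tags, sentences, and the resulting deltas are toy placeholders, not the paper's results.

```python
# Toy comparison of two training conditions on the same test set; not the paper's data.
import sacrebleu
from seqeval.metrics import f1_score

# Hypothetical NER outputs (BIO tags) from the two training conditions.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_hanja_only = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_with_classical = [["B-PER", "I-PER", "O", "O"]]
delta_f1 = f1_score(gold, pred_with_classical) - f1_score(gold, pred_hanja_only)
print(f"F1 delta from adding Classical Chinese data: {delta_f1:+.4f}")

# Hypothetical translations scored against a single reference stream.
refs = [["The king inspected the granaries in the third month."]]
hyp_without = ["The king inspected granaries in third month."]
hyp_with = ["The king inspected the granaries in the third month."]
delta_bleu = (sacrebleu.corpus_bleu(hyp_with, refs).score
              - sacrebleu.corpus_bleu(hyp_without, refs).score)
print(f"BLEU delta: {delta_bleu:+.2f}")
```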
Related papers
- The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS
We propose an uncertainty-based retrieval framework for ancient Chinese word segmentation (CWS) and part-of-speech (POS) tagging.
On the one hand, we try to capture wordhood semantics; on the other hand, we re-predict the uncertain samples of the baseline model.
Our architecture outperforms pre-trained BERT with CRF and existing tools such as Jiayan.
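A minimal sketch of the re-prediction idea, under assumptions (the tag set, confidence threshold, and fallback predictor are all illustrative, not the paper's implementation): keep the baseline model's confident tags and route low-confidence tokens to a second predictor.

```python
import numpy as np

TAGS = ["B", "I", "O"]
THRESHOLD = 0.7  # assumed confidence cutoff, not the paper's value

def repredict_uncertain(probs, fallback_predict):
    """probs: (n_tokens, n_tags) softmax output of the baseline tagger."""
    labels = []
    for token_probs in probs:
        best = int(np.argmax(token_probs))
        if token_probs[best] >= THRESHOLD:
            labels.append(TAGS[best])                     # trust the baseline
        else:
            labels.append(fallback_predict(token_probs))  # re-predict
    return labels

# Placeholder fallback (the paper's retrieval component would go here).
fallback = lambda p: TAGS[int(np.argsort(p)[-2])]
print(repredict_uncertain(np.array([[0.90, 0.05, 0.05],
                                    [0.40, 0.35, 0.25]]), fallback))
```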
arXiv Detail & Related papers (2023-10-12T16:55:44Z)
- HistRED: A Historical Document-Level Relation Extraction Dataset
HistRED is constructed from Yeonhaengnok, a collection of records originally written in Hanja, the classical Chinese writing system historically used in Korea.
HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts.
We propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities.
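A rough sketch of one way such a bilingual model could combine the two contexts; the actual HistRED architecture may differ, and all dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class BilingualRE(nn.Module):
    """Concatenates entity-pair representations from a Korean and a Hanja encoder."""
    def __init__(self, hidden=768, n_relations=10):
        super().__init__()
        self.classifier = nn.Linear(4 * hidden, n_relations)

    def forward(self, ko_head, ko_tail, hj_head, hj_tail):
        # Each input: (batch, hidden) pooled entity-span embedding from the
        # corresponding language encoder.
        pair = torch.cat([ko_head, ko_tail, hj_head, hj_tail], dim=-1)
        return self.classifier(pair)

model = BilingualRE()
spans = [torch.randn(2, 768) for _ in range(4)]
print(model(*spans).shape)  # torch.Size([2, 10])
```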
arXiv Detail & Related papers (2023-07-10T00:24:27Z)
- Discourse Representation Structure Parsing for Chinese
We explore the feasibility of Chinese semantic parsing in the absence of labeled data for Chinese meaning representations.
We propose a test suite designed explicitly for Chinese semantic parsing, which provides fine-grained evaluation for parsing performance.
Our experimental results show that the difficulty of Chinese semantic parsing is mainly caused by adverbs.
arXiv Detail & Related papers (2023-06-16T09:47:45Z) - Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models [17.749113496737106]
We construct the first Classical-Chinese-to-Kanbun dataset in the world.
Character reordering and machine translation play a significant role in Kanbun comprehension.
We release our code and dataset on GitHub.
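Kanbun reading (kundoku) reorders Classical Chinese characters into Japanese word order, so the reordering step can be framed as predicting a permutation of the source characters. A toy sketch, with an illustrative example sentence and Kendall's tau as one possible agreement measure:

```python
from scipy.stats import kendalltau

source = list("我讀書")   # Classical Chinese (SVO): "I read books"
gold_order = [0, 2, 1]    # Japanese (SOV) reading order: 我 書 讀
pred_order = [0, 2, 1]    # a model's hypothetical prediction

print("reordered:", "".join(source[i] for i in pred_order))
tau, _ = kendalltau(gold_order, pred_order)
print(f"Kendall's tau vs. gold order: {tau:.2f}")
```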
arXiv Detail & Related papers (2023-05-22T06:30:02Z) - HUE: Pretrained Model and Dataset for Understanding Hanja Documents of
Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models further pretrained on two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats.
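A minimal sketch of continued masked-LM pretraining in the HuggingFace ecosystem; the starting checkpoint, corpus path, and hyperparameters below are assumptions, not the authors' recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "bert-base-multilingual-cased"  # assumed starting checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# "hanja_corpus.txt" is a placeholder for a historical corpus file.
ds = load_dataset("text", data_files={"train": "hanja_corpus.txt"})
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hanja-bert", num_train_epochs=1),
    train_dataset=ds["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```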
arXiv Detail & Related papers (2022-10-11T03:04:28Z) - Translating Hanja Historical Documents to Contemporary Korean and
English [52.625998002213585]
The Annals of the Joseon Dynasty contain the daily records of the kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.
The Annals were originally written in an archaic Korean writing system, Hanja, and were translated into Korean from 1968 to 1993.
Since then, translation has proceeded slowly: the records of only one king have been completed in a decade.
We propose H2KE, a neural machine translation model that translates historical documents in Hanja into more easily understandable Korean and into English.
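Inference with such a model would look like standard seq2seq generation; in this sketch the checkpoint name is hypothetical (the released H2KE weights, if any, may be packaged differently).

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "hanja-to-korean"  # hypothetical checkpoint name, for illustration only
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

hanja = "王世子入學"  # "The crown prince entered the royal academy."
out = model.generate(**tok(hanja, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```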
arXiv Detail & Related papers (2022-05-20T08:25:11Z) - Cross-Lingual Training with Dense Retrieval for Document Retrieval [56.319511218754414]
We explore different transfer techniques for document ranking from English annotations to multiple non-English languages.
We run experiments on test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families.
We find that weakly supervised target-language transfer performs competitively against generation-based target-language transfer.
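The core mechanism behind such transfer is a shared multilingual encoder: relevance learned from English pairs carries over to scoring documents in other languages. A minimal dual-encoder sketch with sentence-transformers (the model choice and examples are assumptions for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query = "history of the printing press"
docs = ["L'histoire de l'imprimerie en Europe",     # French, relevant
        "Receta tradicional de paella valenciana"]  # Spanish, irrelevant
scores = util.cos_sim(model.encode(query), model.encode(docs))
print(scores)  # the French document should score higher
```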
arXiv Detail & Related papers (2021-09-03T17:15:38Z) - KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for
Kinyarwanda and Kirundi [18.01565807026177]
We introduce two news datasets for classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages.
We provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models.
Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi.
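A toy sketch of that recipe, assuming gensim word vectors averaged into document features; the corpora and labels below are tiny placeholders, and the shared Kinyarwanda-Kirundi vocabulary is what makes the features transfer.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Placeholder "higher-resourced" corpus (tokenized Kinyarwanda headlines).
corpus = [["amakuru", "ya", "politiki"], ["imikino", "na", "siporo"]]
emb = Word2Vec(corpus, vector_size=50, min_count=1).wv

def featurize(tokens):
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

X = np.stack([featurize(doc) for doc in corpus])
clf = LogisticRegression().fit(X, [0, 1])  # toy labels: 0=politics, 1=sports

# A Kirundi document reuses the same embeddings through shared vocabulary.
print(clf.predict([featurize(["imikino", "siporo"])]))
```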
arXiv Detail & Related papers (2020-10-23T05:37:42Z) - Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word order information in natural language processing tasks by generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
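For reference, the fixed sinusoidal encoding that SANs conventionally start from; the paper's bilingually aware positions replace these fixed indices, which this sketch does not reproduce.

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """Standard Transformer position encoding (Vaswani et al., 2017)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

print(sinusoidal_pe(4, 8).round(2))  # shape: (n_positions, d_model)
```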
arXiv Detail & Related papers (2020-04-28T05:23:43Z) - Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact on existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
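The lexical-overlap effect is easy to make concrete: a simple word-overlap ratio drops when premise and hypothesis are translated independently. The sentences below are illustrative, not from XNLI.

```python
def overlap(premise, hypothesis):
    """Fraction of hypothesis words that also appear in the premise."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

joint = overlap("a man is playing a guitar", "a man is playing music")
indep = overlap("a man plays a guitar", "one person is making music")
print(f"jointly translated pair overlap:       {joint:.2f}")
print(f"independently translated pair overlap: {indep:.2f}")
```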
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.