Punctuation restoration Model and Spacing Model for Korean Ancient
Document
- URL: http://arxiv.org/abs/2312.11881v1
- Date: Tue, 19 Dec 2023 06:15:52 GMT
- Title: Punctuation restoration Model and Spacing Model for Korean Ancient
Document
- Authors: Taehong Jang, Joonmo Ahn, Sojung Lucia Kim
- Abstract summary: In Korean ancient documents, there is no spacing or punctuation, and they are written in classical Chinese characters.
While China has models predicting punctuation and spacing, applying them directly to Korean texts is problematic due to data differences.
We developed the first models which predict punctuation and spacing for Korean historical texts and evaluated their performance.
- Score: 0.5524804393257919
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In Korean ancient documents, there is no spacing or punctuation, and they are
written in classical Chinese characters. This makes it challenging for modern
individuals and translation models to accurately interpret and translate them.
While China has models predicting punctuation and spacing, applying them
directly to Korean texts is problematic due to data differences. Therefore, we
developed the first models which predict punctuation and spacing for Korean
historical texts and evaluated their performance. Our punctuation restoration
model achieved an F1 score of 0.84, and the spacing model achieved a score of 0.96.
The models also have the advantage of enabling inference on low-performance GPUs with less
VRAM while maintaining fairly high accuracy.
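As a rough illustration of what an F1 score such as the 0.84 above measures, the sketch below computes a micro-averaged F1 over per-character punctuation tags. The tagging scheme (one label per character, with "O" meaning no punctuation follows), the function name `punctuation_f1`, and the sample labels are all assumptions made for illustration; the abstract does not say how the authors computed their score.

```python
from typing import Sequence


def punctuation_f1(gold: Sequence[str], pred: Sequence[str],
                   none_label: str = "O") -> float:
    """Micro-averaged F1 over positions, ignoring the 'no punctuation' label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # True positives: a punctuation mark predicted at the right place with the right type.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != none_label)
    pred_pos = sum(1 for p in pred if p != none_label)  # marks the model predicted
    gold_pos = sum(1 for g in gold if g != none_label)  # marks in the reference
    if tp == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / gold_pos
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: one tag per character, naming the mark that follows it.
gold = ["O", "O", ",", "O", "O", "。"]
pred = ["O", ",", ",", "O", "O", "。"]
score = punctuation_f1(gold, pred)
```

In this toy case the prediction recovers both reference marks but adds one spurious comma, so recall is perfect while precision drops, and F1 lands between the two.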
Related papers
- Xmodel-1.5: An 1B-scale Multilingual LLM [4.298869484709548]
We introduce Xmodel-1.5, a multilingual large language model pretrained on 2 trillion tokens.
Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy.
The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English.
arXiv Detail & Related papers (2024-11-15T10:01:52Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction [0.0]
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities.
We propose two data augmentation methods to address these limitations.
Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos.
arXiv Detail & Related papers (2024-09-08T14:29:10Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models further trained on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
- Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic N-Gram Rule Generation for Spelling Normalization in Filipino [0.0]
With 84.75 million Filipinos online, the ability for models to process online text is crucial for developing Filipino NLP applications.
We propose an N-Gram + Damerau Levenshtein distance model with automatic rule extraction.
arXiv Detail & Related papers (2022-10-06T04:41:26Z)
- Translating Hanja Historical Documents to Contemporary Korean and English [52.625998002213585]
The Annals of the Joseon Dynasty contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.
The Annals were originally written in an archaic Korean writing system, Hanja, and were translated into Korean from 1968 to 1993.
Since then, the records of only one king have been completed in a decade.
We propose H2KE, a neural machine translation model, that translates historical documents in Hanja to more easily understandable Korean and to English.
arXiv Detail & Related papers (2022-05-20T08:25:11Z)
- From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- An Alignment-Agnostic Model for Chinese Text Error Correction [17.429266115653007]
This paper investigates how to correct Chinese text errors involving mistaken, missing, and redundant characters.
Most existing models can correct mistaken-character errors, but they cannot deal with missing or redundant characters.
We propose a novel detect-correct framework which is alignment-agnostic, meaning that it can handle both text-aligned and non-aligned cases.
arXiv Detail & Related papers (2021-04-15T01:17:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.