Related papers: Punctuation restoration Model and Spacing Model for Korean Ancient Document

URL: http://arxiv.org/abs/2312.11881v1
Date: Tue, 19 Dec 2023 06:15:52 GMT
Title: Punctuation restoration Model and Spacing Model for Korean Ancient Document
Authors: Taehong Jang, Joonmo Ahn, Sojung Lucia Kim
Abstract summary: In Korean ancient documents, there is no spacing or punctuation, and they are written in classical Chinese characters. While China has models predicting punctuation and spacing, applying them directly to Korean texts is problematic due to data differences. We developed the first models which predict punctuation and spacing for Korean historical texts and evaluated their performance.
Score: 0.5524804393257919
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In Korean ancient documents, there is no spacing or punctuation, and they are written in classical Chinese characters. This makes it challenging for modern individuals and translation models to accurately interpret and translate them. While China has models predicting punctuation and spacing, applying them directly to Korean texts is problematic due to data differences. Therefore, we developed the first models which predict punctuation and spacing for Korean historical texts and evaluated their performance. Our punctuation restoration model achieved an F1 score of 0.84, and the spacing model achieved a score of 0.96. These models have the advantage of enabling inference on low-performance GPUs with less VRAM while maintaining quite high accuracy.

Related papers

Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts [0.0]
This study aimed to classify Estonian proficiency examination writings (levels A2-C1). Various linguistic properties of the training data were analyzed to identify relevant proficiency predictors. The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.
arXiv Detail & Related papers (2026-02-13T17:06:17Z)
PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise [60.63315470285562]
MiniTruePrefixes is a novel specialized model that better detects factual inconsistencies over text prefixes. We show that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization.
arXiv Detail & Related papers (2025-11-03T09:07:44Z)
Xmodel-1.5: An 1B-scale Multilingual LLM [4.298869484709548]
We introduce Xmodel-1.5, a multilingual large language model pretrained on 2 trillion tokens. Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English.
arXiv Detail & Related papers (2024-11-15T10:01:52Z)
When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun. Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction [0.0]
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. We propose two data augmentation methods to address these limitations. Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos.
arXiv Detail & Related papers (2024-09-08T14:29:10Z)
Fine-tuning Language Models for Factuality [96.5203774943198]
The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'. In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic N-Gram Rule Generation for Spelling Normalization in Filipino [0.0]
With 84.75 million Filipinos online, the ability for models to process online text is crucial for developing Filipino NLP applications. We propose an N-Gram + Damerau-Levenshtein distance model with automatic rule extraction.
arXiv Detail & Related papers (2022-10-06T04:41:26Z)
Translating Hanja Historical Documents to Contemporary Korean and English [52.625998002213585]
The Annals of Joseon Dynasty contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, 'Hanja', and were translated into Korean from 1968 to 1993. Since then, the records of only one king have been completed in a decade. We propose H2KE, a neural machine translation model that translates historical documents in Hanja to more easily understandable Korean and to English.
arXiv Detail & Related papers (2022-05-20T08:25:11Z)
From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance. The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer. The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages. We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
An Alignment-Agnostic Model for Chinese Text Error Correction [17.429266115653007]
This paper investigates how to correct Chinese text errors with types of mistaken, missing and redundant characters. Most existing models can correct mistaken characters errors, but they cannot deal with missing or redundant characters. We propose a novel detect-correct framework which is alignment-agnostic, meaning that it can handle both text aligned and non-aligned occasions.
arXiv Detail & Related papers (2021-04-15T01:17:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.