The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS
- URL: http://arxiv.org/abs/2310.08496v1
- Date: Thu, 12 Oct 2023 16:55:44 GMT
- Title: The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS
- Authors: Pengyu Wang, Zhichen Ren
- Abstract summary: We propose a framework for ancient Chinese Word Segmentation and Part-of-Speech Tagging.
On the one hand, we try to capture the wordhood semantics; on the other hand, we re-predict the uncertain samples of the baseline model.
Our architecture outperforms pre-trained BERT with CRF and existing tools such as Jiayan.
- Score: 3.9227136203353865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic analysis for modern Chinese has greatly improved the accuracy of
text mining in related fields, but the study of ancient Chinese is still
relatively rare. Ancient text division and lexical annotation are important
parts of classical literature comprehension, and previous studies have tried to
construct auxiliary dictionaries and other fused knowledge to improve
performance. In this paper, we propose a framework for ancient Chinese Word
Segmentation and Part-of-Speech Tagging that makes a twofold effort: on the one
hand, we try to capture the wordhood semantics; on the other hand, we
re-predict the uncertain samples of the baseline model by introducing external
knowledge. Our architecture outperforms pre-trained BERT
with CRF and existing tools such as Jiayan.
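The abstract does not spell out how uncertain samples are selected or re-predicted. The following is only a minimal sketch of the general idea, under stated assumptions: uncertainty is measured by the entropy of the baseline's per-character label distribution, and the "external knowledge" is a plain dictionary lookup. The function names, the `lexicon` argument, the label set, and the threshold are all hypothetical and are not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one character's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def repredict_uncertain(chars, label_probs, labels, lexicon, threshold=0.5):
    """Keep the baseline's argmax label for confident characters.

    For characters whose entropy exceeds `threshold` (an assumed cutoff),
    fall back to a lexicon lookup when the character is listed there.
    """
    predictions = []
    for ch, probs in zip(chars, label_probs):
        best = max(range(len(labels)), key=lambda i: probs[i])
        pred = labels[best]
        if token_entropy(probs) > threshold and ch in lexicon:
            pred = lexicon[ch]  # external knowledge overrides the baseline
        predictions.append(pred)
    return predictions


# Toy input: the baseline is confident about the first character only.
chars = ["子", "曰"]
labels = ["B", "E", "S"]  # a reduced segmentation tag set, for illustration
label_probs = [
    [0.05, 0.05, 0.90],  # low entropy: baseline label is kept
    [0.45, 0.15, 0.40],  # high entropy: candidate for re-prediction
]
lexicon = {"曰": "S"}  # hypothetical external dictionary entry

result = repredict_uncertain(chars, label_probs, labels, lexicon)
```

With this toy input only the second character is flagged as uncertain, so the lexicon replaces the baseline's label for it while the first character's label is left alone. A real system would likely use a stronger re-prediction step than a direct lookup.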
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world.
A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today.
This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
Irrelevant or trivial words may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- Improving Chinese Story Generation via Awareness of Syntactic Dependencies and Semantics [17.04903530992664]
We present a new generation framework that enhances the feature mechanism by informing the generation model of dependencies between words.
We conduct a range of experiments, and the results demonstrate that our framework outperforms the state-of-the-art Chinese generation models on all evaluation metrics.
arXiv Detail & Related papers (2022-10-19T15:01:52Z)
- Contextual Similarity is More Valuable than Character Similarity: Curriculum Learning for Chinese Spell Checking [26.93594761258908]
The Chinese Spell Checking (CSC) task aims to detect and correct Chinese spelling errors.
To make better use of contextual similarity, we propose a simple yet effective curriculum learning framework for the CSC task.
With the help of our designed model-agnostic framework, existing CSC models will be trained from easy to difficult as humans learn Chinese characters.
arXiv Detail & Related papers (2022-07-17T03:12:27Z)
- HistBERT: A Pre-trained Language Model for Diachronic Lexical Semantic Analysis [3.2851864672627618]
We present a pre-trained BERT-based language model, HistBERT, trained on the balanced Corpus of Historical American English.
We report promising results in word similarity and semantic shift analysis.
arXiv Detail & Related papers (2022-02-08T02:53:48Z)
- Extract, Integrate, Compete: Towards Verification Style Reading Comprehension [66.2551168928688]
We present a new verification style reading comprehension dataset named VGaokao from Chinese Language tests of Gaokao.
To address the challenges in VGaokao, we propose a novel Extract-Integrate-Compete approach.
arXiv Detail & Related papers (2021-09-11T01:34:59Z)
- Time-Aware Ancient Chinese Text Translation and Inference [6.787414471399024]
We aim to address the challenges surrounding the translation of ancient Chinese text.
The linguistic gap due to the difference in eras results in translations that are poor in quality.
Most translations are missing the contextual information that is often very crucial to understanding the text.
arXiv Detail & Related papers (2021-07-07T12:23:52Z)
- Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection [5.847824494580938]
We propose a benchmark consisting of paraphrased articles using recent language models relying on the Transformer architecture.
Our contribution fosters future research of paraphrase detection systems as it offers a large collection of aligned original and paraphrased documents.
arXiv Detail & Related papers (2021-03-23T11:01:35Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)