A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950
- URL: http://arxiv.org/abs/2503.19844v1
- Date: Tue, 25 Mar 2025 17:07:21 GMT
- Title: A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950
- Authors: Zhao Fang, Liang-Chun Wu, Xuening Kong, Spencer Dean Stewart,
- Abstract summary: This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER)<n> Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.
Related papers
- Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan [0.1979158763744267]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing.<n>This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan.
arXiv Detail & Related papers (2025-03-10T20:16:01Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)<n>We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.<n>We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - Adapting Multilingual Embedding Models to Historical Luxembourgish [5.474797258314828]
This study examines multilingual embeddings for cross-lingual semantic search in historical Luxembourgish.<n>We use GPT-4o for sentence segmentation and translation, generating 20,000 parallel training sentences per language pair.<n>We adapt several multilingual embedding models through contrastive learning or knowledge distillation and increase accuracy significantly for all models.
arXiv Detail & Related papers (2025-02-11T20:35:29Z) - NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach [0.03187482513047917]
We show how readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks for NER in historical documents.<n>Our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools.
arXiv Detail & Related papers (2025-02-04T16:54:23Z) - Large Language Models for Stemming: Promises, Pitfalls and Failures [34.91311006478368]
We investigate the promising idea of using large language models (LLMs) to stem words by leveraging its capability of context understanding.
We compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text.
arXiv Detail & Related papers (2024-02-19T01:11:44Z) - Salute the Classic: Revisiting Challenges of Machine Translation in the
Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges.
These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search.
This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.<n>This survey delves into an important attribute of these datasets: the dialect of a language.<n>Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Lexicon and Rule-based Word Lemmatization Approach for the Somali
Language [0.0]
Lemmatization is a technique used to normalize text by changing morphological derivations of words to their root forms.
This paper pioneers the development of text lemmatization for the Somali language.
We have developed an initial lexicon of 1247 root words and 7173 derivationally related terms enriched with rules for lemmatizing words not present in the lexicon.
arXiv Detail & Related papers (2023-08-03T14:31:57Z) - An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z) - Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage
Span Labeling [0.2624902795082451]
We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging.
Our experiments show that our BERT-based model SpanSegTag achieved competitive performances on the CTB5, CTB6, and UD datasets.
arXiv Detail & Related papers (2021-12-17T12:59:02Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.