Corpus of Chinese Dynastic Histories: Gender Analysis over Two Millennia
- URL: http://arxiv.org/abs/2005.08793v1
- Date: Mon, 18 May 2020 15:14:33 GMT
- Title: Corpus of Chinese Dynastic Histories: Gender Analysis over Two Millennia
- Authors: Sergey Zinin, Yang Xu
- Abstract summary: Chinese dynastic histories form a large continuous linguistic space of approximately 2000 years, from the 3rd century BCE to the 18th century CE.
The histories are documented in Classical (Literary) Chinese in a corpus of over 20 million characters, suitable for the computational analysis of historical lexicon and semantic change.
This project introduces a new open-source corpus of twenty-four dynastic histories covered by a Creative Commons license.
- Score: 3.2851864672627618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese dynastic histories form a large continuous linguistic space of
approximately 2000 years, from the 3rd century BCE to the 18th century CE. The
histories are documented in Classical (Literary) Chinese in a corpus of over 20
million characters, suitable for the computational analysis of historical
lexicon and semantic change. However, there is no freely available open-source
corpus of these histories, making Classical Chinese low-resource. This project
introduces a new open-source corpus of twenty-four dynastic histories covered
by a Creative Commons license. An original list of Classical Chinese
gender-specific terms was developed as a case study for analyzing the
historical linguistic use of male and female terms. The study demonstrates
considerable stability in the usage of these terms, with dominance of male
terms. Exploration of word meanings uses keyword analysis of focus corpora
created for gender-specific terms. This method yields meaningful semantic
representations that can be used for future studies of diachronic semantics.
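The abstract's two steps (counting gender-specific terms, then running keyword analysis on a focus corpus built around a term) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the term lists, the character-window definition of a focus corpus, and the choice of Dunning's log-likelihood (G2) as the keyness score are all assumptions made here, and every function name is hypothetical.

```python
import math
from collections import Counter

# Hypothetical term lists for illustration only; NOT the paper's list.
MALE_TERMS = ["父", "兄", "夫"]
FEMALE_TERMS = ["母", "女", "妻"]


def term_counts(text, terms):
    """Count non-overlapping occurrences of each term in the text."""
    return {t: text.count(t) for t in terms}


def focus_corpus(text, term, window=5):
    """Concatenate the characters within `window` positions of each
    occurrence of `term`, excluding the term itself (assumed definition).
    Overlapping windows are not merged, so characters may repeat."""
    chunks = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        lo = max(0, start - window)
        hi = min(len(text), end + window)
        chunks.append(text[lo:start] + text[end:hi])
        start = text.find(term, end)
    return "".join(chunks)


def log_likelihood(a, b, c, d):
    """Dunning's G2 for an item seen `a` times in a focus corpus of
    size `c` and `b` times in a reference corpus of size `d`."""
    e1 = c * (a + b) / (c + d)  # expected count in the focus corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(focus, reference, top=5):
    """Rank characters that are relatively more frequent in `focus`
    than in `reference`, highest G2 first."""
    f, r = Counter(focus), Counter(reference)
    nf, nr = len(focus), len(reference)
    scored = []
    for ch, a in f.items():
        b = r.get(ch, 0)
        if a / nf > b / nr:  # keep only items overrepresented in the focus
            scored.append((ch, log_likelihood(a, b, nf, nr)))
    return sorted(scored, key=lambda item: -item[1])[:top]
```

A call such as `keywords(focus_corpus(text, "母"), text)` would then list the characters most characteristic of the contexts around one term. Single characters are used as units only to keep the sketch free of a word segmenter.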
Related papers
- Investigating Literary Motifs in Ancient and Medieval Novels with Large Language Models [0.0]
The Greek fictional narratives often termed love novels or romances, ranging from the first century CE to the middle of the 15th century, have long been considered as similar in many ways.
This study aims to investigate exactly which motifs the texts in this corpus have in common, and in which ways they differ from each other.
arXiv Detail & Related papers (2025-04-30T15:39:06Z)
- A Methodology for Studying Linguistic and Cultural Change in China, 1900-1950 [0.0]
This paper presents a quantitative approach to studying linguistic and cultural change in China during the first half of the twentieth century.
The dramatic changes in Chinese language and culture during this time call for greater reflection on the tools and methods used for text analysis.
arXiv Detail & Related papers (2025-02-06T18:33:50Z)
- Unveiling Temporal Trends in 19th Century Literature: An Information Retrieval Approach [5.804963603084041]
In English literature, the 19th century witnessed a significant transition in styles, themes, and genres.
This paper explores the evolution of term usage in 19th century English novels through the lens of information retrieval.
arXiv Detail & Related papers (2025-01-12T15:00:10Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Comparative Analysis of Static and Contextual Embeddings for Analyzing Semantic Changes in Medieval Latin Charters [6.883666189245419]
This paper presents the first computational analysis of semantic change pre- and post-Norman Conquest.
It is the first systematic comparison of static and contextual embeddings in a scarce historical data set.
Our findings confirm that, consistent with existing studies, contextual embeddings outperform static word embeddings in capturing semantic change.
arXiv Detail & Related papers (2024-10-11T22:19:17Z)
- What an Elegant Bridge: Multilingual LLMs are Biased Similarly in Different Languages [51.0349882045866]
This paper investigates biases of Large Language Models (LLMs) through the lens of grammatical gender.
We prompt a model to describe nouns with adjectives in various languages, focusing specifically on languages with grammatical gender.
We find that a simple classifier can not only predict noun gender above chance but also exhibit cross-language transferability.
arXiv Detail & Related papers (2024-07-12T22:10:16Z)
- Temporal Concept Drift and Alignment: An empirical approach to comparing Knowledge Organization Systems over time [0.0]
This research explores temporal concept drift and temporal alignment in knowledge organization systems (KOS).
A comparative analysis is pursued using the 1910 Library of Congress Subject Headings, 2020 FAST Topical, and automatic indexing.
Results confirm that historical vocabularies can be used to generate anachronistic subject headings representing conceptual drift across time in KOS and historical resources.
arXiv Detail & Related papers (2022-08-16T16:37:17Z)
- O-Dang! The Ontology of Dangerous Speech Messages [53.15616413153125]
We present O-Dang!: The Ontology of Dangerous Speech Messages, a systematic and interoperable Knowledge Graph (KG).
O-Dang! is designed to gather and organize Italian datasets into a structured KG, according to the principles shared within the Linguistic Linked Open Data community.
It provides a model for encoding both gold standard and single-annotator labels in the KG.
arXiv Detail & Related papers (2022-07-13T11:50:05Z)
- A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z)
- From Plenipotentiary to Puddingless: Users and Uses of New Words in Early English Letters [0.0]
We study neologism use in two samples of early English correspondence, from 1640–1660 and 1760–1780.
In both samples, neologisms most frequently occur in letters written between close friends.
In the seventeenth-century sample, we observe the influence of the English Civil War, while the eighteenth-century sample appears to reflect the changing functions of letter-writing.
arXiv Detail & Related papers (2021-03-17T21:45:06Z)
- Lexical semantic change for Ancient Greek and Latin [61.69697586178796]
Associating a word's correct meaning in its historical context is a central challenge in diachronic research.
We build on a recent computational approach to semantic change based on a dynamic Bayesian mixture model.
We provide a systematic comparison of dynamic Bayesian mixture models for semantic change with state-of-the-art embedding-based models.
arXiv Detail & Related papers (2021-01-22T12:04:08Z)
- Evolution of Part-of-Speech in Classical Chinese [2.870517198186329]
Bisang (2008) claimed that Classical Chinese is a precategorical language, where the syntactic position of a word determines its part-of-speech category.
We apply entropy-based metrics to evaluate these claims on historical corpora.
arXiv Detail & Related papers (2020-09-23T13:41:27Z)
- A frame semantics based approach to comparative study of digitized corpus [0.0]
The paper focuses on the morphologic, syntactic, and semantic annotation process of an English-Arabic aligned corpus created from digitized novels.
The present study argues that differences in motion events conceptualization across languages can be described with frame structure and frame-to-frame relations.
arXiv Detail & Related papers (2020-05-29T22:56:25Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.