Transfer Learning across Several Centuries: Machine and Historian
Integrated Method to Decipher Royal Secretary's Diary
- URL: http://arxiv.org/abs/2306.14592v1
- Date: Mon, 26 Jun 2023 11:00:35 GMT
- Title: Transfer Learning across Several Centuries: Machine and Historian
Integrated Method to Decipher Royal Secretary's Diary
- Authors: Sojung Lucia Kim and Taehong Jang and Joonmo Ahn and Hyungil Lee and
Jaehyuk Lee
- Abstract summary: NER in historical text faces challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions far different from those assumed by contemporary language models.
This paper introduces a Korean historical corpus (the diary of the royal secretariat, named SeungJeongWon) recorded over several centuries and recently augmented with named entity information as well as phrase markers carefully annotated by historians.
- Score: 1.105375732595832
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Named entity recognition and classification plays a foremost role in
capturing the semantics of data, anchoring translation, and supporting downstream
historical study. However, NER in historical text has faced challenges such as the
scarcity of annotated corpora, multilingual variety, various kinds of noise, and
conventions far different from those assumed by contemporary language models. This
paper introduces a Korean historical corpus (the diary of the royal secretariat,
named SeungJeongWon) recorded over several centuries and recently augmented with
named entity information as well as phrase markers carefully annotated by
historians. We fine-tuned a language model on the historical corpus and conducted
extensive comparative experiments using our language model and pretrained
multilingual models. We set up a hypothesis about the combination of time and
annotation information and tested it with a statistical t-test. Our findings show
that phrase markers clearly improve the performance of the NER model in predicting
unseen entities in documents written in a far different time period. They also
show that neither the phrase markers nor the corpus-specific trained model alone
improves performance. We discuss future research directions and practical
strategies for deciphering historical documents.
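As a rough illustration of the pipeline the abstract describes (fine-tuning a pretrained multilingual model for NER on the historical corpus, then testing the phrase-marker hypothesis with a t-test), the sketch below shows one possible setup. It is a minimal sketch, not the authors' code: the model name, label set, hyperparameters, and the F1 numbers are illustrative assumptions only.

```python
"""Minimal sketch (not the authors' code): fine-tune a pretrained multilingual
encoder for NER on a historical corpus, then compare two conditions with a
paired t-test. Model name, label set, hyperparameters, and the F1 numbers
below are illustrative assumptions only."""
from scipy import stats
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Example BIO tag set; the SeungJeongWon annotation scheme may differ.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

args = TrainingArguments(output_dir="ner-seungjeongwon",
                         learning_rate=3e-5,
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

# train_ds / eval_ds would be tokenized, label-aligned splits of the annotated
# corpus (construction not shown here).
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()

# Hypothesis test: do phrase markers help on documents from an unseen time
# period?  Paired t-test over per-document F1 scores (numbers are made up).
f1_with_markers = [0.71, 0.68, 0.74, 0.66, 0.70]
f1_without_markers = [0.63, 0.61, 0.69, 0.60, 0.64]
t_stat, p_value = stats.ttest_rel(f1_with_markers, f1_without_markers)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```

A paired test is used here because the two conditions would be scored on the same documents; the abstract does not specify which t-test variant the authors applied.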
Related papers
- Contrastive Entity Coreference and Disambiguation for Historical Texts [2.446672595462589]
Existing entity disambiguation methods often fall short in accuracy for historical documents, which are replete with individuals not remembered in contemporary knowledge bases.
This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts.
arXiv Detail & Related papers (2024-06-21T18:22:14Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - People and Places of Historical Europe: Bootstrapping Annotation
Pipeline and a New Corpus of Named Entities in Late Medieval Texts [0.0]
We develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap a NER-annotated corpus.
arXiv Detail & Related papers (2023-05-26T08:05:01Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - hmBERT: Historical Multilingual Language Models for Named Entity
Recognition [0.6226609932118123]
We tackle NER for identifying persons, locations, and organizations in historical texts.
Specifically, we address historical German, English, French, Swedish, and Finnish by training large historical language models.
arXiv Detail & Related papers (2022-05-31T07:30:33Z) - HistBERT: A Pre-trained Language Model for Diachronic Lexical Semantic
Analysis [3.2851864672627618]
We present a pre-trained BERT-based language model, HistBERT, trained on the balanced Corpus of Historical American English.
We report promising results in word similarity and semantic shift analysis.
arXiv Detail & Related papers (2022-02-08T02:53:48Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Summarising Historical Text in Modern Languages [13.886432536330805]
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This is a fundamentally important task for historians and digital humanities researchers, but it has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
arXiv Detail & Related papers (2021-01-26T13:00:07Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
Specifically, the input is a set of structured records and a reference text describing another record set.
The output is a summary that accurately describes part of the content of the source records in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)