Automated Transcription of Non-Latin Script Periodicals: A Case Study in
the Ottoman Turkish Print Archive
- URL: http://arxiv.org/abs/2011.01139v1
- Date: Mon, 2 Nov 2020 17:28:36 GMT
- Title: Automated Transcription of Non-Latin Script Periodicals: A Case Study in
the Ottoman Turkish Print Archive
- Authors: Suphan Kirmizialtin, David Wrisley
- Abstract summary: Our study utilizes deep learning methods for the automated transcription of periodicals written in Arabic script Ottoman Turkish (OT) using the Transkribus platform.
We discuss the historical situation of OT text collections and how they were excluded for the most part from the late twentieth century corpora digitization.
This exclusion has two basic reasons: the technical challenges of OCR for Arabic script languages, and the rapid abandonment of that very script in the Turkish historical context.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our study utilizes deep learning methods for the automated transcription of
late nineteenth- and early twentieth-century periodicals written in Arabic
script Ottoman Turkish (OT) using the Transkribus platform. We discuss the
historical situation of OT text collections and how they were excluded for the
most part from the late twentieth century corpora digitization that took place
in many Latin script languages. This exclusion has two basic reasons: the
technical challenges of OCR for Arabic script languages, and the rapid
abandonment of that very script in the Turkish historical context. In the
specific case of OT, opening periodical collections to digital tools requires
training HTR models to generate transcriptions in the Latin writing system of
contemporary readers of Turkish, and not, as some may expect, in right-to-left
Arabic script text. In the paper we discuss the challenges of training such
models where one-to-one correspondence between the writing systems does not
exist, and we report results based on our HTR experiments with two OT
periodicals from the early twentieth century. Finally, we reflect on potential
domain bias of HTR models in historical languages exhibiting spatio-temporal
variance as well as the significance of working between writing systems for
language communities that have experienced language reform and script change.
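The abstract does not say which metric the reported HTR results use. Assuming they are given as character error rate (CER), a common choice for HTR evaluation, the following minimal sketch shows how CER can be computed between a reference transcription and a model output. The function names and the sample strings are illustrative only and are not taken from the paper.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `ref` into `hyp` (classic dynamic-programming edit distance)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into "" takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length.

    Both strings are NFC-normalized first so that letters with diacritics
    (for example the Turkish "ü") count as one character each.
    """
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one wrong vowel in a three-letter word.
    print(character_error_rate("gül", "gel"))
```

Because CER is normalized by the reference length, a single wrong vowel in a short Latin-script word weighs heavily, which matters when the source script leaves vowels underspecified.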
Related papers
- Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification [66.69370876902222]
We perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages.
We assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches.
Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines.
arXiv Detail & Related papers (2025-07-21T12:38:07Z) - ParsiPy: NLP Toolkit for Historical Persian Texts in Python [1.637832760977605]
This work introduces ParsiPy, an NLP toolkit to handle phonetic transcriptions and analyze ancient texts.
ParsiPy offers modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding.
arXiv Detail & Related papers (2025-03-22T16:21:29Z) - Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models [0.0]
This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish.
We present the first named entity recognition (NER) dataset, HisTR and the first Universal Dependencies treebank, OTA-BOUN for a historical form of the Turkish language.
We also introduce Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts.
arXiv Detail & Related papers (2025-01-08T20:29:00Z) - Detecting Turkish Synonyms Used in Different Time Periods [0.0]
Turkish is a prominent example of rapid linguistic transformation due to the language reform in the 20th century.
We propose two methods for detecting synonyms used in different time periods, focusing on Turkish.
arXiv Detail & Related papers (2024-11-24T09:31:38Z) - Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset [1.174020933567308]
This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts.
A dataset was created, KHAMIS, which consists of handwritten sentences in the East Syriac script.
The data was collected from volunteers capable of reading and writing in the language to create KHAMIS.
The handwritten OCR model achieved character error rates of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively.
arXiv Detail & Related papers (2024-08-24T17:17:46Z) - Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - Turkronicles: Diachronic Resources for the Fast Evolving Turkish Language [0.0]
We investigate the evolution of the Turkish language since the establishment of Türkiye in 1923.
Our analysis reveals that the vocabularies of two different time periods diverge more as the time between them increases.
In particular, the use of the circumflex noticeably decreases, and words ending with the letters "-b" and "-d" are successively replaced with "-p" and "-t".
arXiv Detail & Related papers (2024-05-16T14:31:07Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - New Results for the Text Recognition of Arabic Maghribī Manuscripts -- Managing an Under-resourced Script [0.0]
We introduce and assess a new modus operandi for HTR model development and fine-tuning dedicated to the Arabic Maghribī scripts.
The comparison between several state-of-the-art HTR models demonstrates the relevance of a word-based neural approach specialized for Arabic.
The results open new perspectives for processing Arabic scripts and, more generally, poorly endowed languages.
arXiv Detail & Related papers (2022-11-29T12:21:41Z) - Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z) - HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models with continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z) - Summarising Historical Text in Modern Languages [13.886432536330805]
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This is a fundamentally important routine to historians and digital humanities researchers but has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
arXiv Detail & Related papers (2021-01-26T13:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.