Automated Transcription of Non-Latin Script Periodicals: A Case Study in
the Ottoman Turkish Print Archive
- URL: http://arxiv.org/abs/2011.01139v1
- Date: Mon, 2 Nov 2020 17:28:36 GMT
- Title: Automated Transcription of Non-Latin Script Periodicals: A Case Study in
the Ottoman Turkish Print Archive
- Authors: Suphan Kirmizialtin, David Wrisley
- Abstract summary: Our study utilizes deep learning methods for the automated transcription of periodicals written in Arabic script Ottoman Turkish (OT) using the Transkribus platform.
We discuss the historical situation of OT text collections and how they were excluded for the most part from the late twentieth century corpora digitization.
This exclusion has two basic reasons: the technical challenges of OCR for Arabic script languages, and the rapid abandonment of that very script in the Turkish historical context.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our study utilizes deep learning methods for the automated transcription of
late nineteenth- and early twentieth-century periodicals written in Arabic
script Ottoman Turkish (OT) using the Transkribus platform. We discuss the
historical situation of OT text collections and how they were excluded for the
most part from the late twentieth century corpora digitization that took place
in many Latin script languages. This exclusion has two basic reasons: the
technical challenges of OCR for Arabic script languages, and the rapid
abandonment of that very script in the Turkish historical context. In the
specific case of OT, opening periodical collections to digital tools requires
training HTR models to generate transcriptions in the Latin writing system of
contemporary readers of Turkish, and not, as some may expect, in right-to-left
Arabic script text. In the paper we discuss the challenges of training such
models where one-to-one correspondence between the writing systems does not
exist, and we report results based on our HTR experiments with two OT
periodicals from the early twentieth century. Finally, we reflect on potential
domain bias of HTR models in historical languages exhibiting spatio-temporal
variance as well as the significance of working between writing systems for
language communities that have experienced language reform and script change.
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - Turkronicles: Diachronic Resources for the Fast Evolving Turkish Language [0.0]
We investigate the evolution of the Turkish language since the establishment of Türkiye in 1923.
Our analysis reveals that the vocabularies of two different time periods diverge more as the time between them increases.
In particular, the use of the circumflex noticeably decreases and words ending with the letters "-b" and "-d" are successively replaced with "-p" and "-t".
arXiv Detail & Related papers (2024-05-16T14:31:07Z) - Multilingual Text-to-Speech Synthesis for Turkic Languages Using
Transliteration [3.0122461286351796]
This work aims to build a multilingual text-to-speech (TTS) synthesis system for ten lower-resourced Turkic languages.
We specifically target the zero-shot learning scenario, where a TTS model trained using the data of one language is applied to synthesise speech for other, unseen languages.
An end-to-end TTS system based on the Tacotron 2 architecture was trained using only the available data of the Kazakh language.
arXiv Detail & Related papers (2023-05-25T05:57:54Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - New Results for the Text Recognition of Arabic Maghribī Manuscripts
-- Managing an Under-resourced Script [0.0]
We introduce and assess a new modus operandi for HTR model development and fine-tuning dedicated to the Arabic Maghribī scripts.
The comparison between several state-of-the-art HTR models demonstrates the relevance of a word-based neural approach specialized for Arabic.
Results open new perspectives for Arabic scripts processing and more generally for poorly-endowed languages processing.
arXiv Detail & Related papers (2022-11-29T12:21:41Z) - Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z) - HUE: Pretrained Model and Dataset for Understanding Hanja Documents of
Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models with continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z) - Summarising Historical Text in Modern Languages [13.886432536330805]
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This is a fundamentally important routine to historians and digital humanities researchers but has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
arXiv Detail & Related papers (2021-01-26T13:00:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.