Turkronicles: Diachronic Resources for the Fast Evolving Turkish Language
- URL: http://arxiv.org/abs/2405.10133v1
- Date: Thu, 16 May 2024 14:31:07 GMT
- Title: Turkronicles: Diachronic Resources for the Fast Evolving Turkish Language
- Authors: Togay Yazar, Mucahid Kutlu, İsa Kerem Bayırlı
- Abstract summary: We investigate the evolution of the Turkish language since the establishment of Türkiye in 1923.
Our analysis reveals that the vocabularies of two different time periods diverge more as the time between them increases.
In particular, the use of the circumflex noticeably decreases, and words ending with the letters "-b" and "-d" are successively replaced with "-p" and "-t", respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past century, the Turkish language has undergone substantial changes, primarily driven by governmental interventions. In this work, our goal is to investigate the evolution of the Turkish language since the establishment of Türkiye in 1923. Thus, we first introduce Turkronicles which is a diachronic corpus for Turkish derived from the Official Gazette of Türkiye. Turkronicles contains 45,375 documents, detailing governmental actions, making it a pivotal resource for analyzing the linguistic evolution influenced by the state policies. In addition, we expand an existing diachronic Turkish corpus which consists of the records of the Grand National Assembly of Türkiye by covering additional years. Next, combining these two diachronic corpora, we seek answers for two main research questions: How have the Turkish vocabulary and the writing conventions changed since the 1920s? Our analysis reveals that the vocabularies of two different time periods diverge more as the time between them increases, and newly coined Turkish words take the place of their old counterparts. We also observe changes in writing conventions. In particular, the use of circumflex noticeably decreases and words ending with the letters "-b" and "-d" are successively replaced with "-p" and "-t" letters, respectively. Overall, this study quantitatively highlights the dramatic changes in Turkish from various aspects of the language in a diachronic perspective.
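The three measurements named in the abstract (vocabulary divergence between periods, circumflex usage, and the "-b"/"-d" to "-p"/"-t" shift) can be illustrated with a minimal Python sketch. This is not the authors' method: Jaccard distance is a swapped-in, generic divergence measure, the helper names are hypothetical, and the two toy "documents" are invented for illustration rather than taken from Turkronicles.

```python
import re

CIRCUMFLEX = set("âîû")          # circumflexed vowels after lowercasing
DEVOICE = {"b": "p", "d": "t"}   # word-final letter shift named in the abstract


def turkish_lower(text):
    # Python's default lower() mishandles the Turkish dotted/dotless I pair.
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize(text):
    # Runs of letters only; digits, underscores and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", turkish_lower(text))


def jaccard_distance(vocab_a, vocab_b):
    # 0.0 means identical vocabularies, 1.0 means no shared word types.
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def circumflex_rate(tokens):
    # Fraction of tokens containing at least one circumflexed vowel.
    if not tokens:
        return 0.0
    return sum(any(ch in CIRCUMFLEX for ch in tok) for tok in tokens) / len(tokens)


def devoiced_pairs(old_vocab, new_vocab):
    # Old words ending in -b/-d whose -p/-t spelling appears in the new period.
    pairs = []
    for word in old_vocab:
        if word[-1] in DEVOICE:
            candidate = word[:-1] + DEVOICE[word[-1]]
            if candidate in new_vocab:
                pairs.append((word, candidate))
    return sorted(pairs)


# Invented toy text standing in for an early and a late period (assumption).
old_tokens = tokenize("kitab ve mektub hâkim")
new_tokens = tokenize("kitap ve mektup hakim")
old_vocab, new_vocab = set(old_tokens), set(new_tokens)

print(jaccard_distance(old_vocab, new_vocab))
print(circumflex_rate(old_tokens), circumflex_rate(new_tokens))
print(devoiced_pairs(old_vocab, new_vocab))
```

On real data, the same functions would be applied per time period (for example per decade), with the distance computed for every pair of periods to see whether it grows with the gap between them.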
Related papers
- Turkish Delights: a Dataset on Turkish Euphemisms [1.7614751781649955]
This research extends the current computational work on potentially euphemistic terms (PETs) to Turkish.
We introduce the Turkish PET dataset, the first available of its kind in the field.
We provide both euphemistic and non-euphemistic examples of PETs in Turkish.
arXiv Detail & Related papers (2024-07-17T22:13:42Z) - TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish [54.51310112013655]
We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU.
TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula.
We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models.
arXiv Detail & Related papers (2024-07-17T08:28:55Z) - Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as 4 newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z) - Turkish Native Language Identification [0.0]
We present the first application of Native Language Identification (NLI) for the Turkish language.
We employ a combination of three syntactic features (CFG production rules, part-of-speech n-grams, and function words) with L2 texts to demonstrate their effectiveness.
arXiv Detail & Related papers (2023-07-27T13:28:31Z) - Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z) - HUE: Pretrained Model and Dataset for Understanding Hanja Documents of
Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models further trained on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z) - TuGeBiC: A Turkish German Bilingual Code-Switching Corpus [0.0]
We describe the process of collection, transcription, and annotation of recordings of spontaneous speech samples from Turkish-German bilinguals.
The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms.
The resulting corpus has been made freely available to the research community.
arXiv Detail & Related papers (2022-05-02T12:53:05Z) - Mukayese: Turkish NLP Strikes Back [0.19116784879310023]
We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications.
We present Mukayese, a set of NLP benchmarks for the Turkish language.
We present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
arXiv Detail & Related papers (2022-03-02T16:18:44Z) - When is Wall a Pared and when a Muro? -- Extracting Rules Governing
Lexical Selection [85.0262994506624]
We present a method for automatically identifying fine-grained lexical distinctions.
We extract concise descriptions explaining these distinctions in a human- and machine-readable format.
We use these descriptions to teach non-native speakers when to translate a given ambiguous word into its different possible translations.
arXiv Detail & Related papers (2021-09-13T14:49:00Z) - Lexical semantic change for Ancient Greek and Latin [61.69697586178796]
Associating a word's correct meaning in its historical context is a central challenge in diachronic research.
We build on a recent computational approach to semantic change based on a dynamic Bayesian mixture model.
We provide a systematic comparison of dynamic Bayesian mixture models for semantic change with state-of-the-art embedding-based models.
arXiv Detail & Related papers (2021-01-22T12:04:08Z) - Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive [0.0]
Our study utilizes deep learning methods for the automated transcription of periodicals written in Arabic script Ottoman Turkish (OT) using the Transkribus platform.
We discuss the historical situation of OT text collections and how they were excluded for the most part from the late twentieth century corpora digitization.
This exclusion has two basic reasons: the technical challenges of OCR for Arabic script languages, and the rapid abandonment of that very script in the Turkish historical context.
arXiv Detail & Related papers (2020-11-02T17:28:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.