PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin
- URL: http://arxiv.org/abs/2404.16341v1
- Date: Thu, 25 Apr 2024 05:33:47 GMT
- Title: PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin
- Authors: Stephen Bothwell, Brian DuSell, David Chiang, Brian Krostenko
- Abstract summary: We introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin.
We present baseline results for PILA on a pair of traditional computational historical linguistics tasks.
We demonstrate PILA's capability for enhancing other historical-linguistic datasets.
- Score: 11.820097994590672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computational historical linguistics seeks to systematically understand processes of sound change, including during periods for which little to no formal record of a language is attested. At the same time, few computational resources exist that deeply explore phonological and morphological connections between proto-languages and their descendants. This is particularly true for the family of Italic languages. To assist historical linguists in the study of Italic sound change, we introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin. We provide a detailed description of how our dataset was created and organized. Then, we exhibit PILA's value in two ways. First, we present baseline results for PILA on a pair of traditional computational historical linguistics tasks. Second, we demonstrate PILA's capability for enhancing other historical-linguistic datasets through a dataset compatibility study.
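A minimal sketch of how proto-form/reflex pairs like PILA's might be represented and loaded. The column names, file layout, and the two example pairs are illustrative assumptions for this sketch, not PILA's actual release format or contents:

```python
import csv
import io

# Hypothetical TSV layout with one Proto-Italic/Latin pair per row;
# the real PILA release may organize its fields differently.
raw = """proto_italic\tlatin
*ekwos\tequus
*dekem\tdecem
"""

pairs = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# A trivial "identity" baseline for reflex prediction: guess that the
# Latin form equals the proto-form with the reconstruction asterisk removed.
baseline = sum(
    row["proto_italic"].lstrip("*") == row["latin"] for row in pairs
) / len(pairs)
print(f"pairs: {len(pairs)}, identity-baseline accuracy: {baseline:.2f}")
```

Such an identity baseline gives a floor against which learned sound-change models (e.g. the paper's baseline systems) can be compared.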
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis [4.396860522241306]
We introduce a curated dataset of the Greek Parliament Proceedings that extends chronologically from 1989 up to 2020.
It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files.
arXiv Detail & Related papers (2022-10-23T23:23:28Z)
- Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech [1.9688095374610102]
We present a mapping of ARPABET/pinyin to SAMPA/SAMPA-SC and then to phonological features.
This mapping was tested for whether it could lead to the successful generation of native, non-native, and code-switched speech in the two languages.
arXiv Detail & Related papers (2022-04-14T21:04:55Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Applying Phonological Features in Multilingual Text-To-Speech [2.567123525861164]
We present a mapping of ARPABET/pinyin to SAMPA/SAMPA-SC and then to phonological features.
We tested whether this mapping could lead to the successful generation of native, non-native, and code-switched speech in the two languages.
arXiv Detail & Related papers (2021-10-07T16:37:01Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
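The bitext-retrieval measure described above can be sketched as nearest-neighbor search under cosine similarity. The toy embeddings, dimensions, and noise level below are arbitrary assumptions for illustration, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sentence embeddings for 5 aligned sentence pairs.
src = rng.normal(size=(5, 8))
tgt = src + 0.01 * rng.normal(size=(5, 8))  # near-copies simulate a well-aligned space

def retrieval_accuracy(src, tgt):
    # Cosine similarity between every source and target sentence.
    s = src / np.linalg.norm(src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = s @ t.T
    # A pair counts as retrieved when its aligned target is the nearest neighbor.
    return float((sim.argmax(axis=1) == np.arange(len(src))).mean())

print(retrieval_accuracy(src, tgt))
```

Higher retrieval accuracy indicates that representations for the two languages occupy more closely aligned regions of the shared space.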
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
- Deciphering Undersegmented Ancient Scripts Using Phonetic Prior [31.707254394215283]
Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges.
We propose a model that handles both of these challenges by building on rich linguistic constraints.
We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian).
arXiv Detail & Related papers (2020-10-21T15:03:52Z)
- In search of isoglosses: continuous and discrete language embeddings in Slavic historical phonology [0.0]
We employ three different types of language embedding (dense, sigmoid, and straight-through).
We find that the Straight-Through model outperforms the other two in terms of accuracy, but the Sigmoid model's language embeddings show the strongest agreement with the traditional subgrouping of the Slavic languages.
arXiv Detail & Related papers (2020-05-27T18:10:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences arising from its use.