For the Purpose of Curry: A UD Treebank for Ashokan Prakrit
- URL: http://arxiv.org/abs/2111.12783v1
- Date: Wed, 24 Nov 2021 20:30:09 GMT
- Title: For the Purpose of Curry: A UD Treebank for Ashokan Prakrit
- Authors: Adam Farris, Aryaman Arora
- Abstract summary: We present the first linguistically annotated treebank of Ashokan Prakrit.
This is an early Middle Indo-Aryan dialect continuum attested through Emperor Ashoka Maurya's 3rd century BCE rock and pillar edicts.
- Score: 2.538209532048867
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the first linguistically annotated treebank of Ashokan Prakrit, an
early Middle Indo-Aryan dialect continuum attested through Emperor Ashoka
Maurya's 3rd century BCE rock and pillar edicts. For annotation, we used the
multilingual Universal Dependencies (UD) formalism, following recent UD work on
Sanskrit and other Indo-Aryan languages. We touch on some interesting
linguistic features that posed issues in annotation: regnal names and other
nominal compounds, "proto-ergative" participial constructions, and possible
grammaticalizations evidenced by sandhi (phonological assimilation across
morpheme boundaries). Eventually, we plan for a complete annotation of all
attested Ashokan texts, towards the larger goals of improving UD coverage of
different diachronic stages of Indo-Aryan and studying language change in
Indo-Aryan using computational methods.
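UD annotation of the kind described above is conventionally serialized in the CoNLL-U format, one token per line with tab-separated columns for the token id, form, lemma, part of speech, syntactic head, and dependency relation. As a rough sketch of what such an annotation looks like, the snippet below parses a tiny hand-written CoNLL-U fragment; the example sentence is a hypothetical English one, not drawn from the Ashokan treebank itself.

```python
# Minimal CoNLL-U parsing sketch. The fragment below is a hypothetical
# English example for illustration, not data from the Ashokan treebank.
CONLLU = (
    "1\tthe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tking\tking\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tspeaks\tspeak\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def parse_conllu(block: str):
    """Return one (id, form, lemma, upos, head, deprel) tuple per token."""
    rows = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        rows.append((int(cols[0]), cols[1], cols[2], cols[3],
                     int(cols[6]), cols[7]))
    return rows

tokens = parse_conllu(CONLLU)
# The token whose head is 0 is the root of the dependency tree.
root = [t for t in tokens if t[4] == 0][0]
print(root[1], root[5])  # → speaks root
```

Each token's HEAD column points at the id of its syntactic governor, so the whole sentence forms a tree rooted at the token with head 0.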
Related papers
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present MaiBaam, the first multi-dialect Bavarian treebank, manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building Multilingual Word Sense Disambiguation (MWSD) systems that combine knowledge-based and supervised approaches.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from resource-rich languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi [11.042037758273226]
We use language models to attempt automatic labelling of SNACS supersenses in Hindi.
We look towards upstream applications in semantic role labelling and extension to related languages such as Gujarati.
arXiv Detail & Related papers (2022-05-08T21:13:33Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
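The abstract above does not spell out the scoring function, but a common baseline for embedding-based cognate detection is to compare word pairs by cosine similarity in a shared cross-lingual vector space and apply a threshold. The sketch below illustrates that idea with tiny made-up 3-dimensional vectors; the words, vectors, and threshold are all hypothetical, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "cross-lingual" embeddings for illustration only: (language, word)
# pairs mapped into a shared 3-d space.
emb = {
    ("hi", "haath"): [0.90, 0.10, 0.00],
    ("gu", "haath"): [0.88, 0.12, 0.05],
    ("hi", "pani"):  [0.00, 0.90, 0.30],
}

def is_cognate(w1, w2, threshold=0.8):
    """Label a word pair a cognate candidate if its embeddings are close."""
    return cosine(emb[w1], emb[w2]) >= threshold

print(is_cognate(("hi", "haath"), ("gu", "haath")))  # → True
print(is_cognate(("hi", "haath"), ("hi", "pani")))   # → False
```

In practice the threshold would be tuned on held-out labelled pairs, and the embeddings would come from a trained cross-lingual model rather than hand-set values.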
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages [34.79533646549939]
We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning.
Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning.
arXiv Detail & Related papers (2021-09-22T06:37:39Z)
- Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights [25.666767669695044]
We release a 78-hour ASR dataset for Sanskrit, which faithfully captures several of the linguistic characteristics expressed by the language.
We propose a new modelling unit, inspired by syllable-level unit selection, that captures character sequences from one vowel in the word to the next vowel.
We extend these insights from Sanskrit ASR to build ASR systems in two other Indic languages, Gujarati and Telugu.
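The vowel-to-vowel unit described above can be sketched concretely. One plausible reading, assumed here for illustration, is that each unit runs from a vowel through the intervening consonants up to and including the next vowel, with leading and trailing consonants attached to the edge units; the Latin-script vowel set below is a stand-in, since the paper's unit is defined over Indic scripts.

```python
VOWELS = set("aeiou")  # illustrative Latin-script vowel set; the paper's
                       # unit operates over Indic-script text

def vowel_units(word: str):
    """Split a word into spans running from one vowel to the next vowel
    (one plausible reading of the modelling unit described above)."""
    v = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(v) < 2:
        return [word]  # fewer than two vowels: the whole word is one unit
    # Each unit spans vowel i through vowel i+1 (vowels shared at edges).
    units = [word[a:b + 1] for a, b in zip(v, v[1:])]
    units[0] = word[:v[0]] + units[0]          # attach leading consonants
    units[-1] = units[-1] + word[v[-1] + 1:]   # attach trailing consonants
    return units

print(vowel_units("namaste"))  # → ['nama', 'aste']
```

This yields units that always begin and end in a vowel (aside from word-edge consonants), which is the property the syllable-inspired unit is meant to capture.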
arXiv Detail & Related papers (2021-06-02T18:06:32Z)
- Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR [11.363966269198064]
We design a large multilingual end-to-end ASR system using a self-attention-based Conformer architecture.
We train the system on Arabic (Ar), English (En), and French (Fr).
Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.
arXiv Detail & Related papers (2021-05-31T08:20:38Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.