Universal Dependency Treebank for Odia Language
- URL: http://arxiv.org/abs/2205.11976v1
- Date: Tue, 24 May 2022 11:19:26 GMT
- Title: Universal Dependency Treebank for Odia Language
- Authors: Shantipriya Parida, Kalyanamalini Sahoo, Atul Kr. Ojha, Saraswati Sahoo, Satya Ranjan Dash, Bijayalaxmi Dash
- Abstract summary: This paper presents the first publicly available treebank of Odia, a morphologically rich low resource Indian language.
The treebank contains approx. 1082 tokens (100 sentences) in Odia selected from "Samantar", the largest available parallel corpora collection for Indic languages.
The morphological analysis of the Odia treebank was performed using machine learning techniques.
- Score: 0.24466725954625887
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents the first publicly available treebank of Odia, a
morphologically rich low resource Indian language. The treebank contains
approx. 1082 tokens (100 sentences) in Odia selected from "Samantar", the
largest available parallel corpora collection for Indic languages. All the
selected sentences are manually annotated following the "Universal Dependency
(UD)" guidelines. The morphological analysis of the Odia treebank was performed
using machine learning techniques. The Odia annotated treebank will enrich the
Odia language resource and will help in building language technology tools for
cross-lingual learning and typological research. We also build a preliminary
Odia parser using a machine learning approach. The parser scores 86.6% on
tokenization, 64.1% on UPOS, 63.78% on XPOS, 42.04% UAS and 21.34% LAS.
Finally, the paper briefly discusses the linguistic analysis of the Odia UD
treebank.
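The UAS and LAS figures quoted in the abstract are attachment scores: UAS counts tokens whose predicted head is correct, and LAS counts tokens whose head and dependency label are both correct. The sketch below illustrates that arithmetic on a made-up four-token sentence. It is not the authors' evaluation code; the function name `attachment_scores` and the toy arcs are assumptions for illustration, and it assumes gold and predicted tokenization are identical, which a full evaluation pipeline cannot take for granted.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
# A head index of 0 marks the root of the sentence.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for one tokenized sentence.

    Assumes gold and pred cover exactly the same tokens in the same order.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")

    # UAS: the head index matches, the label is ignored.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head index and the label match.
    full_hits = sum(g == p for g, p in zip(gold, pred))

    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Hypothetical toy sentence (not taken from the Odia treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")
```

In the toy sentence, three of four heads are correct and two of four arcs are fully correct, so LAS can never exceed UAS. When tokenization itself is imperfect, as with the 86.6% reported here, predicted and gold tokens have to be aligned before arcs can be compared at all.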
Related papers
- LuxBank: The First Universal Dependency Treebank for Luxembourgish [0.38447712214412116]
Luxembourgish is a West Germanic language spoken by approximately 400,000 people.
We introduce LuxBank, the first Universal Dependencies (UD) Treebank for Luxembourgish.
arXiv Detail & Related papers (2024-11-07T15:50:40Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- GlobalBench: A Benchmark for Global Progress in Natural Language Processing [114.24519009839142]
GlobalBench aims to track progress on all NLP datasets in all languages.
It tracks estimated per-speaker utility and equity of technology across all languages.
Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
arXiv Detail & Related papers (2023-05-24T04:36:32Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Building an Endangered Language Resource in the Classroom: Universal Dependencies for Kakataibo [0.8938910048099864]
We launch a new Universal Dependencies treebank for an endangered language from Amazonia: Kakataibo, a Panoan language spoken in Peru.
We first discuss the collaborative methodology implemented, which proved effective to create a treebank in the context of a Computational Linguistic course for undergraduates.
arXiv Detail & Related papers (2022-06-21T12:58:56Z)
- Developing Universal Dependency Treebanks for Magahi and Braj [0.7349727826230861]
In this paper, we discuss the development of treebanks for two low-resourced Indian languages - Magahi and Braj.
The Magahi treebank contains 945 sentences and Braj treebank around 500 sentences marked with their lemmas, part-of-speech, morphological features and universal dependencies.
arXiv Detail & Related papers (2022-04-26T23:43:41Z)
- Apurinã Universal Dependencies Treebank [0.4893345190925178]
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language.
The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features.
arXiv Detail & Related papers (2021-06-07T07:42:00Z)
- The Persian Dependency Treebank Made Universal [3.4410212782758047]
This treebank contains 29107 sentences.
Our data is more compatible with Universal Dependencies than the Persian Universal Dependency Treebank (Seraji et al., 2016).
Our delexicalized Persian-to-English transfer experiments show that a parsing model trained on our data is 2% more accurate than that of Seraji et al.
arXiv Detail & Related papers (2020-09-21T22:34:13Z)
- Stanza: A Python Natural Language Processing Toolkit for Many Human Languages [44.8226642800919]
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.
Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging.
We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora.
arXiv Detail & Related papers (2020-03-16T09:05:53Z)