Related papers: Building Odia Shallow Parser

URL: http://arxiv.org/abs/2204.08960v1
Date: Tue, 19 Apr 2022 15:58:30 GMT
Title: Building Odia Shallow Parser
Authors: Pruthwik Mishra and Dipti Misra Sharma
Abstract summary: Many Indian languages are resource poor with respect to the availability of corpora in general. This paper is an attempt towards creating quality annotated corpora for shallow parsers. The contribution of this paper is twofold: the creation of POS and chunk corpora for Odia, and the development of baseline systems for POS tagging and chunking in Odia.
Score: 9.772106698388138
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Shallow parsing is an essential task for many NLP applications such as machine translation, summarization, sentiment analysis, aspect identification and many more. Quality annotated corpora are critical for building accurate shallow parsers. Many Indian languages are resource poor with respect to the availability of corpora in general. This paper is therefore an attempt towards creating quality corpora for shallow parsers. The contribution of this paper is twofold: the creation of POS- and chunk-annotated corpora for Odia, and the development of baseline systems for POS tagging and chunking in Odia.

Related papers

Urdu Dependency Parsing and Treebank Development: A Syntactic and Morphological Perspective [0.0]
We use dependency parsing to analyze news articles in Urdu. We achieve a best labeled accuracy (LA) of 70% and an unlabeled attachment score (UAS) of 84%.
arXiv Detail & Related papers (2024-06-13T19:30:32Z)
What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al. We investigate the similarities and differences between the discourse structures of source and target languages. We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
Universal Dependency Treebank for Odia Language [0.24466725954625887]
This paper presents the first publicly available treebank of Odia, a morphologically rich low resource Indian language. The treebank contains approx. 1082 tokens (100 sentences) in Odia selected from "Samantar", the largest available parallel corpora collection for Indic languages. The morphological analysis of the Odia treebank was performed using machine learning techniques.
arXiv Detail & Related papers (2022-05-24T11:19:26Z)
Survey of Aspect-based Sentiment Analysis Datasets [55.61047894397937]
Aspect-based sentiment analysis (ABSA) is a natural language processing problem that requires analyzing user-generated reviews. Numerous yet scattered corpora for ABSA make it difficult for researchers to identify corpora best suited for a specific ABSA subtask quickly. This study aims to present a database of corpora that can be used to train and assess autonomous ABSA systems.
arXiv Detail & Related papers (2022-04-11T16:23:36Z)
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
A Survey of Unsupervised Dependency Parsing [62.16714720135358]
Unsupervised dependency parsing aims to learn a dependency parser from sentences that have no annotation of their correct parse trees. Despite its difficulty, unsupervised parsing is an interesting research direction because of its capability of utilizing almost unlimited unannotated text data.
arXiv Detail & Related papers (2020-10-04T10:51:22Z)
AMALGUM -- A Free, Balanced, Multilayer English Web Corpus [14.073494095236027]
We present a genre-balanced English web corpus totaling 4M tokens. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets.
arXiv Detail & Related papers (2020-06-18T17:05:45Z)
Know thy corpus! Robust methods for digital curation of Web corpora [0.0]
This paper proposes a novel framework for digital curation of Web corpora. It provides robust estimation of their parameters, such as their composition and the lexicon.
arXiv Detail & Related papers (2020-03-13T17:21:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.