Building Odia Shallow Parser
- URL: http://arxiv.org/abs/2204.08960v1
- Date: Tue, 19 Apr 2022 15:58:30 GMT
- Title: Building Odia Shallow Parser
- Authors: Pruthwik Mishra and Dipti Misra Sharma
- Abstract summary: Many Indian languages are resource poor with respect to the availability of corpora in general.
This paper is an attempt to create quality annotated corpora for shallow parsers.
The contribution of this paper is twofold: the creation of POS- and chunk-annotated corpora for Odia and the development of baseline systems for POS tagging and chunking in Odia.
- Score: 9.772106698388138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Shallow parsing is an essential task for many NLP applications like machine
translation, summarization, sentiment analysis, aspect identification and many
more. Quality annotated corpora are critical for building accurate shallow
parsers. Many Indian languages are resource poor with respect to the
availability of corpora in general. This paper is therefore an attempt to
create quality corpora for shallow parsers. The contribution of this paper is
twofold: the creation of POS- and chunk-annotated corpora for Odia and the
development of baseline systems for POS tagging and chunking in Odia.
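To make the two annotation layers concrete, here is a minimal sketch of how POS tags and chunk tags sit side by side on each token, and how chunk tags are grouped into phrases. It assumes the common BIO chunk encoding (B- begins a chunk, I- continues it, O is outside any chunk). The English sentence, the tag names, and the function `bio_to_chunks` are all hypothetical illustrations; they are not the paper's tagset, data, or baseline system.

```python
from typing import List, Optional, Tuple

# Each token carries two annotation layers: (word, POS tag, BIO chunk tag).
# Hypothetical English example with illustrative tags, not the paper's Odia data.
Token = Tuple[str, str, str]
Chunk = Tuple[str, List[str]]


def bio_to_chunks(tagged: List[Token]) -> List[Chunk]:
    """Group BIO-tagged tokens into (chunk_label, words) pairs."""
    chunks: List[Chunk] = []
    current: Optional[Chunk] = None  # chunk currently being built, if any
    for word, _pos, tag in tagged:
        # An I- tag with the same label extends the open chunk.
        if tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(word)
            continue
        # Anything else closes the open chunk first.
        if current is not None:
            chunks.append(current)
            current = None
        # B- starts a new chunk; a stray I- is treated as a chunk start too.
        if tag.startswith(("B-", "I-")):
            current = (tag[2:], [word])
        # An O tag opens nothing.
    if current is not None:
        chunks.append(current)
    return chunks


sentence: List[Token] = [
    ("The", "DET", "B-NP"),
    ("small", "ADJ", "I-NP"),
    ("parser", "NOUN", "I-NP"),
    ("tags", "VERB", "B-VP"),
    ("every", "DET", "B-NP"),
    ("word", "NOUN", "I-NP"),
    (".", "PUNCT", "O"),
]

for label, words in bio_to_chunks(sentence):
    print(label, " ".join(words))
```

A shallow parser predicts the second and third columns for unseen text; the grouping step above only shows how the chunk layer is read once it has been predicted.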
Related papers
- Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora [1.0995326465245927]
We show that there are significant quality differences between different portions of web-mined corpora.
We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.
arXiv Detail & Related papers (2024-02-12T07:03:14Z)
- What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Universal Dependency Treebank for Odia Language [0.24466725954625887]
This paper presents the first publicly available treebank of Odia, a morphologically rich low resource Indian language.
The treebank contains approx. 1082 tokens (100 sentences) in Odia selected from "Samantar", the largest available parallel corpora collection for Indic languages.
The morphological analysis of the Odia treebank was performed using machine learning techniques.
arXiv Detail & Related papers (2022-05-24T11:19:26Z)
- Survey of Aspect-based Sentiment Analysis Datasets [55.61047894397937]
Aspect-based sentiment analysis (ABSA) is a natural language processing problem that requires analyzing user-generated reviews.
Numerous yet scattered corpora for ABSA make it difficult for researchers to identify corpora best suited for a specific ABSA subtask quickly.
This study aims to present a database of corpora that can be used to train and assess autonomous ABSA systems.
arXiv Detail & Related papers (2022-04-11T16:23:36Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- A Survey of Unsupervised Dependency Parsing [62.16714720135358]
Unsupervised dependency parsing aims to learn a dependency parser from sentences that have no annotation of their correct parse trees.
Despite its difficulty, unsupervised parsing is an interesting research direction because of its capability of utilizing almost unlimited unannotated text data.
arXiv Detail & Related papers (2020-10-04T10:51:22Z)
- AMALGUM -- A Free, Balanced, Multilayer English Web Corpus [14.073494095236027]
We present a genre-balanced English web corpus totaling 4M tokens.
By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets.
arXiv Detail & Related papers (2020-06-18T17:05:45Z)
- Know thy corpus! Robust methods for digital curation of Web corpora [0.0]
This paper proposes a novel framework for digital curation of Web corpora.
It provides robust estimation of their parameters, such as their composition and the lexicon.
arXiv Detail & Related papers (2020-03-13T17:21:57Z)