Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language
- URL: http://arxiv.org/abs/2306.14866v1
- Date: Mon, 26 Jun 2023 17:27:31 GMT
- Title: Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language
- Authors: Arij Riabi, Menel Mahamdi, Djamé Seddah
- Abstract summary: NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we address the scarcity of annotated data for NArabizi, a
Romanized form of North African Arabic used mostly on social media, which poses
challenges for Natural Language Processing (NLP). We introduce an enriched
version of NArabizi Treebank (Seddah et al., 2020) with three main
contributions: the addition of two novel annotation layers (named entity
recognition and offensive language detection) and a re-annotation of the
tokenization, morpho-syntactic, and syntactic layers, which ensures annotation
consistency. Our experimental results, using different tokenization schemes,
showcase the value of our contributions and highlight the impact of working
with non-gold tokenization for NER and dependency parsing. To facilitate future
research, we make these annotations publicly available. Our enhanced NArabizi
Treebank paves the way for creating sophisticated language models and NLP tools
for this under-represented language.
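The abstract's point about non-gold tokenization can be made concrete with a small sketch. The following minimal Python example is illustrative only (the NArabizi-style tokens, entity spans, and labels are invented; this is not the treebank's evaluation code): when a system's tokenizer segments text differently from the gold annotation, token-index entity spans stop matching even though the underlying characters are identical.

```python
# Hypothetical example: gold segmentation keeps a clitic attached,
# while the system tokenizer splits it off.
gold_tokens = ["hamdoulah", "rani", "fi", "lalger"]
pred_tokens = ["hamdoulah", "rani", "fi", "l", "alger"]

# Entities as half-open token-index spans with a label (invented).
gold_entities = {(3, 4, "LOC")}   # "lalger"
pred_entities = {(3, 5, "LOC")}   # "l" + "alger"

# Token-level exact match: the spans differ, so the entity counts as wrong
# even though both annotations cover the same underlying characters.
tp = len(gold_entities & pred_entities)
precision = tp / len(pred_entities)
recall = tp / len(gold_entities)
print(precision, recall)  # 0.0 0.0

def to_char_spans(tokens, entities):
    """Re-express token-index entity spans as character offsets over the
    concatenated (space-free) text, so differing segmentations align."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return {(offsets[s][0], offsets[e - 1][1], lab) for s, e, lab in entities}

# After projecting to character offsets, the two annotations agree.
print(to_char_spans(gold_tokens, gold_entities) ==
      to_char_spans(pred_tokens, pred_entities))  # True
```

Projecting spans onto character offsets of the raw text is one common way to compare annotations across segmentations; it is why evaluation with non-gold tokenization, as studied in the paper, requires an explicit alignment step rather than naive token-index comparison.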
Related papers
- Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP).
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild".
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z)
- Specifying Genericity through Inclusiveness and Abstractness Continuous Scales [1.024113475677323]
This paper introduces a novel annotation framework for the fine-grained modeling of Noun Phrases' (NPs) genericity in natural language.
The framework is designed to be simple and intuitive, making it accessible to non-expert annotators and suitable for crowd-sourced tasks.
arXiv Detail & Related papers (2024-03-22T15:21:07Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Automatic Readability Assessment for Closely Related Languages [6.233117407988574]
This work focuses on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting.
We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models.
Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models.
arXiv Detail & Related papers (2023-05-22T20:42:53Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge and supervised-based Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from rich-sourced languages to poorer ones.
Evaluations of SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi [11.042037758273226]
We use language models to attempt automatic labelling of SNACS supersenses in Hindi.
We look towards upstream applications in semantic role labelling and extension to related languages such as Gujarati.
arXiv Detail & Related papers (2022-05-08T21:13:33Z)
- Towards Responsible Natural Language Annotation for the Varieties of Arabic [12.526184907781731]
We present a playbook for responsible dataset creation for polyglossic, multidialectal languages.
This work is informed by a study on Arabic annotation of social media content.
arXiv Detail & Related papers (2022-03-17T20:23:27Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.