Towards Responsible Natural Language Annotation for the Varieties of
Arabic
- URL: http://arxiv.org/abs/2203.09597v1
- Date: Thu, 17 Mar 2022 20:23:27 GMT
- Authors: A. Stevie Bergman, Mona T. Diab
- Abstract summary: We present a playbook for responsible dataset creation for polyglossic, multidialectal languages.
This work is informed by a study on Arabic annotation of social media content.
- Score: 12.526184907781731
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: When building NLP models, there is a tendency to aim for broader coverage,
often overlooking cultural and (socio)linguistic nuance. In this position
paper, we make the case for care and attention to such nuances, particularly in
dataset annotation, as well as the inclusion of cultural and linguistic
expertise in the process. We present a playbook for responsible dataset
creation for polyglossic, multidialectal languages. This work is informed by a
study on Arabic annotation of social media content.
Related papers
- AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.
AraDiCE is the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
We will release the dialectal translation models and benchmarks curated in this study.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
- Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP).
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild".
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z)
- CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models [59.22460740026037]
The "CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs).
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv Detail & Related papers (2024-05-22T20:19:10Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Zero-shot Cross-Linguistic Learning of Event Semantics [27.997873309702225]
We look at captions of images across Arabic, Chinese, Farsi, German, Russian, and Turkish.
We show that lexical aspects can be predicted for a given language despite not having observed any annotated data for this language at all.
arXiv Detail & Related papers (2022-07-05T23:18:36Z)