Exploratory Arabic Offensive Language Dataset Analysis
- URL: http://arxiv.org/abs/2101.11434v1
- Date: Wed, 20 Jan 2021 23:45:33 GMT
- Title: Exploratory Arabic Offensive Language Dataset Analysis
- Authors: Fatemah Husain and Ozlem Uzuner
- Abstract summary: This paper adds more insights into resources and datasets used in Arabic offensive language research.
The main goal of this paper is to guide researchers in Arabic offensive language in selecting appropriate datasets based on their content.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper adds more insights into resources and datasets used in Arabic
offensive language research. The main goal of this paper is to guide
researchers in Arabic offensive language in selecting appropriate datasets
based on their content, and in creating new Arabic offensive language resources
to support and complement the available ones.
 
      
Related papers
- EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian [60.61343989805093]
 EmoBench-UA is the first annotated dataset for emotion detection in Ukrainian texts.
Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian.
 arXiv (2025-05-29T09:49:57Z)
- WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
 The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages.
We have developed a systematic data processing framework tailored for low-resource languages.
 arXiv (2025-01-24T14:06:29Z)
- A Survey of Large Language Models for Arabic Language and its Dialects [0.0]
 This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects.
It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training.
The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks.
 arXiv (2024-10-26T17:48:20Z)
- Recent Advancements and Challenges of Turkic Central Asian Language Processing [4.189204855014775]
 Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
 arXiv (2024-07-06T08:58:26Z)
- Open the Data! Chuvash Datasets [50.59120569845975]
 We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
 arXiv (2024-05-31T07:51:19Z)
- 101 Billion Arabic Words Dataset [0.0]
 This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models.
We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files.
The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset.
 arXiv (2024-04-29T13:15:03Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
 We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
 arXiv (2024-04-26T11:46:05Z)
- ArabicaQA: A Comprehensive Dataset for Arabic Question Answering [13.65056111661002]
 We introduce ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic.
We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus.
 arXiv (2024-03-26T16:37:54Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
 Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
 arXiv (2024-02-09T18:51:49Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
 State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
 arXiv (2024-01-11T03:04:38Z)
- Toxic language detection: a systematic review of Arabic datasets [5.945303394300328]
 This paper offers a comprehensive survey of Arabic datasets focused on online toxic language.
We systematically gathered a total of 54 available datasets and their corresponding papers.
For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository.
 arXiv (2023-12-12T12:43:01Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
 The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
 arXiv (2023-09-21T13:20:13Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
 We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
 arXiv (2023-09-19T14:42:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     