Cross-lingual Offensive Language Identification for Low Resource
Languages: The Case of Marathi
- URL: http://arxiv.org/abs/2109.03552v1
- Date: Wed, 8 Sep 2021 11:29:44 GMT
- Title: Cross-lingual Offensive Language Identification for Low Resource
Languages: The Case of Marathi
- Authors: Saurabh Gaikwad, Tharindu Ranasinghe, Marcos Zampieri, Christopher M.
Homan
- Abstract summary: MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-short and other transfer learning experiments on state-of-the-art cross-lingual transformers.
- Score: 2.4737119633827174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread presence of offensive language on social media motivated the
development of systems capable of recognizing such content automatically. Apart
from a few notable exceptions, most research on automatic offensive language
identification has dealt with English. To address this shortcoming, we
introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first
dataset of its kind compiled for Marathi, thus opening a new domain for
research in low-resource Indo-Aryan languages. We present results from several
machine learning experiments on this dataset, including zero-short and other
transfer learning experiments on state-of-the-art cross-lingual transformers
from existing data in Bengali, English, and Hindi.
Related papers
- A multilingual dataset for offensive language and hate speech detection for hausa, yoruba and igbo languages [0.0]
This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo.
We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers.
We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%.
arXiv Detail & Related papers (2024-06-04T09:58:29Z) - Offensive Language Identification in Transliterated and Code-Mixed
Bangla [29.30985521838655]
In this paper, we explore offensive language identification in texts with transliterations and code-mixing.
We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments.
We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset.
arXiv Detail & Related papers (2023-11-25T13:27:22Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Breaking Language Barriers: A Question Answering Dataset for Hindi and
Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - Transfer to a Low-Resource Language via Close Relatives: The Case Study
on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z) - SOLD: Sinhala Offensive Language Dataset [11.63228876521012]
This paper tackles offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka.
SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level.
We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
arXiv Detail & Related papers (2022-12-01T20:18:21Z) - Predicting the Type and Target of Offensive Social Media Posts in
Marathi [12.454406287184064]
We introduce the Marathi Offensive Language dataset v.2.0 or MOLD 2.0.
MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi.
We also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.
arXiv Detail & Related papers (2022-11-22T20:36:44Z) - Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z) - Multilingual and code-switching ASR challenges for low resource Indian
languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.