Predicting the Type and Target of Offensive Social Media Posts in
Marathi
- URL: http://arxiv.org/abs/2211.12570v1
- Date: Tue, 22 Nov 2022 20:36:44 GMT
- Authors: Marcos Zampieri, Tharindu Ranasinghe, Mrinal Chaudhari, Saurabh
Gaikwad, Prajwal Krishna, Mayuresh Nene, Shrunali Paygude
- Abstract summary: We introduce the Marathi Offensive Language dataset v.2.0 or MOLD 2.0.
MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi.
We also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The presence of offensive language on social media is very common, motivating
platforms to invest in strategies to make communities safer. This includes
developing robust machine learning systems capable of recognizing offensive
content online. Apart from a few notable exceptions, most research on automatic
offensive language identification has dealt with English and a few other
high-resource languages such as French, German, and Spanish. In this paper we
address this gap by tackling offensive language identification in Marathi, a
low-resource Indo-Aryan language spoken in India. We introduce the Marathi
Offensive Language Dataset v.2.0 or MOLD 2.0 and present multiple experiments
on this dataset. MOLD 2.0 is a much larger version of MOLD with annotation
expanded to levels B (type) and C (target) of the popular OLID taxonomy.
MOLD 2.0 is the first hierarchical offensive language dataset compiled for
Marathi, thus opening new avenues for research in low-resource Indo-Aryan
languages. Finally, we also introduce SeMOLD, a larger dataset annotated
following the semi-supervised methods presented in SOLID.
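The hierarchical annotation mentioned in the abstract can be illustrated with a minimal sketch. The tag strings below follow the OLID taxonomy as commonly described (level A: offensive or not; level B: targeted or untargeted; level C: individual, group, or other target). Whether MOLD 2.0 uses exactly these strings is an assumption, and the names `OlidLabel` and `is_consistent` are hypothetical helpers, not part of any released code.

```python
from dataclasses import dataclass
from typing import Optional

# Tag inventories following the OLID taxonomy (assumed to carry over to MOLD 2.0).
LEVEL_A = ("OFF", "NOT")         # offensive vs. not offensive
LEVEL_B = ("TIN", "UNT")         # targeted vs. untargeted (type)
LEVEL_C = ("IND", "GRP", "OTH")  # individual, group, other (target)


@dataclass(frozen=True)
class OlidLabel:
    """Hypothetical container for one post's three-level annotation."""
    level_a: str
    level_b: Optional[str] = None
    level_c: Optional[str] = None


def is_consistent(label: OlidLabel) -> bool:
    """Check that lower levels are present only where the hierarchy allows."""
    if label.level_a not in LEVEL_A:
        return False
    if label.level_a == "NOT":
        # Non-offensive posts carry no type or target annotation.
        return label.level_b is None and label.level_c is None
    if label.level_b not in LEVEL_B:
        return False
    if label.level_b == "UNT":
        # Untargeted offensive posts carry no target annotation.
        return label.level_c is None
    return label.level_c in LEVEL_C


if __name__ == "__main__":
    print(is_consistent(OlidLabel("OFF", "TIN", "GRP")))
    print(is_consistent(OlidLabel("NOT", "TIN")))
```

The check encodes the dependency between levels: level B applies only to offensive posts, and level C applies only to targeted ones.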
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Breaking Language Barriers: A Question Answering Dataset for Hindi and
Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z)
- Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions, the product of increased interest from the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- MMT: A Multilingual and Multi-Topic Indian Social Media Dataset [1.0413233169366503]
Social media plays a significant role in cross-cultural communication.
A vast amount of this occurs in code-mixed and multilingual form.
We introduce a large-scale multilingual and multi-topic dataset (MMT) collected from Twitter.
arXiv Detail & Related papers (2023-04-02T21:39:00Z)
- SOLD: Sinhala Offensive Language Dataset [11.63228876521012]
This paper tackles offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka.
SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level.
We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
arXiv Detail & Related papers (2022-12-01T20:18:21Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Cross-lingual Offensive Language Identification for Low Resource
Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers.
arXiv Detail & Related papers (2021-09-08T11:29:44Z)
- CUSATNLP@HASOC-Dravidian-CodeMix-FIRE2020:Identifying Offensive Language
from ManglishTweets [0.0]
We present a working model submitted for Task 2 of the sub-track HASOC Offensive Language Identification - DravidianCodeMix.
It is a message level classification task.
In our approach, an embedding model-based classifier identifies offensive and non-offensive comments.
arXiv Detail & Related papers (2020-10-17T10:11:41Z)