Multilingual Abusiveness Identification on Code-Mixed Social Media Text
- URL: http://arxiv.org/abs/2204.01848v1
- Date: Tue, 1 Mar 2022 12:23:25 GMT
- Title: Multilingual Abusiveness Identification on Code-Mixed Social Media Text
- Authors: Ekagra Ranjan, Naman Poddar
- Abstract summary: We propose an approach for abusiveness identification on the multilingual Moj dataset, which comprises Indic languages.
Our approach tackles the common challenges of non-English social media content and can be extended to other languages as well.
- Score: 1.8275108630751844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social Media platforms have been seeing adoption and growth in their usage
over time. This growth has been further accelerated with the lockdown in the
past year when people's interaction, conversation, and expression were limited
physically. It is becoming increasingly important to keep the platform safe
from abusive content for better user experience. Much work has been done on
English social media content but text analysis on non-English social media is
relatively underexplored. Non-English social media content has the additional
challenges of code-mixing, transliteration, and the use of different scripts in the
same sentence. In this work, we propose an approach for abusiveness
identification on the multilingual Moj dataset, which comprises Indic
languages. Our approach tackles the common challenges of non-English social
media content and can be extended to other languages as well.
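To make the "different scripts in the same sentence" challenge concrete, here is a minimal sketch of a mixed-script check. It is not the authors' method; the helper names (`char_script`, `scripts_in`, `is_mixed_script`) are hypothetical, and the script label is an approximation taken from the first word of each character's Unicode name.

```python
import unicodedata
from typing import Optional, Set


def char_script(ch: str) -> Optional[str]:
    """Return a coarse script label for an alphabetic character, else None.

    Assumption: the first word of the Unicode character name
    (e.g. "DEVANAGARI LETTER KA", "LATIN SMALL LETTER A") is a good
    enough proxy for the script. Combining marks and digits are skipped.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def scripts_in(text: str) -> Set[str]:
    """Collect the coarse script labels that occur in the text."""
    found = set()
    for ch in text:
        script = char_script(ch)
        if script is not None:
            found.add(script)
    return found


def is_mixed_script(text: str) -> bool:
    """True when the text contains letters from more than one script."""
    return len(scripts_in(text)) > 1


if __name__ == "__main__":
    print(scripts_in("yeh movie बहुत अच्छी thi"))
    print(is_mixed_script("yeh movie bahut acchi thi"))
```

A check like this only flags mixed scripts. Transliterated text, such as Hindi written entirely in Latin characters, passes as single-script and would need a separate language-identification step.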
Related papers
- SS-GEN: A Social Story Generation Framework with Large Language Models [87.11067593512716]
Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines.
Social Stories are traditionally crafted by psychology experts under strict constraints to address these challenges.
We propose SS-GEN, a framework to generate Social Stories in real-time with broad coverage.
Date: 2024-06-22T00:14:48Z
- ArMeme: Propagandistic Content in Arabic Memes [9.48177009736915]
We develop an Arabic memes dataset with manual annotations of propagandistic content.
We provide a comprehensive analysis aiming to develop computational tools for their detection.
Date: 2024-06-06T09:56:49Z
- IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages [62.60787450345489]
We explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay.
Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing.
Date: 2023-11-21T07:50:53Z
- ChatGPT for Us: Preserving Data Privacy in ChatGPT via Dialogue Text Ambiguation to Expand Mental Health Care Delivery [52.73936514734762]
ChatGPT has gained popularity for its ability to generate human-like dialogue.
Data-sensitive domains face challenges in using ChatGPT due to privacy and data-ownership concerns.
We propose a text ambiguation framework that preserves user privacy.
Date: 2023-05-19T02:09:52Z
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
Date: 2022-12-27T16:08:49Z
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
Date: 2022-04-07T14:28:51Z
- Toxicity Detection for Indic Multilingual Social Media Content [0.0]
This paper describes the system proposed by team 'Moj Masti' using the data provided by ShareChat/Moj in the IIIT-D Abusive Comment Identification challenge.
We focus on how we can leverage multilingual transformer-based pre-trained and fine-tuned models to approach code-mixed/code-switched classification tasks.
Date: 2022-01-03T12:01:47Z
- Can You be More Social? Injecting Politeness and Positivity into Task-Oriented Conversational Agents [60.27066549589362]
Social language used by human agents is associated with greater users' responsiveness and task completion.
The model uses a sequence-to-sequence deep learning architecture, extended with a social language understanding element.
Evaluation in terms of content preservation and social language level using both human judgment and automatic linguistic measures shows that the model can generate responses that enable agents to address users' issues in a more socially appropriate way.
Date: 2020-12-29T08:22:48Z
- Characterising User Content on a Multi-lingual Social Network [9.13241181020543]
We present our characterisation of a multilingual social network in India called ShareChat.
We collect an exhaustive dataset across 72 weeks before and during the Indian general elections of 2019 across 14 languages.
We find that Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political images.
Date: 2020-04-23T22:25:48Z
- A Unified System for Aggression Identification in English Code-Mixed and Uni-Lingual Texts [25.15521897068512]
We introduce a unified and robust deep learning architecture which works for both an English code-mixed dataset and a uni-lingual English dataset.
The devised system uses psycho-linguistic features and very basic linguistic features.
Our proposed system outperforms all the previous approaches on both the English code-mixed dataset and the uni-lingual English dataset.
Date: 2020-01-15T17:06:29Z
- "Hinglish" Language -- Modeling a Messy Code-Mixed Language [0.0]
This project uses deep learning techniques to classify social content written in Hindi-English into Abusive, Hate-Inducing and Not offensive categories.
We utilize bi-directional sequence models with easy text augmentation techniques such as synonym replacement, random insertion, random swap, and random deletion.
Date: 2019-12-30T23:01:28Z