Related papers: Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages

Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages

URL: http://arxiv.org/abs/2204.12543v1
Date: Tue, 26 Apr 2022 18:56:01 GMT
Title: Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages
Authors: Mithun Das and Somnath Banerjee and Animesh Mukherjee
Abstract summary: We demonstrate a large-scale analysis of multilingual abusive speech in Indic languages. We examine different interlingual transfer mechanisms and observe the performance of various multilingual models for abusive speech detection.
Score: 5.51252705016179
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Abusive language is a growing concern in many social media platforms. Repeated exposure to abusive speech has created physiological effects on the target users. Thus, the problem of abusive language should be addressed in all forms for online peace and safety. While extensive research exists in abusive speech detection, most studies focus on English. Recently, many smearing incidents have occurred in India, which provoked diverse forms of abusive speech in online space in various languages based on the geographic location. Therefore it is essential to deal with such malicious content. In this paper, to bridge the gap, we demonstrate a large-scale analysis of multilingual abusive speech in Indic languages. We examine different interlingual transfer mechanisms and observe the performance of various multilingual models for abusive speech detection for eight different Indic languages. We also experiment to show how robust these models are on adversarial attacks. Finally, we conduct an in-depth error analysis by looking into the models' misclassified posts across various settings. We have made our code and models public for other researchers.

Related papers

Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time.<n>Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages [47.45957604683302]
We study whether pre-trained language models are agnostic to linguistically grounded attacks or not. Our findings reveal that although PLMs are susceptible to linguistic perturbations, when compared to non-linguistic attacks, PLMs exhibit a slightly lower susceptibility to linguistic attacks.
arXiv Detail & Related papers (2024-12-14T12:10:38Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
Examining Temporal Bias in Abusive Language Detection [3.465144840147315]
Machine learning models have been developed to automatically detect abusive language. These models can suffer from temporal bias, the phenomenon in which topics, language use or social norms change over time. This study investigates the nature and impact of temporal bias in abusive language detection across various languages.
arXiv Detail & Related papers (2023-09-25T13:59:39Z)
Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems. This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language. We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach [0.0]
In this paper, we explore several machine learning models for abusive and threatening content detection in Urdu based on the shared task. Our model came First for both abusive and threatening content detection with an F1scoreof 0.88 and 0.54, respectively.
arXiv Detail & Related papers (2021-11-27T20:03:19Z)
Cross-lingual Capsule Network for Hate Speech Detection in Social Media [6.531659195805749]
We investigate the cross-lingual hate speech detection task, tackling the problem by adapting the hate speech resources from one language to another. We propose a cross-lingual capsule network learning model coupled with extra domain-specific lexical semantics for hate speech. Our model achieves state-of-the-art performance on benchmark datasets from AMI@Evalita 2018 and AMI@Ibereval 2018.
arXiv Detail & Related papers (2021-08-06T12:53:41Z)
Cross-lingual hate speech detection based on multilingual domain-specific word embeddings [4.769747792846004]
We propose to address the problem of multilingual hate speech detection from the perspective of transfer learning. Our goal is to determine if knowledge from one particular language can be used to classify other language. We show that the use of our simple yet specific multilingual hate representations improves classification results.
arXiv Detail & Related papers (2021-04-30T02:24:50Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Detect All Abuse! Toward Universal Abusive Language Detection Models [5.840117063192334]
We introduce a new generic ALD framework, MACAS, which is capable of addressing several types of ALD tasks across different domains. Our framework covers multi-aspect abusive language embeddings that represent the target and content aspects of abusive language. Then, we propose and use the cross-attention gate flow mechanism to embrace multiple aspects of abusive language.
arXiv Detail & Related papers (2020-10-08T05:39:00Z)
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source. We observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.