Data Bootstrapping Approaches to Improve Low Resource Abusive Language
Detection for Indic Languages
- URL: http://arxiv.org/abs/2204.12543v1
- Date: Tue, 26 Apr 2022 18:56:01 GMT
- Title: Data Bootstrapping Approaches to Improve Low Resource Abusive Language
Detection for Indic Languages
- Authors: Mithun Das and Somnath Banerjee and Animesh Mukherjee
- Abstract summary: We demonstrate a large-scale analysis of multilingual abusive speech in Indic languages.
We examine different interlingual transfer mechanisms and observe the performance of various multilingual models for abusive speech detection.
- Score: 5.51252705016179
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Abusive language is a growing concern on many social media platforms.
Repeated exposure to abusive speech has created physiological effects on the
target users. Thus, the problem of abusive language should be addressed in all
forms for online peace and safety. While extensive research exists in abusive
speech detection, most studies focus on English. Recently, many smearing
incidents have occurred in India, which provoked diverse forms of abusive
speech in online space in various languages based on the geographic location.
Therefore it is essential to deal with such malicious content. In this paper,
to bridge the gap, we demonstrate a large-scale analysis of multilingual
abusive speech in Indic languages. We examine different interlingual transfer
mechanisms and observe the performance of various multilingual models for
abusive speech detection for eight different Indic languages. We also
experiment to show how robust these models are to adversarial attacks. Finally,
we conduct an in-depth error analysis by looking into the models' misclassified
posts across various settings. We have made our code and models public for
other researchers.
Related papers
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Examining Temporal Bias in Abusive Language Detection [3.465144840147315]
Machine learning models have been developed to automatically detect abusive language.
These models can suffer from temporal bias, the phenomenon in which topics, language use or social norms change over time.
This study investigates the nature and impact of temporal bias in abusive language detection across various languages.
arXiv Detail & Related papers (2023-09-25T13:59:39Z) - Countering Malicious Content Moderation Evasion in Online Social
Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z) - A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Abusive and Threatening Language Detection in Urdu using Boosting based
and BERT based models: A Comparative Approach [0.0]
In this paper, we explore several machine learning models for abusive and threatening content detection in Urdu based on the shared task.
Our model came first for both abusive and threatening content detection, with F1 scores of 0.88 and 0.54, respectively.
arXiv Detail & Related papers (2021-11-27T20:03:19Z) - Cross-lingual Capsule Network for Hate Speech Detection in Social Media [6.531659195805749]
We investigate the cross-lingual hate speech detection task, tackling the problem by adapting the hate speech resources from one language to another.
We propose a cross-lingual capsule network learning model coupled with extra domain-specific lexical semantics for hate speech.
Our model achieves state-of-the-art performance on benchmark datasets from AMI@Evalita 2018 and AMI@Ibereval 2018.
arXiv Detail & Related papers (2021-08-06T12:53:41Z) - Cross-lingual hate speech detection based on multilingual
domain-specific word embeddings [4.769747792846004]
We propose to address the problem of multilingual hate speech detection from the perspective of transfer learning.
Our goal is to determine whether knowledge from one particular language can be used to classify other languages.
We show that the use of our simple yet specific multilingual hate representations improves classification results.
arXiv Detail & Related papers (2021-04-30T02:24:50Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Detect All Abuse! Toward Universal Abusive Language Detection Models [5.840117063192334]
We introduce a new generic ALD framework, MACAS, which is capable of addressing several types of ALD tasks across different domains.
Our framework covers multi-aspect abusive language embeddings that represent the target and content aspects of abusive language.
Then, we propose and use the cross-attention gate flow mechanism to embrace multiple aspects of abusive language.
arXiv Detail & Related papers (2020-10-08T05:39:00Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.