A Twitter BERT Approach for Offensive Language Detection in Marathi
- URL: http://arxiv.org/abs/2212.10039v1
- Date: Tue, 20 Dec 2022 07:22:45 GMT
- Title: A Twitter BERT Approach for Offensive Language Detection in Marathi
- Authors: Tanmay Chavan, Shantanu Patankar, Aditya Kane, Omkar Gokhale, Raviraj
Joshi
- Abstract summary: This paper describes our work on Offensive Language Identification in the low-resource Indic language Marathi.
We evaluate different monolingual and multilingual BERT models on this classification task, focusing on BERT models pre-trained on social media datasets.
MahaTweetBERT, a BERT model pre-trained on Marathi tweets, outperforms all other models with an F1 score of 98.43 on the HASOC 2022 test set when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated offensive language detection is essential in combating the spread
of hate speech, particularly in social media. This paper describes our work on
Offensive Language Identification in the low-resource Indic language Marathi. The
problem is formulated as a text classification task to identify a tweet as
offensive or non-offensive. We evaluate different monolingual and
multilingual BERT models on this classification task, focusing on BERT models
pre-trained on social media datasets. We compare the performance of MuRIL,
MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set.
We also explore external data augmentation from other existing Marathi hate
speech corpora, HASOC 2021 and L3Cube-MahaHate. MahaTweetBERT, a BERT model
pre-trained on Marathi tweets, outperforms all other models with an F1 score of
98.43 on the HASOC 2022 test set when fine-tuned on the combined dataset
(HASOC 2021 + HASOC 2022 + MahaHate). With this, we also provide a new
state-of-the-art result on the HASOC 2022 / MOLD v2 test set.
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages
This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati.
The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content.
We fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech.
arXiv Detail & Related papers (2023-10-03T17:53:09Z)
- Unify word-level and span-level tasks: NJUNLP's Participation for the WMT2023 Quality Estimation Shared Task
We introduce the NJUNLP team to the WMT 2023 Quality Estimation (QE) shared task.
Our team submitted predictions for the English-German language pair on both sub-tasks.
Our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks.
arXiv Detail & Related papers (2023-09-23T01:52:14Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi
We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, IndicBERT, and XLM-RoBERTa and compare them with MahaBERT, MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT based models provide rich representations as compared to sentence embeddings from multi-lingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z)
- Hate and Offensive Speech Detection in Hindi and Marathi
Hate and offensive speech detection still faces challenges due to the inadequate availability of data.
In this work, we consider hate and offensive speech detection in Hindi and Marathi texts.
We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa.
We show that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance.
arXiv Detail & Related papers (2021-10-23T11:57:36Z)
- KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media text
This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European languages.
Datasets for two Dravidian languages, viz. Malayalam and Tamil, each containing 4000 observations, were shared by the HASOC organizers.
The best performing classification models developed for both languages are applied on test datasets.
arXiv Detail & Related papers (2021-02-19T11:08:02Z)
- HASOCOne@FIRE-HASOC2020: Using BERT and Multilingual BERT models for Hate Speech Detection
We propose an approach to automatically classify hate speech and offensive content.
We have used the datasets obtained from FIRE 2019 and 2020 shared tasks.
We observed that the pre-trained BERT model and the multilingual-BERT model gave the best results.
arXiv Detail & Related papers (2021-01-22T08:55:32Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.