KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for
Detection of Hate Speech and Offensive Code-Mixed Social Media text
- URL: http://arxiv.org/abs/2102.09866v1
- Date: Fri, 19 Feb 2021 11:08:02 GMT
- Title: KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for
Detection of Hate Speech and Offensive Code-Mixed Social Media text
- Authors: Varsha Pathak, Manish Joshi, Prasad Joshi, Monica Mundada and Tanmay
Joshi
- Abstract summary: This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European languages.
Datasets for two Dravidian languages, viz. Malayalam and Tamil, of 4000 observations each were shared by the HASOC organizers.
The best performing classification models developed for both languages are applied to the test datasets.
- Score: 1.0499611180329804
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper describes the system submitted by our team, KBCNMUJAL, for Task 2
of the shared task Hate Speech and Offensive Content Identification in
Indo-European Languages (HASOC), at Forum for Information Retrieval Evaluation,
December 16-20, 2020, Hyderabad, India. Datasets for two Dravidian languages,
viz. Malayalam and Tamil, of 4000 observations each were shared by the
HASOC organizers. These datasets are used to train the machine using different
machine learning algorithms, based on classification and regression models. The
datasets consist of tweets or YouTube comments with two class labels: offensive
and not offensive. The machine is trained to classify such social media
messages into these two categories. Appropriate n-gram feature sets are extracted
to learn the specific characteristics of hate speech messages. These
feature models are based on TF-IDF weights of n-grams. The referenced work and
our experiments show that word n-grams, character n-grams, and a
combined model of word and character n-grams can be used to identify the term
patterns of offensive text content. As part of the HASOC shared task, the
test datasets were made available by the HASOC track organizers. The best
performing classification models developed for both languages were applied to
the test datasets. The model with the highest accuracy on the Malayalam
training dataset was used to predict the categories of the corresponding test
data and obtained an F1 score of 0.77. Similarly, the best performing model
for Tamil obtained an F1 score of 0.87. This work ranked 2nd and 3rd in
Task 2 for Malayalam and Tamil, respectively. The proposed system is named
HASOC_kbcnmujal.
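The word, character, and combined n-gram features with TF-IDF weights described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the n-gram sizes, the `w:`/`c:` prefixes, the TF-IDF variant (relative term frequency times log(N/df)), and the toy comments are all assumptions, and the classifier that would consume these vectors is omitted because the abstract does not name one.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams of a whitespace-tokenised, lower-cased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams of the lower-cased text (spaces included)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def combined_ngrams(text, word_n=1, char_n=3):
    """Combined feature set: word and character n-grams.

    The "w:" and "c:" prefixes (an assumption of this sketch) keep a word
    feature from colliding with an identical character feature.
    """
    return (["w:" + g for g in word_ngrams(text, word_n)]
            + ["c:" + g for g in char_ngrams(text, char_n)])


def tfidf(docs_features):
    """Compute TF-IDF weights for a list of per-document feature lists.

    tf = count / document length, idf = log(N / df). This is one common
    variant; the exact weighting used in the paper is not stated here.
    """
    n_docs = len(docs_features)
    df = Counter()
    for feats in docs_features:
        df.update(set(feats))  # count each feature once per document
    vectors = []
    for feats in docs_features:
        counts = Counter(feats)
        total = len(feats) or 1  # guard against empty documents
        vectors.append({f: (c / total) * math.log(n_docs / df[f])
                        for f, c in counts.items()})
    return vectors


# Hypothetical toy comments, not taken from the HASOC datasets.
comments = ["nalla padam super", "padam waste mokka", "super song nalla"]
features = [combined_ngrams(c) for c in comments]
vectors = tfidf(features)  # one sparse {feature: weight} dict per comment
```

Each resulting dictionary is a sparse feature vector that could be passed to any standard classifier. Character n-grams are often helpful for code-mixed text because romanised spellings of the same word vary between writers.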
Related papers
- cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media
Comments using Spatio-Temporally Retrained Language Models [0.9012198585960441]
This paper describes our multiclass classification system developed as part of the LT-EDI@RANLP-2023 shared task.
We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions.
We developed the best performing seven-label classification system for Malayalam based on weighted macro averaged F1 score.
arXiv Detail & Related papers (2023-08-20T21:30:34Z)
- Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource
Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for
Offensive Language Identification in Tanglish [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC2021: Hate Speech and Offensive Language Identification in Dravidian languages.
This task aims to identify offensive content in code-mixed comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2021-10-06T15:23:40Z)
- Offensive Language Identification in Low-resourced Code-mixed Dravidian
languages using Pseudo-labeling [0.16252563723817934]
We classify code-mixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z)
- indicnlp@kgp at DravidianLangTech-EACL2021: Offensive Language
Identification in Dravidian Languages [0.0]
The paper presents the submission of the team indicnlp@kgp to the EACL 2021 shared task "Offensive Language Identification in Dravidian languages".
The task aimed to classify different offensive content types in 3 code-mixed Dravidian language datasets.
We achieved weighted-average F1 scores of 0.97, 0.77, and 0.72 in the Malayalam-English, Tamil-English, and Kannada-English datasets.
arXiv Detail & Related papers (2021-02-14T13:24:01Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on
Synthetically Generated Code-Mixed Data for Hate Speech Detection [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC 2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English).
The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2020-10-05T15:25:47Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- A Sentence Cloze Dataset for Chinese Machine Reading Comprehension [64.07894249743767]
We propose a new task called Sentence Cloze-style Machine Reading (SC-MRC).
The proposed task aims to fill the right candidate sentence into the passage that has several blanks.
We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
arXiv Detail & Related papers (2020-04-07T04:09:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.