Classification Benchmarks for Under-resourced Bengali Language based on
Multichannel Convolutional-LSTM Network
- URL: http://arxiv.org/abs/2004.07807v2
- Date: Sun, 19 Apr 2020 17:21:30 GMT
- Title: Classification Benchmarks for Under-resourced Bengali Language based on
Multichannel Convolutional-LSTM Network
- Authors: Md. Rezaul Karim and Bharathi Raja Chakravarthi and John P. McCrae and
Michael Cochez
- Abstract summary: We build the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText.
We incorporate word embeddings into a Multichannel Convolutional-LSTM network for predicting different types of hate speech, document classification, and sentiment analysis.
- Score: 3.0168410626760034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The exponential growth of social media and micro-blogging sites not only provides
platforms for empowering freedom of expression and individual voices but also
enables people to express anti-social behaviour such as online harassment,
cyberbullying, and hate speech. Numerous works have proposed using
these data for analysing social and anti-social behaviour, document
characterization, and sentiment analysis by predicting the contexts, mostly for
highly resourced languages such as English. However, some languages
are under-resourced, e.g., South Asian languages like Bengali, Tamil, Assamese,
and Telugu, which lack computational resources for NLP tasks. In this paper,
we provide several classification benchmarks for Bengali, an under-resourced
language. We prepared three datasets covering expressions of hate, commonly used topics,
and opinions for hate speech detection, document classification, and sentiment
analysis, respectively. We built the largest Bengali word embedding models to
date based on 250 million articles, which we call BengFastText. We perform
three different experiments, covering document classification, sentiment
analysis, and hate speech detection. We incorporate word embeddings into a
Multichannel Convolutional-LSTM (MConv-LSTM) network for predicting different
types of hate speech, document classification, and sentiment analysis.
Experiments demonstrate that BengFastText can capture the semantics of words
from their respective contexts correctly. Evaluations against several baseline
embedding models, e.g., Word2Vec and GloVe, yield up to 92.30%, 82.25%, and
90.45% F1-scores for document classification, sentiment analysis, and
hate speech detection, respectively, during 5-fold cross-validation tests.
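The abstract reports F1-scores obtained during 5-fold cross-validation. The sketch below illustrates that evaluation protocol in plain Python and is not the authors' code. It assumes macro-averaged F1 and contiguous, unshuffled folds, neither of which the abstract specifies; the names `k_fold_indices`, `macro_f1`, `cross_validated_f1`, and the `fit_predict` callback are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validated_f1(
    texts: Sequence[str],
    labels: Sequence[str],
    fit_predict: Callable[[List[str], List[str], List[str]], List[str]],
    k: int = 5,
) -> float:
    """Average macro F1 over k folds.

    `fit_predict(train_texts, train_labels, test_texts)` is a hypothetical
    callback that trains any classifier and returns labels for test_texts.
    """
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(texts), k):
        predictions = fit_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in test_idx],
        )
        fold_scores.append(macro_f1([labels[i] for i in test_idx], predictions))
    return sum(fold_scores) / len(fold_scores)
```

Any classifier can be plugged in through `fit_predict`, for example a network fed with pre-trained word embeddings. In practice the data would usually be shuffled or stratified by class before splitting, which this sketch omits for brevity.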
Related papers
- Countering Malicious Content Moderation Evasion in Online Social
Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- Hate Speech and Offensive Language Detection in Bengali [5.765076125746209]
We develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets.
We implement several baseline models for the classification of such hateful posts.
We also explore the interlingual transfer mechanism to boost classification performance.
arXiv Detail & Related papers (2022-10-07T12:06:04Z)
- Multimodal Hate Speech Detection from Bengali Memes and Texts [0.6709991492637819]
This paper is about hate speech detection from multimodal Bengali memes and texts.
We train several neural networks to analyze textual and visual information for hate speech detection.
Our study suggests that memes are moderately useful for hate speech detection in Bengali, but none of the multimodal models outperform unimodal models.
arXiv Detail & Related papers (2022-04-19T11:15:25Z)
- A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Ceasing hate with MoH: Hate Speech Detection in Hindi-English
Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- DeepHateExplainer: Explainable Hate Speech Detection in Under-resourced
Bengali Language [1.2246649738388389]
We propose an explainable approach for hate speech detection from the under-resourced Bengali language.
In our approach, Bengali texts are first comprehensively preprocessed, before classifying them into political, personal, geopolitical, and religious hates.
Evaluations against machine learning (linear and tree-based models) and deep neural networks (i.e., CNN, Bi-LSTM, and Conv-LSTM with word embeddings) baselines yield F1 scores of 84%, 90%, 88%, and 88%, for political, personal, geopolitical, and religious hates, respectively.
arXiv Detail & Related papers (2020-12-28T16:46:03Z)
- Hate Speech detection in the Bengali language: A dataset and its
baseline evaluation [0.8793721044482612]
This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts.
All the comments are collected from YouTube and Facebook comment sections and classified into seven categories.
A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
arXiv Detail & Related papers (2020-12-17T15:53:54Z)