No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet
Detection
- URL: http://arxiv.org/abs/2010.06906v1
- Date: Wed, 14 Oct 2020 09:37:51 GMT
- Authors: Debanjana Kar, Mohit Bhardwaj, Suranjana Samanta, Amar Prakash Azad
- Abstract summary: We propose an approach to detect fake news about COVID-19 early on from social media, such as tweets, for multiple Indic languages besides English.
To extend our approach to multiple Indic languages, we use an mBERT-based model fine-tuned on the Hindi and Bengali dataset we created.
Our approach reaches around 89% F-Score in fake tweet detection, which supersedes the state-of-the-art (SOTA) results.
- Score: 4.411285005377513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sudden widespread menace created by the present global pandemic COVID-19
has had an unprecedented effect on our lives. Mankind is experiencing
immense fear and depends on social media like never before. Fear
inevitably leads to panic, speculations, and the spread of misinformation. Many
governments have taken measures to curb the spread of such misinformation for
public well-being. Besides global measures, to have effective outreach, systems
for demographically local languages have an important role to play in this
effort. Towards this, we propose an approach to detect fake news about COVID-19
early on from social media, such as tweets, for multiple Indic languages
besides English. In addition, we create an annotated dataset of Hindi and
Bengali tweets for fake news detection. We propose a BERT-based model augmented
with additional relevant features extracted from Twitter to identify fake
tweets. To extend our approach to multiple Indic languages, we use an mBERT-based
model fine-tuned on the Hindi and Bengali dataset we created. We
also propose a zero-shot learning approach to alleviate the data scarcity issue
for such low-resource languages. Through rigorous experiments, we show that our
approach reaches around 89% F-Score in fake tweet detection, which supersedes
the state-of-the-art (SOTA) results. Moreover, we establish the first benchmark
for two Indic languages, Hindi and Bengali. Using our annotated data, our model
achieves about 79% F-Score on Hindi and 81% F-Score on Bengali tweets. Our
zero-shot model achieves about 81% F-Score on Hindi and 78% F-Score on Bengali
tweets without any annotated data, which clearly indicates the efficacy of our
approach.
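As an illustration of the F-Score figures quoted in the abstract, the following is a minimal sketch of how a binary F-score could be computed for a fake/real tweet classifier. It assumes the reported metric is the standard F-beta score with beta = 1, which the abstract does not state explicitly. The function name `f_score` and the toy labels are hypothetical and are not taken from the paper.

```python
def f_score(y_true, y_pred, positive=1, beta=1.0):
    """Binary F-beta score for the `positive` class (assumed here to mean 'fake')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0

    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy labels: 1 = fake tweet, 0 = genuine tweet.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(f"F1 = {f_score(y_true, y_pred):.3f}")
```

On these toy labels there are four true positives, one false positive, and one false negative, so precision and recall are both 0.8 and the F1 score is 0.8.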
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages [0.6526824510982802]
This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati.
The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content.
We fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech.
arXiv Detail & Related papers (2023-10-03T17:53:09Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Ceasing hate with MoH: Hate Speech Detection in Hindi-English Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline includes language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
- Cross-lingual COVID-19 Fake News Detection [54.125563009333995]
We make the first attempt to detect COVID-19 misinformation in a low-resource language (Chinese) using only the fact-checked news in a high-resource language (English).
We propose a deep learning framework named CrossFake to jointly encode the cross-lingual news body texts and capture the news content.
Empirical results on our dataset demonstrate the effectiveness of CrossFake under the cross-lingual setting.
arXiv Detail & Related papers (2021-10-13T04:44:02Z)
- Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages.
Our dataset consists of 9,058 samples in English, 5,155 samples in Hindi, and the remaining 8,222 samples distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight the proliferation of fake news in low-resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z)
- Hostility Detection and Covid-19 Fake News Detection in Social Media [1.3499391168620467]
We build a model that makes use of an abusive language detector and features extracted via Hindi BERT and Hindi FastText models.
We also build models to identify fake news related to Covid-19 in English tweets.
arXiv Detail & Related papers (2021-01-15T03:24:36Z)
- Evaluation of Deep Learning Models for Hostility Detection in Hindi Text [2.572404739180802]
We present approaches for hostile text detection in the Hindi language.
The proposed approaches are evaluated on the Constraint@AAAI 2021 Hindi hostility detection dataset.
We evaluate a host of deep learning approaches based on CNN, LSTM, and BERT for this multi-label classification problem.
arXiv Detail & Related papers (2021-01-11T19:10:57Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)