Two-Stage Classifier for COVID-19 Misinformation Detection Using BERT: a
Study on Indonesian Tweets
- URL: http://arxiv.org/abs/2206.15359v1
- Date: Thu, 30 Jun 2022 15:33:20 GMT
- Title: Two-Stage Classifier for COVID-19 Misinformation Detection Using BERT: a
Study on Indonesian Tweets
- Authors: Douglas Raevan Faisal and Rahmad Mahendra
- Abstract summary: Research on COVID-19 misinformation detection in Indonesia is still scarce.
In this study, we propose a two-stage classifier model using the IndoBERT pre-trained language model for the tweet misinformation detection task.
The experimental results show that the combination of the BERT sequence classifier for relevance prediction and Bi-LSTM for misinformation detection outperformed other machine learning models with an accuracy of 87.02%.
- Score: 0.15229257192293202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The COVID-19 pandemic has caused globally significant impacts since the
beginning of 2020. This brought a lot of confusion to society, especially due
to the spread of misinformation through social media. Although there were
already several studies related to the detection of misinformation in social
media data, most studies focused on the English dataset. Research on COVID-19
misinformation detection in Indonesia is still scarce. Therefore, through this
research, we collect and annotate datasets for Indonesian and build prediction
models for detecting COVID-19 misinformation by considering the tweet's
relevance. The dataset construction is carried out by a team of annotators who
labeled the relevance and misinformation of the tweet data. In this study, we
propose a two-stage classifier model using the IndoBERT pre-trained language
model for the tweet misinformation detection task. We also experiment with
several other baseline models for text classification. The experimental results
show that the combination of the BERT sequence classifier for relevance
prediction and Bi-LSTM for misinformation detection outperformed other machine
learning models with an accuracy of 87.02%. Overall, the BERT utilization
contributes to the higher performance of most prediction models. We release a
high-quality COVID-19 misinformation Tweet corpus in the Indonesian language,
indicated by the high inter-annotator agreement.
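The two-stage idea described in the abstract (relevance prediction first, then misinformation detection) can be sketched as a simple control flow. This is a minimal illustration, not the authors' implementation: the class name, the label strings, the three-way output, and the assumption that the second stage only runs on tweets the first stage marks as relevant are all hypothetical. The toy keyword rules merely stand in for the trained BERT and Bi-LSTM models.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical label names; the paper's actual label set is not given here.
IRRELEVANT = "irrelevant"
MISINFORMATION = "misinformation"
NOT_MISINFORMATION = "not_misinformation"


@dataclass
class TwoStageClassifier:
    """Chains a relevance model and a misinformation model (assumed design)."""

    # Stage 1: returns True when a tweet is relevant to the topic.
    is_relevant: Callable[[str], bool]
    # Stage 2: returns True when a relevant tweet is judged misinformation.
    is_misinformation: Callable[[str], bool]

    def predict(self, tweet: str) -> str:
        # Irrelevant tweets stop at stage 1 and never reach stage 2.
        if not self.is_relevant(tweet):
            return IRRELEVANT
        if self.is_misinformation(tweet):
            return MISINFORMATION
        return NOT_MISINFORMATION

    def predict_batch(self, tweets: List[str]) -> List[str]:
        return [self.predict(tweet) for tweet in tweets]


# Toy keyword rules standing in for trained models (illustration only).
def toy_relevance(tweet: str) -> bool:
    return "covid" in tweet.lower()


def toy_misinformation(tweet: str) -> bool:
    return "miracle cure" in tweet.lower()


pipeline = TwoStageClassifier(toy_relevance, toy_misinformation)
predictions = pipeline.predict_batch(
    [
        "Nice weather today",
        "COVID vaccines are available at the clinic",
        "This miracle cure stops COVID overnight",
    ]
)
```

In practice each callable would wrap a trained model's prediction function; keeping the stages as separate callables makes it easy to swap different model combinations in and out of either stage.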
Related papers
- Harnessing the Power of Text-image Contrastive Models for Automatic
Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows the superior performance of non-matched image-text pair detection when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- A Large-Scale Comparative Study of Accurate COVID-19 Information versus
Misinformation [4.926199465135915]
The COVID-19 pandemic led to an infodemic where an overwhelming amount of COVID-19 related content was being disseminated at high velocity through social media.
This motivated us to carry out a comparative study of the characteristics of COVID-19 misinformation versus those of accurate COVID-19 information through a large-scale computational analysis of over 242 million tweets.
An added contribution of this study is the creation of a COVID-19 misinformation classification dataset.
arXiv Detail & Related papers (2023-04-10T18:44:41Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Towards Fine-Grained Information: Identifying the Type and Location of
Translation Errors [80.22825549235556]
Existing approaches cannot consider error position and type simultaneously.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Testing the Generalization of Neural Language Models for COVID-19
Misinformation Detection [6.1204874238049705]
A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic.
We evaluate fifteen Transformer-based models on five COVID-19 misinformation datasets.
We show tokenizers and models tailored to COVID-19 data do not provide a significant advantage over general-purpose ones.
arXiv Detail & Related papers (2021-11-15T15:01:55Z)
- Combat COVID-19 Infodemic Using Explainable Natural Language Processing
Models [15.782463163357976]
We propose an explainable natural language processing model based on DistilBERT and SHAP to combat misinformation about COVID-19.
Our results provided good implications in detecting misinformation about COVID-19 and improving public trust.
arXiv Detail & Related papers (2021-03-01T04:28:39Z)
- Eating Garlic Prevents COVID-19 Infection: Detecting Misinformation on
the Arabic Content of Twitter [0.23624125155742054]
We construct a large Arabic dataset related to COVID-19 misinformation and gold-annotate the tweets into two categories: misinformation or not.
We apply eight different traditional and deep machine learning models, with different features including word embeddings and word frequency.
Experiments show that optimizing the area under the curve (AUC) improves the models' performance and the Extreme Gradient Boosting (XGBoost) presents the highest accuracy in detecting COVID-19 misinformation online.
arXiv Detail & Related papers (2021-01-09T22:52:21Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Misinformation Has High Perplexity [55.47422012881148]
We propose to leverage the perplexity to debunk false claims in an unsupervised manner.
First, we extract reliable evidence from scientific and news sources according to sentence similarity to the claims.
Second, we prime a language model with the extracted evidence and finally evaluate the correctness of given claims based on the perplexity scores at debunking time.
arXiv Detail & Related papers (2020-06-08T15:13:44Z)
- Independent Component Analysis for Trustworthy Cyberspace during High
Impact Events: An Application to Covid-19 [4.629100947762816]
Social media has become an important communication channel during high impact events, such as the COVID-19 pandemic.
As misinformation in social media can rapidly spread, creating social unrest, curtailing the spread of misinformation during such events is a significant data challenge.
We propose a data-driven solution that is based on the ICA model, such that knowledge discovery and detection of misinformation are achieved jointly.
arXiv Detail & Related papers (2020-06-01T21:48:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.