Evaluating BERT-based Pre-training Language Models for Detecting Misinformation
- URL: http://arxiv.org/abs/2203.07731v1
- Date: Tue, 15 Mar 2022 08:54:36 GMT
- Title: Evaluating BERT-based Pre-training Language Models for Detecting Misinformation
- Authors: Rini Anggrainingsih, Ghulam Mubashar Hassan and Amitava Datta
- Abstract summary: It is challenging to control the quality of online information due to the lack of supervision over all the information posted online.
There is a need for automated rumour detection techniques to limit the adverse effects of spreading misinformation.
This study proposes using BERT-based pre-trained language models to encode text data into vectors and neural network models to classify these vectors to detect misinformation.
- Score: 2.1915057426589746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is challenging to control the quality of online information due to the
lack of supervision over all the information posted online. Manual checking is
almost impossible given the vast number of posts made on online media and how
quickly they spread. Therefore, there is a need for automated rumour detection
techniques to limit the adverse effects of spreading misinformation. Previous
studies mainly focused on finding and extracting the significant features of
text data. However, extracting features is time-consuming and not a highly
effective process. This study proposes using BERT-based pre-trained language
models to encode text data into vectors and neural network models to classify
these vectors to detect misinformation. Furthermore, the performance of
different language models (LMs) with different numbers of trainable parameters
was compared. The proposed technique was tested on different short and long
text datasets, and its results were compared with those of state-of-the-art
techniques on the same datasets. The results show that the proposed technique
performs better than the state-of-the-art techniques. We also tested the
proposed technique on the combined datasets; the results demonstrate that
larger training and testing data sizes considerably improve the technique's
performance.
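A minimal sketch of the pipeline described above, assuming the Hugging Face transformers library and PyTorch: a pre-trained BERT encoder turns each post into a fixed-size vector, and a small feed-forward network classifies that vector as rumour or non-rumour. The checkpoint name, hidden size, and classifier head below are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch: BERT-based text encoding + neural network classifier.
# All hyperparameters here are placeholders, not the paper's settings.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

ENCODER = "bert-base-uncased"  # any BERT-family checkpoint could be swapped in


class MisinformationClassifier(nn.Module):
    def __init__(self, encoder_name: str = ENCODER, hidden: int = 128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(0.1), nn.Linear(hidden, 2)
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # [CLS] embedding as the text vector
        return self.head(cls_vec)              # logits: [non-rumour, rumour]


tok = AutoTokenizer.from_pretrained(ENCODER)
model = MisinformationClassifier()
batch = tok(["Breaking: miracle cure confirmed by anonymous sources!"],
            padding=True, truncation=True, max_length=128, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.softmax(dim=-1))
```

Freezing or fine-tuning the encoder changes the number of trainable parameters, which is the kind of comparison across language models that the abstract mentions.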
Related papers
- Investigating the Impact of Semi-Supervised Methods with Data Augmentation on Offensive Language Detection in Romanian Language [2.2823100315094624]
Offensive language detection is a crucial task in today's digital landscape.
Building robust offensive language detection models requires large amounts of labeled data.
Semi-supervised learning offers a feasible solution by utilizing labeled and unlabeled data.
arXiv Detail & Related papers (2024-07-29T15:02:51Z)
- Probing Language Models for Pre-training Data Detection [11.37731401086372]
We propose to utilize the probing technique for pre-training data detection by examining the model's internal activations.
Our method is simple and effective and leads to more trustworthy pre-training data detection.
arXiv Detail & Related papers (2024-06-03T13:58:04Z)
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- Towards Better Query Classification with Multi-Expert Knowledge Condensation in JD Ads Search [12.701416688678622]
The shallow FastText model is widely used for efficient online inference.
BERT is an effective solution, but it causes higher online inference latency and more expensive computing costs.
We propose knowledge condensation to boost the classification performance of the online FastText model under strict low latency constraints.
arXiv Detail & Related papers (2023-08-02T12:05:01Z)
- DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text [26.02072055825044]
We introduce two novel zero-shot methods for detecting machine-generated text by leveraging the log rank information.
One is called DetectLLM-LRR, which is fast and efficient, and the other is called DetectLLM-NPR, which is more accurate but slower due to the need for perturbations (a rough log-rank scoring sketch appears after this list).
Our experiments on three datasets and seven language models show that our proposed methods improve over the state of the art by 3.9 and 1.75 AUROC points absolute.
arXiv Detail & Related papers (2023-05-23T11:18:30Z)
- Transformer-based approaches to Sentiment Detection [55.41644538483948]
We examined the performance of four different types of state-of-the-art transformer models for text classification.
The RoBERTa transformer model performs best on the test dataset with a score of 82.6% and is highly recommended for quality predictions.
arXiv Detail & Related papers (2023-03-13T17:12:03Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
- Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data [48.7576911714538]
We discuss how these techniques can be used to detect political content across different platforms.
We compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks.
Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models.
arXiv Detail & Related papers (2022-07-01T15:23:23Z)
- Sample Efficient Approaches for Idiomaticity Detection [6.481818246474555]
This work explores sample efficient methods of idiomaticity detection.
In particular, we study the impact of Pattern Exploit Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings.
Our experiments show that while PET improves performance on English, these methods are much less effective on Portuguese and Galician, leading to overall performance about on par with vanilla mBERT.
arXiv Detail & Related papers (2022-05-23T13:46:35Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
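The DetectLLM entry above leverages token log-rank information for zero-shot detection of machine-generated text. The sketch below scores a text by the ratio of its average token log-likelihood magnitude to its average token log-rank under a causal language model; this is only one plausible reading of "log rank information", and the exact statistic, model, and threshold used by DetectLLM-LRR may differ (gpt2 here is just a placeholder scoring model).

```python
# Hedged sketch: a generic log-likelihood / log-rank score for a text,
# in the spirit of log-rank-based zero-shot detectors (not necessarily
# the exact DetectLLM-LRR formula, which is defined in the paper itself).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder scoring model
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()


@torch.no_grad()
def log_rank_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids      # (1, T)
    logits = lm(ids).logits[:, :-1]                      # predict token i+1 from its prefix
    targets = ids[:, 1:]                                 # observed next tokens
    logprobs = torch.log_softmax(logits, dim=-1)
    tok_logprob = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # rank of each observed token among the vocabulary (1 = most likely)
    ranks = (logprobs > tok_logprob.unsqueeze(-1)).sum(dim=-1) + 1
    avg_loglik = tok_logprob.mean()
    avg_logrank = torch.log(ranks.float()).mean()
    # higher ratio = tokens are both likely and highly ranked, which such
    # detectors treat as a hint of machine-generated text (assumption)
    return (avg_loglik.abs() / (avg_logrank + 1e-8)).item()


print(log_rank_score("The quick brown fox jumps over the lazy dog."))
```

A threshold on this score, tuned on held-out human and machine text, would turn it into a binary detector.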
This list is automatically generated from the titles and abstracts of the papers on this site.