An exploration of features to improve the generalisability of fake news detection models
- URL: http://arxiv.org/abs/2502.20299v1
- Date: Thu, 27 Feb 2025 17:26:56 GMT
- Title: An exploration of features to improve the generalisability of fake news detection models
- Authors: Nathaniel Hoy, Theodora Koulouri
- Abstract summary: Existing NLP and supervised Machine Learning methods perform well under cross-validation but struggle to generalise across datasets. This issue stems from coarsely labelled training data, where articles are labelled based on their publisher. This study demonstrates that meaningful features can still be extracted from coarsely labelled data to improve real-world robustness.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fake news poses global risks by influencing elections and spreading misinformation, making detection critical. Existing NLP and supervised Machine Learning methods perform well under cross-validation but struggle to generalise across datasets, even within the same domain. This issue stems from coarsely labelled training data, where articles are labelled based on their publisher, introducing biases that token-based models like TF-IDF and BERT are sensitive to. While Large Language Models (LLMs) offer promise, their application in fake news detection remains limited. This study demonstrates that meaningful features can still be extracted from coarsely labelled data to improve real-world robustness. Stylistic features (lexical, syntactic, and semantic) are explored due to their reduced sensitivity to dataset biases. Additionally, novel social-monetisation features are introduced, capturing economic incentives behind fake news, such as advertisements, external links, and social media elements. The study trains on the coarsely labelled NELA 2020-21 dataset and evaluates using the manually labelled Facebook URLs dataset, a gold standard for generalisability. Results highlight the limitations of token-based models trained on biased data and contribute to the scarce evidence on LLMs like LLaMa in this field. Findings indicate that stylistic and social-monetisation features offer more generalisable predictions than token-based methods and LLMs. Statistical and permutation feature importance analyses further reveal their potential to enhance performance and mitigate dataset biases, providing a path forward for improving fake news detection.
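As a rough, non-authoritative sketch of the pipeline described above (hand-crafted stylistic and social-monetisation features, a classifier trained on a coarsely labelled corpus, cross-dataset evaluation, and permutation feature importance), the snippet below uses scikit-learn with toy feature proxies. Every feature definition, marker string, and function name is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch only: toy stylistic and social-monetisation proxies,
# a classifier trained on one (coarsely labelled) corpus, evaluated on a
# separately labelled one, plus permutation feature importance.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def stylistic_features(text: str) -> list:
    """Toy lexical proxies; the paper's stylistic feature set is far richer."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    return [
        len(text) / n_words,                        # crude mean word length
        sum(w.isupper() for w in words) / n_words,  # all-caps word ratio
        text.count("!") / max(len(text), 1),        # exclamation density
        len({w.lower() for w in words}) / n_words,  # type-token ratio
    ]

def social_monetisation_features(html: str) -> list:
    """Toy proxies for monetisation signals (ads, links, social elements)."""
    page = html.lower()
    return [
        float(page.count("<a href")),  # outbound-link count proxy
        float(page.count("advert")),   # advertisement marker proxy
        float(page.count("twitter")),  # embedded social-media element proxy
    ]

def run(train_texts, train_html, y_train, test_texts, test_html, y_test):
    def featurise(texts, html):
        return np.array([stylistic_features(t) + social_monetisation_features(h)
                         for t, h in zip(texts, html)])
    X_train, X_test = featurise(train_texts, train_html), featurise(test_texts, test_html)
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X_train, y_train)
    # Generalisability is measured on the separately labelled corpus.
    print("cross-dataset accuracy:", clf.score(X_test, y_test))
    # Permutation feature importance, mirroring the analysis in the abstract.
    imp = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
    print("mean importances:", imp.importances_mean)
```

The design intuition is that none of these features are tied to a particular publisher's vocabulary, which is precisely the channel through which token-based representations such as TF-IDF inherit publisher-level labelling bias.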
Related papers
- Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training [2.8402080392117757]
This work investigates Knowledge Graph-Augmented Training (KGAT) as a novel method to mitigate bias in large language models.
Public datasets for bias assessment include Gender Shades, Bias in Bios, and FairFace.
We also performed targeted mitigation strategies to correct biased associations, leading to a significant drop in biased output and improved bias metrics.
arXiv Detail & Related papers (2025-04-01T00:27:50Z)
- Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data [3.7409402247241643]
Fake news poses a significant threat to public opinion and social stability in modern society. This study presents a comparative evaluation of BERT-like encoder-only models and autoregressive decoder-only large language models (LLMs) for fake news detection.
arXiv Detail & Related papers (2024-12-18T19:15:17Z)
- A Self-Learning Multimodal Approach for Fake News Detection [35.98977478616019]
We introduce a self-learning multimodal model for fake news classification. The model leverages contrastive learning, a robust method for feature extraction that operates without requiring labeled data. Our experimental results on a public dataset demonstrate that the proposed model outperforms several state-of-the-art classification approaches.
arXiv Detail & Related papers (2024-12-08T07:41:44Z)
- Revisiting Fake News Detection: Towards Temporality-aware Evaluation by Leveraging Engagement Earliness [22.349521957987672]
Social graph-based fake news detection aims to identify news articles containing false information by utilizing social contexts.
We formalize a more realistic evaluation scheme that mimics real-world scenarios.
We show that the discriminative capabilities of conventional methods decrease sharply under this new setting.
arXiv Detail & Related papers (2024-11-19T05:08:00Z)
- Dynamic Analysis and Adaptive Discriminator for Fake News Detection [59.41431561403343]
We propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection.
For knowledge-based methods, we introduce the Monte Carlo Tree Search algorithm to leverage the self-reflective capabilities of large language models.
For semantic-based methods, we define four typical deceit patterns to reveal the mechanisms behind fake news creation.
arXiv Detail & Related papers (2024-08-20T14:13:54Z)
- Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations [63.52709761339949]
We first contribute a dedicated Fair Forgery Detection (FairFD) dataset, on which we demonstrate the racial bias of public state-of-the-art (SOTA) methods.
We design novel metrics, including the Approach Averaged Metric and the Utility Regularized Metric, which avoid deceptive results.
We also present an effective and robust post-processing technique, Bias Pruning with Fair Activations (BPFA), which improves fairness without requiring retraining or weight updates.
arXiv Detail & Related papers (2024-07-19T14:53:18Z)
- Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation [2.0411082897313984]
This study introduces a novel methodology that integrates human annotators and Large Language Models (LLMs).
The proposed framework combines human annotation with LLM outputs, depending on the model's uncertainty level.
The empirical results show a substantial decrease in the costs associated with data annotation while either maintaining or improving model accuracy.
arXiv Detail & Related papers (2024-06-17T21:45:48Z)
- BEADs: Bias Evaluation Across Domains [9.19312529999677]
The Bias Evaluations Across Domains (BEADs) dataset is designed to support a wide array of NLP tasks. A key focus of this paper is the gold-label dataset, which is annotated by GPT-4 for scalability. Our findings indicate that models fine-tuned on BEADs effectively identify numerous biases.
arXiv Detail & Related papers (2024-06-06T16:18:30Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection [50.07850264495737]
"Prompt-and-Align" (P&A) is a novel prompt-based paradigm for few-shot fake news detection.
We show that P&A sets a new state of the art for few-shot fake news detection by significant margins.
arXiv Detail & Related papers (2023-09-28T13:19:43Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested on a clean split with no train/test source overlap.
We suggest that future dataset creation include a simple model as a difficulty/bias probe and that future model development use a clean, non-overlapping site and date split; a minimal sketch of such a split follows this list.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
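As flagged in the last entry above, a clean, non-overlapping site and date split might look like the sketch below. The record fields and cutoff logic are assumptions for illustration, not that paper's protocol.

```python
# Illustrative sketch of a "clean" evaluation split: no news source appears
# in both train and test, and all test articles postdate all training ones.
from datetime import date

def clean_split(articles, cutoff: date, test_sources: set):
    """articles: iterable of dicts with 'source', 'date', 'text', 'label' keys."""
    train, test = [], []
    for a in articles:
        if a["source"] in test_sources and a["date"] > cutoff:
            test.append(a)        # held-out sources, post-cutoff dates only
        elif a["source"] not in test_sources and a["date"] <= cutoff:
            train.append(a)       # training sources, pre-cutoff dates only
        # anything mixing the two criteria is dropped to keep the split clean
    return train, test
```

Under such a split, a model cannot exploit source identity or temporal drift, which is exactly the leakage the >10% accuracy drop reported above points to.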