SPICED: News Similarity Detection Dataset with Multiple Topics and
Complexity Levels
- URL: http://arxiv.org/abs/2309.13080v1
- Date: Thu, 21 Sep 2023 10:55:26 GMT
- Title: SPICED: News Similarity Detection Dataset with Multiple Topics and
Complexity Levels
- Authors: Elena Shushkevich, Long Mai, Manuel V. Loureiro, Steven Derby, Tri
Kurniawan Wijaya
- Abstract summary: We propose a new dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports.
We present four distinct approaches for generating news pairs, which are used in the creation of datasets specifically designed for the news similarity detection task.
- Score: 14.073585972409756
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Nowadays, the use of intelligent systems to detect redundant information in
news articles has become especially prevalent with the proliferation of news
media outlets in order to enhance user experience. However, the heterogeneous
nature of news can lead to spurious findings in these systems: simple
heuristics, such as whether two news articles are both about politics, can provide
strong but deceptive downstream performance. Segmenting news similarity
datasets into topics improves the training of these models by forcing them to
learn how to distinguish salient characteristics under more narrow domains.
However, this requires the existence of topic-specific datasets, which are
currently lacking. In this article, we propose a new dataset of similar news,
SPICED, which includes seven topics: Crime & Law, Culture & Entertainment,
Disasters & Accidents, Economy & Business, Politics & Conflicts, Science &
Technology, and Sports. Furthermore, we present four distinct approaches for
generating news pairs, which are used in the creation of datasets specifically
designed for the news similarity detection task. We benchmarked the created
datasets using MinHash, BERT, SBERT, and SimCSE models.
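Of the four benchmarked models, MinHash is the only one that needs no trained weights, so it is easy to illustrate. The following is a minimal sketch of MinHash-based similarity for a pair of news texts, not the authors' benchmark configuration: the 3-word shingles, the 128 hash functions, the seeded SHA-1 hashing, and all function names are assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed setting)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """Hash a shingle to a 64-bit integer; each seed acts as a separate hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(text, num_perm=128, k=3):
    """Keep the minimum hash value of the shingle set under each seeded hash function."""
    shingle_set = shingles(text, k)
    return [min(_seeded_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from SPICED.
    a = "The central bank raised interest rates by half a point on Thursday"
    b = "The central bank raised interest rates by half a point on Friday"
    c = "The home team won the championship after a dramatic penalty shootout"
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
    print(estimate_jaccard(minhash_signature(a), minhash_signature(c)))
```

Because the score depends only on surface word overlap, a near-duplicate pair scores high and a pair from unrelated topics scores near zero. This makes it a natural lexical baseline next to the embedding-based models (BERT, SBERT, SimCSE) named in the abstract.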
Related papers
- A Multilingual Similarity Dataset for News Article Frame [14.977682986280998]
We introduce an extended version of a large labeled news article dataset with 16,687 new labeled pairs.
Our method frees the work of manual identification of frame classes in traditional news frame analysis studies.
Overall we introduce the most extensive cross-lingual news article similarity dataset available to date with 26,555 labeled news article pairs across 10 languages.
arXiv Detail & Related papers (2024-05-22T01:01:04Z)
- From Nuisance to News Sense: Augmenting the News with Cross-Document Evidence and Context [25.870137795858522]
We present NEWSSENSE, a novel sensemaking tool and reading interface designed to collect and integrate information from multiple news articles on a central topic.
NEWSSENSE augments a central, grounding article of the user's choice by linking it to related articles from different sources.
Our pilot study shows that NEWSSENSE has the potential to help users identify key information, verify the credibility of news articles, and explore different perspectives.
arXiv Detail & Related papers (2023-10-06T21:15:11Z)
- Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection [50.07850264495737]
"Prompt-and-Align" (P&A) is a novel prompt-based paradigm for few-shot fake news detection.
We show that P&A sets new states-of-the-art for few-shot fake news detection performance by significant margins.
arXiv Detail & Related papers (2023-09-28T13:19:43Z)
- Unsupervised Domain-agnostic Fake News Detection using Multi-modal Weak Signals [19.22829945777267]
This work proposes an effective framework for unsupervised fake news detection, which first embeds the knowledge available in four modalities in news records.
Also, we propose a novel technique to construct news datasets minimizing the latent biases in existing news datasets.
We trained the proposed unsupervised framework using LUND-COVID to exploit the potential of large datasets.
arXiv Detail & Related papers (2023-05-18T23:49:31Z)
- Towards Corpus-Scale Discovery of Selection Biases in News Coverage: Comparing What Sources Say About Entities as a Start [65.28355014154549]
This paper investigates the challenges of building scalable NLP systems for discovering patterns of media selection biases directly from news content in massive-scale news corpora.
We show the capabilities of the framework through a case study on NELA-2020, a corpus of 1.8M news articles in English from 519 news sources worldwide.
arXiv Detail & Related papers (2023-04-06T23:36:45Z)
- Nothing Stands Alone: Relational Fake News Detection with Hypergraph Neural Networks [49.29141811578359]
We propose to leverage a hypergraph to represent group-wise interaction among news, while focusing on important news relations with its dual-level attention mechanism.
Our approach yields remarkable performance and maintains the high performance even with a small subset of labeled news data.
arXiv Detail & Related papers (2022-12-24T00:19:32Z)
- Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis of the usage of cross-lingual evidence as a feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z)
- Fake News Quick Detection on Dynamic Heterogeneous Information Networks [3.599616699656401]
We propose a novel Dynamic Heterogeneous Graph Neural Network (DHGNN) for fake news quick detection.
We first implement BERT and fine-tuned BERT to get a semantic representation of the news article contents and author profiles.
Then, we construct the heterogeneous news-author graph to reflect contextual information and relationships.
arXiv Detail & Related papers (2022-05-14T11:23:25Z)
- Faking Fake News for Real Fake News Detection: Propaganda-loaded Training Data Generation [105.20743048379387]
We propose a novel framework for generating training examples informed by the known styles and strategies of human-authored propaganda.
Specifically, we perform self-critical sequence training guided by natural language inference to ensure the validity of the generated articles.
Our experimental results show that fake news detectors trained on PropaNews are better at detecting human-written disinformation by 3.62 - 7.69% F1 score on two public datasets.
arXiv Detail & Related papers (2022-03-10T14:24:19Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
- Adversarial Active Learning based Heterogeneous Graph Neural Network for Fake News Detection [18.847254074201953]
We propose a novel fake news detection framework, namely Adversarial Active Learning-based Heterogeneous Graph Neural Network (AA-HGNN).
AA-HGNN utilizes an active learning framework to enhance learning performance, especially when facing the paucity of labeled data.
Experiments with two real-world fake news datasets show that our model can outperform text-based models and other graph-based models.
arXiv Detail & Related papers (2021-01-27T05:05:25Z)