News Category Dataset
- URL: http://arxiv.org/abs/2209.11429v1
- Date: Fri, 23 Sep 2022 06:13:16 GMT
- Title: News Category Dataset
- Authors: Rishabh Misra
- Abstract summary: We present a News Category dataset that contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost.
In this paper, we produce some novel insights from the dataset and describe various existing and potential applications of our dataset.
- Score: 1.7513645771137178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People rely on news to know what is happening around the world and inform
their daily lives. In today's world, when the proliferation of fake news is
rampant, having a large-scale and high-quality source of authentic news
articles with the published category information is valuable to learning
authentic news' Natural Language syntax and semantics. As part of this work, we
present a News Category Dataset that contains around 200k news headlines from
the year 2012 to 2018 obtained from HuffPost, along with useful metadata to
enable various NLP tasks. In this paper, we also produce some novel insights
from the dataset and describe various existing and potential applications of
our dataset.
Related papers
- 3DLNews: A Three-decade Dataset of US Local News Articles [49.1574468325115]
3DLNews is a novel dataset with local news articles from the United States spanning the period from 1996 to 2024.
It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV, and radio stations across all 50 states.
arXiv Detail & Related papers (2024-08-08T18:33:37Z)
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z)
- A Multilingual Similarity Dataset for News Article Frame [14.977682986280998]
We introduce an extended version of a large labeled news article dataset with 16,687 new labeled pairs.
Our method frees the work of manual identification of frame classes in traditional news frame analysis studies.
Overall we introduce the most extensive cross-lingual news article similarity dataset available to date with 26,555 labeled news article pairs across 10 languages.
arXiv Detail & Related papers (2024-05-22T01:01:04Z)
- Adapting Fake News Detection to the Era of Large Language Models [48.5847914481222]
We study the interplay between machine-(paraphrased) real news, machine-generated fake news, human-written fake news, and human-written real news.
Our experiments reveal an interesting pattern that detectors trained exclusively on human-written articles can indeed perform well at detecting machine-generated fake news, but not vice versa.
arXiv Detail & Related papers (2023-11-02T08:39:45Z)
- fakenewsbr: A Fake News Detection Platform for Brazilian Portuguese [0.6775616141339018]
This paper presents a comprehensive study on detecting fake news in Brazilian Portuguese.
We propose a machine learning-based approach that leverages natural language processing techniques, including TF-IDF and Word2Vec.
We develop a user-friendly web platform, fakenewsbr.com, to facilitate the verification of news articles' veracity.
arXiv Detail & Related papers (2023-09-20T04:10:03Z)
- Identifying Informational Sources in News Articles [109.70475599552523]
We build the largest and widest-ranging annotated dataset of informational sources used in news writing.
We introduce a novel task, source prediction, to study the compositionality of sources in news articles.
arXiv Detail & Related papers (2023-05-24T08:56:35Z)
- Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis of the usage of cross-lingual evidence as a feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z)
- NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z)
- Islander: A Real-Time News Monitoring and Analysis System [22.67888928983199]
We present Islander, an online news analyzing system.
The system allows users to browse trending topics with articles from multiple sources and perspectives.
We define several metrics as proxies for news quality, and develop algorithms for automatic estimation.
arXiv Detail & Related papers (2022-04-25T06:20:49Z)
- Annotation-Scheme Reconstruction for "Fake News" and Japanese Fake News Dataset [1.7149364927872013]
"Fake news" is a complex phenomenon that involves a wide range of issues.
We propose a novel annotation scheme with fine-grained labeling based on detailed investigations of existing fake news datasets.
Using the annotation scheme, we construct and publish the first Japanese fake news dataset.
arXiv Detail & Related papers (2022-04-06T10:42:39Z)
- Faking Fake News for Real Fake News Detection: Propaganda-loaded Training Data Generation [105.20743048379387]
We propose a novel framework for generating training examples informed by the known styles and strategies of human-authored propaganda.
Specifically, we perform self-critical sequence training guided by natural language inference to ensure the validity of the generated articles.
Our experimental results show that fake news detectors trained on PropaNews are better at detecting human-written disinformation by 3.62 - 7.69% F1 score on two public datasets.
arXiv Detail & Related papers (2022-03-10T14:24:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.