MN-DS: A Multilabeled News Dataset for News Articles Hierarchical
Classification
- URL: http://arxiv.org/abs/2212.12061v3
- Date: Sun, 23 Apr 2023 14:49:44 GMT
- Title: MN-DS: A Multilabeled News Dataset for News Articles Hierarchical
Classification
- Authors: Alina Petukhova, Nuno Fachada
- Abstract summary: This article presents a dataset of 10,917 news articles with hierarchical news categories collected between 1 January 2019 and 31 December 2019.
We manually labeled the articles based on a hierarchical taxonomy with 17 first-level and 109 second-level categories.
This dataset can be used to train machine learning models for automatically classifying news articles by topic.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article presents a dataset of 10,917 news articles with hierarchical
news categories collected between 1 January 2019 and 31 December 2019. We
manually labeled the articles based on a hierarchical taxonomy with 17
first-level and 109 second-level categories. This dataset can be used to train
machine learning models for automatically classifying news articles by topic.
This dataset can be helpful for researchers working on news structuring,
classification, and predicting future events based on released news.
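As a minimal sketch of how a two-level hierarchy like the one described above could be represented, the Python below maps each second-level category to its first-level parent and rolls second-level labels up to first-level counts. The `TAXONOMY` fragment and the function names `parent_index` and `first_level_counts` are hypothetical and chosen for illustration only; they are not the dataset's actual labels or file format.

```python
from collections import Counter

# Hypothetical fragment of a two-level taxonomy. The category names are
# placeholders, not the dataset's actual 17 first-level / 109 second-level labels.
TAXONOMY = {
    "sport": ["competition discipline", "drug use in sport"],
    "politics": ["election", "government policy"],
}


def parent_index(taxonomy):
    """Map each second-level category to its first-level parent."""
    index = {}
    for parent, children in taxonomy.items():
        for child in children:
            if child in index:
                # A strict hierarchy gives every child exactly one parent.
                raise ValueError(f"duplicate second-level category: {child}")
            index[child] = parent
    return index


def first_level_counts(second_level_labels, taxonomy):
    """Roll a list of second-level labels up to first-level counts."""
    index = parent_index(taxonomy)
    return Counter(index[label] for label in second_level_labels)
```

With a mapping of this kind, a classifier trained on second-level labels can also be scored at the first level by looking up each prediction's parent.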
Related papers
- The 2021 Tokyo Olympics Multilingual News Article Dataset [0.9749638953163389]
A total of 10,940 news articles were gathered from 1,918 different publishers covering 1,350 sub-events of the 2021 Olympics.
These articles are written in nine languages from different language families and in different scripts.
The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms.
(2025-02-10T16:38:03Z)
- NewsEdits 2.0: Learning the Intentions Behind Updating News [74.84017890548259]
As events progress, news articles often update with new information: if we are not cautious, we risk propagating outdated facts.
In this work, we hypothesize that linguistic features indicate factual fluidity, and that we can predict which facts in a news article will update using solely the text of a news article.
(2024-11-27T23:35:23Z)
- TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu [4.272315504476224]
Relevance-based headline classification can greatly aid the task of generating relevant headlines.
We present TeClass, the first-ever human-annotated Telugu news headline classification dataset.
Headlines generated by models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores.
(2024-04-17T13:07:56Z)
- A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
(2024-03-28T12:08:39Z)
- Hierarchical Multi-Label Classification of Scientific Documents [47.293189105900524]
We introduce a new dataset for hierarchical multi-label text classification of scientific papers called SciHTC.
This dataset contains 186,160 papers and 1,233 categories from the ACM CCS tree.
Our best model achieves a Macro-F1 score of 34.57% which shows that this dataset provides significant research opportunities.
(2022-11-05T04:12:57Z)
- News Category Dataset [1.7513645771137178]
We present a News Category dataset that contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost.
In this paper, we produce some novel insights from the dataset and describe various existing and potential applications of our dataset.
(2022-09-23T06:13:16Z)
- UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu [62.6928395368204]
This paper gives an overview of the first shared task at FIRE 2020 on fake news detection in the Urdu language.
The goal is to identify fake news using a dataset composed of 900 annotated news articles for training and 400 news articles for testing.
The dataset contains news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business.
(2022-07-25T03:46:51Z)
- NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
(2022-06-14T18:47:13Z)
- N15News: A New Dataset for Multimodal News Classification [7.846107230241092]
We propose a new dataset, N15News, which is generated from the New York Times with 15 categories and contains both text and image information for each news item.
We design a novel multitask multimodal network with different fusion methods, and experiments show multimodal news classification performs better than text-only news classification.
(2021-08-30T15:46:09Z)
- 365 Dots in 2019: Quantifying Attention of News Sources [69.50862982117125]
We measure the overlap of topics of online news articles from a variety of sources.
We score news stories according to the degree of attention in near-real time.
This can enable multiple studies, including identifying topics that receive the most attention.
(2020-03-22T20:32:47Z)