\textit{NewsEdits}: A Dataset of Revision Histories for News Articles
(Technical Report: Data Processing)
- URL: http://arxiv.org/abs/2104.09647v1
- Date: Mon, 19 Apr 2021 21:15:30 GMT
- Authors: Alexander Spangher and Jonathan May
- Abstract summary: \textit{NewsEdits} is the first publicly available dataset of news article revision histories.
It contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: News article revision histories have the potential to give us novel insights
across varied fields of linguistics and social sciences. In this work, we
present, to our knowledge, the first publicly available dataset of news article
revision histories, or \textit{NewsEdits}.
Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430
versions from over 22 English- and French-language newspaper sources based in
three countries. Across version pairs, we count 10.9 million added sentences,
8.9 million changed sentences, and 6.8 million removed sentences. Within the
changed sentences, we derive 72 million atomic edits. \textit{NewsEdits} is, to
our knowledge, the largest corpus of revision histories of any domain.
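The abstract's version-pair statistics (added, changed, and removed sentences) come from aligning the sentences of consecutive article versions. The paper's own alignment procedure is not shown here; the following is a minimal illustrative sketch of the idea using Python's standard-library `difflib.SequenceMatcher` over sentence lists, with `diff_versions` and the example sentences being hypothetical names and data, not part of the dataset.

```python
from difflib import SequenceMatcher

def diff_versions(old_sents, new_sents):
    """Classify sentences across two article versions as added,
    removed, or changed -- an illustrative sketch of the kind of
    version-pair sentence alignment the abstract describes."""
    added, removed, changed = [], [], []
    matcher = SequenceMatcher(a=old_sents, b=new_sents, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            added.extend(new_sents[j1:j2])
        elif tag == "delete":
            removed.extend(old_sents[i1:i2])
        elif tag == "replace":
            # Pair off aligned old/new sentences as "changed";
            # any surplus on either side counts as removed/added.
            n = min(i2 - i1, j2 - j1)
            changed.extend(zip(old_sents[i1:i1 + n], new_sents[j1:j1 + n]))
            removed.extend(old_sents[i1 + n:i2])
            added.extend(new_sents[j1 + n:j2])
    return added, removed, changed

old = ["The fire began at dawn.", "Two people were hurt."]
new = ["The fire began at dawn.", "Three people were hurt.",
       "Officials are investigating."]
added, removed, changed = diff_versions(old, new)
# one changed sentence pair, one added sentence, none removed
```

Within each changed sentence pair, finer-grained "atomic edits" (the 72 million mentioned above) could then be derived by diffing at the word level rather than the sentence level.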
Related papers
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z) - MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z) - Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis of the usage of cross-lingual evidence as a feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z) - NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z) - Multilingual Open Text 1.0: Public Domain News in 44 Languages [2.642698101441705]
The first release of the corpus contains over 2.7 million news articles and 1 million shorter passages published between 2001 and 2021.
The source material is in the public domain, our collection is licensed under a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License.
arXiv Detail & Related papers (2022-01-14T18:58:17Z) - A System for Worldwide COVID-19 Information Aggregation [92.60866520230803]
We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics.
A neural machine translation module translates articles in other languages into Japanese and English.
A BERT-based topic classifier trained on our article-topic pair dataset helps users efficiently find the information they are interested in.
arXiv Detail & Related papers (2020-07-28T01:33:54Z) - FakeCovid -- A Multilingual Cross-domain Fact Check News Dataset for COVID-19 [0.0]
We present a first multilingual cross-domain dataset of 5182 fact-checked news articles for COVID-19.
We have collected the fact-checked articles from 92 different fact-checking websites after obtaining references from Poynter and Snopes.
The dataset is in 40 languages from 105 countries.
arXiv Detail & Related papers (2020-06-19T19:48:00Z) - Batch Clustering for Multilingual News Streaming [0.0]
A large volume of diverse, unorganized information makes reading difficult or nearly impossible.
We process articles per batch, looking for monolingual local topics which are then linked across time and languages.
Our system achieves monolingual state-of-the-art results on a dataset of Spanish and German news, and cross-lingual state-of-the-art results on English, Spanish, and German news.
arXiv Detail & Related papers (2020-04-17T08:59:13Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.