\textit{NewsEdits}: A Dataset of Revision Histories for News Articles
(Technical Report: Data Processing)
- URL: http://arxiv.org/abs/2104.09647v1
- Date: Mon, 19 Apr 2021 21:15:30 GMT
- Authors: Alexander Spangher and Jonathan May
- Abstract summary: \textit{NewsEdits} is the first publicly available dataset of news article revision histories.
It contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: News article revision histories have the potential to give us novel insights
across varied fields of linguistics and social sciences. In this work, we
present, to our knowledge, the first publicly available dataset of news article
revision histories, or \textit{NewsEdits}.
Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430
versions from over 22 English- and French-language newspaper sources based in
three countries. Across version pairs, we count 10.9 million added sentences,
8.9 million changed sentences, and 6.8 million removed sentences. Within the
changed sentences, we derive 72 million atomic edits. \textit{NewsEdits} is, to
our knowledge, the largest corpus of revision histories of any domain.
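The abstract's version-pair statistics (added, changed, and removed sentences) come from aligning the sentences of consecutive article versions. The paper's own alignment procedure is not shown here; the following is a minimal illustrative sketch of the idea using Python's standard-library `difflib.SequenceMatcher` over sentence lists, with `diff_versions` and the example sentences being hypothetical names and data, not part of the dataset.

```python
from difflib import SequenceMatcher

def diff_versions(old_sents, new_sents):
    """Classify sentences across two article versions as added,
    removed, or changed -- an illustrative sketch of the kind of
    version-pair sentence alignment the abstract describes."""
    added, removed, changed = [], [], []
    matcher = SequenceMatcher(a=old_sents, b=new_sents, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            added.extend(new_sents[j1:j2])
        elif tag == "delete":
            removed.extend(old_sents[i1:i2])
        elif tag == "replace":
            # Pair off aligned old/new sentences as "changed";
            # any surplus on either side counts as removed/added.
            n = min(i2 - i1, j2 - j1)
            changed.extend(zip(old_sents[i1:i1 + n], new_sents[j1:j1 + n]))
            removed.extend(old_sents[i1 + n:i2])
            added.extend(new_sents[j1 + n:j2])
    return added, removed, changed

old = ["The fire began at dawn.", "Two people were hurt."]
new = ["The fire began at dawn.", "Three people were hurt.",
       "Officials are investigating."]
added, removed, changed = diff_versions(old, new)
# one changed sentence pair, one added sentence, none removed
```

Within each changed sentence pair, finer-grained "atomic edits" (the 72 million mentioned above) could then be derived by diffing at the word level rather than the sentence level.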
Related papers
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z) - MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z) - Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis of the usage of cross-lingual evidence as a feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z) - NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z) - Multilingual Open Text 1.0: Public Domain News in 44 Languages [2.642698101441705]
The first release of the corpus contains over 2.7 million news articles and 1 million shorter passages published between 2001 and 2021.
The source material is in the public domain, our collection is licensed under a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License.
arXiv Detail & Related papers (2022-01-14T18:58:17Z) - A System for Worldwide COVID-19 Information Aggregation [92.60866520230803]
We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics.
A neural machine translation module translates articles in other languages into Japanese and English.
A BERT-based topic classifier trained on our article-topic pair dataset helps users efficiently find the information they are interested in.
arXiv Detail & Related papers (2020-07-28T01:33:54Z) - FakeCovid -- A Multilingual Cross-domain Fact Check News Dataset for COVID-19 [0.0]
We present a first multilingual cross-domain dataset of 5182 fact-checked news articles for COVID-19.
We have collected the fact-checked articles from 92 different fact-checking websites after obtaining references from Poynter and Snopes.
The dataset is in 40 languages from 105 countries.
arXiv Detail & Related papers (2020-06-19T19:48:00Z) - Batch Clustering for Multilingual News Streaming [0.0]
A large volume of diverse, unorganized information makes reading difficult or nearly impossible.
We process articles per batch, looking for monolingual local topics which are then linked across time and languages.
Our system achieves monolingual state-of-the-art results on a dataset of Spanish and German news, and cross-lingual state-of-the-art results on English, Spanish, and German news.
arXiv Detail & Related papers (2020-04-17T08:59:13Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.