NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge
- URL: http://arxiv.org/abs/2206.07106v1
- Date: Tue, 14 Jun 2022 18:47:13 GMT
- Title: NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge
- Authors: Alexander Spangher, Xiang Ren, Jonathan May and Nanyun Peng
- Abstract summary: NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
- Score: 122.37011526554403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: News article revision histories provide clues to narrative and factual
evolution in news articles. To facilitate analysis of this evolution, we
present the first publicly available dataset of news revision histories,
NewsEdits. Our dataset is large-scale and multilingual; it contains 1.2 million
articles with 4.6 million versions from over 22 English- and French-language
newspaper sources based in three countries, spanning 15 years of coverage
(2006-2021).
We define article-level edit actions: Addition, Deletion, Edit and Refactor,
and develop a high-accuracy extraction algorithm to identify these actions. To
underscore the factual nature of many edit actions, we conduct analyses showing
that added and deleted sentences are more likely to contain updating events,
main content and quotes than unchanged sentences.
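The abstract does not spell out the extraction algorithm itself. As a rough illustration only, here is a minimal Python sketch of how sentence-level edit actions could be derived by aligning two article versions with string similarity; the thresholds and the similarity measure are illustrative assumptions, not the paper's high-accuracy matcher, and Refactor (which requires tracking sentence movement) is omitted for brevity:

```python
from difflib import SequenceMatcher

def best_match(sentence, candidates):
    """Return (index, similarity) of the most similar candidate sentence."""
    scored = [(i, SequenceMatcher(None, sentence, c).ratio())
              for i, c in enumerate(candidates)]
    return max(scored, key=lambda s: s[1], default=(None, 0.0))

def edit_actions(old_version, new_version, exact=0.95, fuzzy=0.6):
    """Label sentences with Addition / Deletion / Edit / Unchanged."""
    actions = []
    matched_old = set()
    for sent in new_version:
        i, sim = best_match(sent, old_version)
        if sim >= exact:
            # effectively identical: the sentence survives unchanged
            actions.append((sent, "Unchanged"))
            matched_old.add(i)
        elif sim >= fuzzy:
            # same sentence, reworded in place
            actions.append((sent, "Edit"))
            matched_old.add(i)
        else:
            # no plausible source sentence in the old version
            actions.append((sent, "Addition"))
    for i, sent in enumerate(old_version):
        if i not in matched_old:
            # old sentence with no counterpart in the new version was cut
            actions.append((sent, "Deletion"))
    return actions

old = ["Officials said two people were injured.", "The cause is unknown."]
new = ["Officials said three people were injured.",
       "Rescue crews remain on the scene."]
for sent, action in edit_actions(old, new):
    print(f"{action:9} {sent}")
```

On this toy pair, the reworded casualty count is labeled an Edit, the new sentence an Addition, and the dropped sentence a Deletion.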
Finally, to explore whether edit actions are predictable, we introduce three
novel tasks aimed at predicting actions performed during version updates. We
show that these tasks are possible for expert humans but are challenging for
large NLP models. We hope this can spur research in narrative framing and help
provide predictive tools for journalists chasing breaking news.
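The three task definitions are not reproduced in the abstract. One plausible framing, sketched here under the assumption of a simple sentence-classification setup and a hypothetical version-history structure, is to label each sentence transition between consecutive versions (reusing `edit_actions()` from the sketch above):

```python
def build_examples(version_history):
    """Turn consecutive article versions into (sentence, action) pairs."""
    examples = []
    for old, new in zip(version_history, version_history[1:]):
        # label each version-to-version transition with an edit action
        for sent, action in edit_actions(old, new):
            examples.append({"text": sent, "label": action})
    return examples

history = [
    ["A fire broke out downtown.", "No injuries were reported."],
    ["A fire broke out downtown.",
     "Two injuries were reported.",
     "Firefighters contained the blaze by noon."],
]
for ex in build_examples(history):
    print(ex["label"], "->", ex["text"])
```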
Related papers
- Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models [11.597314728459573]
We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.
We propose STORM, a writing system for the Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking.
arXiv Detail & Related papers (2024-02-22T01:20:17Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- Text2Time: Transformer-based Article Time Period Prediction [0.11470070927586018]
This work investigates the problem of predicting the publication period of a text document, specifically a news article, based on its textual content.
We create our own extensive labeled dataset of over 350,000 news articles published by The New York Times over six decades.
In our approach, we use a pretrained BERT model fine-tuned for the task of text classification, specifically for time period prediction.
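Based on that description, a minimal sketch of such a classifier setup, assuming the Hugging Face transformers API and an illustrative decade-level label set (the paper's actual label granularity and training details are not reproduced here):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical label set: one class per decade of coverage.
DECADES = ["1960s", "1970s", "1980s", "1990s", "2000s", "2010s"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(DECADES))

text = "The president addressed the nation on television last night."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is randomly initialized, so this prediction is
# meaningless until the model is fine-tuned on labeled articles.
print(DECADES[logits.argmax(dim=-1).item()])
```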
arXiv Detail & Related papers (2023-04-21T10:05:03Z)
- Designing and Evaluating Interfaces that Highlight News Coverage Diversity Using Discord Questions [84.55145223950427]
This paper shows that navigating large source collections for a news story can be challenging without further guidance.
We design three interfaces -- the Annotated Article, the Recomposed Article, and the Question Grid -- aimed at accompanying news readers in discovering coverage diversity while they read.
arXiv Detail & Related papers (2023-02-17T16:59:31Z)
- Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis that cross-lingual evidence is a useful feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z)
- No News is Good News: A Critique of the One Billion Word Benchmark [4.396860522241306]
The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl.
We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift.
arXiv Detail & Related papers (2021-10-25T02:41:27Z)
- NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing) [89.77347919191774]
NewsEdits is the first publicly available dataset of news article revision histories.
It contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2021-04-19T21:15:30Z)
- Viable Threat on News Reading: Generating Biased News Using Natural Language Models [49.90665530780664]
We show that publicly available language models can reliably generate biased news content conditioned on an original news article.
We also show that a large number of high-quality biased news articles can be generated using controllable text generation.
arXiv Detail & Related papers (2020-10-05T16:55:39Z)
- CompRes: A Dataset for Narrative Structure in News [2.4578723416255754]
We introduce CompRes -- the first dataset for narrative structure in news media.
We use the annotated dataset to train several supervised models to identify the different narrative elements.
arXiv Detail & Related papers (2020-07-09T15:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.