Counting Protests in News Articles: A Dataset and Semi-Automated Data
Collection Pipeline
- URL: http://arxiv.org/abs/2102.00917v1
- Date: Mon, 1 Feb 2021 15:35:21 GMT
- Title: Counting Protests in News Articles: A Dataset and Semi-Automated Data
Collection Pipeline
- Authors: Tommy Leung, L. Nathan Perkins
- Abstract summary: Between January 2017 and January 2021, thousands of local news sources in the United States reported on over 42,000 protests about topics such as civil rights, immigration, guns, and the environment.
We release a dataset of news article URLs, dates, locations, crowd size estimates, and 494 discrete descriptive tags corresponding to 42,347 reported protest events in the United States between January 2017 and January 2021.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Between January 2017 and January 2021, thousands of local news sources in the
United States reported on over 42,000 protests about topics such as civil
rights, immigration, guns, and the environment. Given the vast number of local
journalists that report on protests daily, extracting these events as
structured data to understand temporal and geographic trends can empower civic
decision-making. However, the task of extracting events from news articles
presents well-known challenges to the NLP community in the fields of domain
detection, slot filling, and coreference resolution.
To help improve the resources available for extracting structured data from
news stories, our contribution is three-fold. We 1) release a manually labeled
dataset of news article URLs, dates, locations, crowd size estimates, and 494
discrete descriptive tags corresponding to 42,347 reported protest events in
the United States between January 2017 and January 2021; 2) describe the
semi-automated data collection pipeline used to discover, sort, and review the
144,568 English articles that comprise the dataset; and 3) benchmark a
long short-term memory (LSTM) low-dimensional classifier that demonstrates the
utility of processing news articles based on syntactic structures, such as
paragraphs and sentences, to count the number of reported protest events.
Related papers
- 3DLNews: A Three-decade Dataset of US Local News Articles [49.1574468325115]
3DLNews is a novel dataset with local news articles from the United States spanning the period from 1996 to 2024.
It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV, and radio stations across all 50 states.
arXiv Detail & Related papers (2024-08-08T18:33:37Z)
- A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z)
- Multi-modal News Understanding with Professionally Labelled Videos
(ReutersViLNews) [25.78619140103048]
We present a large-scale analysis on an in-house dataset collected by the Reuters News Agency, called Reuters Video-Language News (ReutersViLNews) dataset.
The dataset focuses on high-level video-language understanding with an emphasis on long-form news.
The results suggest that news-oriented videos are a substantial challenge for current video-language understanding algorithms.
arXiv Detail & Related papers (2024-01-23T00:42:04Z)
- SumREN: Summarizing Reported Speech about Events in News [51.82314543729287]
We propose the novel task of summarizing the reactions of different speakers, as expressed by their reported statements, to a given event.
We create a new multi-document summarization benchmark, SUMREN, comprising 745 summaries of reported statements from various public figures.
arXiv Detail & Related papers (2022-12-02T12:51:39Z)
- Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021 [55.41644538483948]
The goal of the shared task is to motivate the community to come up with efficient methods for solving this vital problem.
The training set contains 1300 annotated news articles -- 750 real news, 550 fake news, while the testing set contains 300 news articles -- 200 real, 100 fake news.
The best performing system obtained an F1-macro score of 0.679, which is lower than the past year's best result of 0.907 F1-macro.
arXiv Detail & Related papers (2022-07-11T18:58:36Z)
- NewsEdits: A News Article Revision Dataset and a Document-Level
Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z)
- NELA-Local: A Dataset of U.S. Local News Articles for the Study of
County-level News Ecosystems [4.977804197346136]
We present a dataset of over 1.4M online news articles from 313 local U.S. outlets.
These outlets cover a geographically diverse set of communities across the United States.
arXiv Detail & Related papers (2022-03-16T13:19:21Z)
- A German Corpus for Fine-Grained Named Entity Recognition and Relation
Extraction of Traffic and Industry Events [63.08899104652265]
This work describes a corpus of German-language documents which has been annotated with fine-grained geo-entities.
It has also been annotated with a set of 15 traffic- and industry-related n-ary relations and events.
The corpus consists of newswire texts, Twitter messages, and traffic reports from radio stations, police and railway companies.
arXiv Detail & Related papers (2020-04-07T11:39:50Z)