3DLNews: A Three-decade Dataset of US Local News Articles
- URL: http://arxiv.org/abs/2408.04716v1
- Date: Thu, 08 Aug 2024 18:33:37 GMT
- Title: 3DLNews: A Three-decade Dataset of US Local News Articles
- Authors: Gangani Ariyarathne, Alexander C. Nwala
- Abstract summary: 3DLNews is a novel dataset with local news articles from the United States spanning the period from 1996 to 2024.
It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV, and radio stations across all 50 states.
- Score: 49.1574468325115
- Abstract: We present 3DLNews, a novel dataset with local news articles from the United States spanning the period from 1996 to 2024. It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV, and radio stations across all 50 states, and provides a broad snapshot of the US local news landscape. The dataset was collected by scraping Google and Twitter search results. We employed a multi-step filtering process to remove non-news article links and enriched the dataset with metadata such as the names and geo-coordinates of the source news media organizations, article publication dates, etc. Furthermore, we demonstrated the utility of 3DLNews by outlining four applications.
Related papers
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z) - A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z) - News Category Dataset [1.7513645771137178]
We present a News Category dataset that contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost.
In this paper, we produce some novel insights from the dataset and describe various existing and potential applications of our dataset.
arXiv Detail & Related papers (2022-09-23T06:13:16Z) - NewsEdits: A News Article Revision Dataset and a Document-Level
Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z) - NELA-Local: A Dataset of U.S. Local News Articles for the Study of
County-level News Ecosystems [4.977804197346136]
We present a dataset of over 1.4M online news articles from 313 local U.S. outlets.
These outlets cover a geographically diverse set of communities across the United States.
arXiv Detail & Related papers (2022-03-16T13:19:21Z) - Multilingual Open Text 1.0: Public Domain News in 44 Languages [2.642698101441705]
The first release of the corpus contains over 2.7 million news articles and 1 million shorter passages published between 2001 and 2021.
The source material is in the public domain; our collection is licensed under a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License.
arXiv Detail & Related papers (2022-01-14T18:58:17Z) - NewsEdits: A Dataset of Revision Histories for News Articles
(Technical Report: Data Processing) [89.77347919191774]
NewsEdits is the first publicly available dataset of news article revision histories.
It contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2021-04-19T21:15:30Z) - Counting Protests in News Articles: A Dataset and Semi-Automated Data
Collection Pipeline [0.0]
Between January 2017 and January 2021, thousands of local news sources in the United States reported on over 42,000 protests about topics such as civil rights, immigration, guns, and the environment.
We release a dataset of news article URLs, dates, locations, crowd size estimates, and 494 discrete descriptive tags corresponding to 42,347 reported protest events in the United States between January 2017 and January 2021.
arXiv Detail & Related papers (2021-02-01T15:35:21Z) - 365 Dots in 2019: Quantifying Attention of News Sources [69.50862982117125]
We measure the overlap of topics of online news articles from a variety of sources.
We score news stories according to the degree of attention in near-real time.
This can enable multiple studies, including identifying topics that receive the most attention.
arXiv Detail & Related papers (2020-03-22T20:32:47Z) - HoaxItaly: a collection of Italian disinformation and fact-checking
stories shared on Twitter in 2019 [72.96986027203377]
The dataset also includes the title and body of approximately 37k news articles.
It is publicly available at https://doi.org/10.7910/DVN/PGVDHX.
arXiv Detail & Related papers (2020-01-29T16:14:47Z)