A Python Tool for Reconstructing Full News Text from GDELT
- URL: http://arxiv.org/abs/2504.16063v1
- Date: Tue, 22 Apr 2025 17:40:42 GMT
- Title: A Python Tool for Reconstructing Full News Text from GDELT
- Authors: A. Fronzetti Colladon, R. Vestrelli,
- Abstract summary: This paper presents a novel approach to obtaining full-text newspaper articles at near-zero cost.<n>We focus on the GDELT Web News NGrams 3.0 dataset, which provides high-frequency updates of n-grams extracted from global online news sources.<n>We provide Python code to reconstruct full-text articles from these n-grams by identifying overlapping textual fragments and intelligently merging them.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: News data have become an essential resource across various disciplines, including economics, finance, management, social sciences, and computer science. Researchers leverage newspaper articles to study economic trends, market dynamics, corporate strategies, public perception, political discourse, and the evolution of public opinion. Additionally, news datasets have been instrumental in training large-scale language models, with applications in sentiment analysis, fake news detection, and automated news summarization. Despite their significance, access to comprehensive news corpora remains a key challenge. Many full-text news providers, such as Factiva and LexisNexis, require costly subscriptions, while free alternatives often suffer from incomplete data and transparency issues. This paper presents a novel approach to obtaining full-text newspaper articles at near-zero cost by leveraging data from the Global Database of Events, Language, and Tone (GDELT). Specifically, we focus on the GDELT Web News NGrams 3.0 dataset, which provides high-frequency updates of n-grams extracted from global online news sources. We provide Python code to reconstruct full-text articles from these n-grams by identifying overlapping textual fragments and intelligently merging them. Our method enables researchers to access structured, large-scale newspaper data for text analysis while overcoming the limitations of existing proprietary datasets. The proposed approach enhances the accessibility of news data for empirical research, facilitating applications in economic forecasting, computational social science, and natural language processing.
Related papers
- Bridging the Data Provenance Gap Across Text, Speech and Video [67.72097952282262]
We conduct the largest and first-of-its-kind longitudinal audit across modalities of popular text, speech, and video datasets.
Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.
We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets.
arXiv Detail & Related papers (2024-12-19T01:30:19Z) - Online Digital Investigative Journalism using SociaLens [0.0]
We introduce a versatile and autonomous investigative journalism tool, called em SociaLens, for identifying and extracting query specific data from online sources.
We envision its use in investigative journalism, law enforcement and social policy planning.
We illustrate the functionality of SociaLens using a focused case study on rape incidents in a developing country.
arXiv Detail & Related papers (2024-10-13T07:20:47Z) - Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach [0.0]
This paper presents a novel approach to financial news processing that leverages Large Language Models (LLMs)
We introduce a system that extracts relevant company tickers from raw news article content, performs sentiment analysis at the company level, and generates summaries.
We are the first data provider to offer granular, per-company sentiment analysis from news articles, enhancing the depth of information available to market participants.
arXiv Detail & Related papers (2024-07-22T16:47:31Z) - SciNews: From Scholarly Complexities to Public Narratives -- A Dataset for Scientific News Report Generation [16.61347730523143]
We present a new corpus to facilitate the automated generation of scientific news reports.<n>Our dataset comprises academic publications and their corresponding scientific news reports across nine disciplines.<n>We benchmark our dataset employing state-of-the-art text generation models.
arXiv Detail & Related papers (2024-03-26T14:54:48Z) - Adapting Fake News Detection to the Era of Large Language Models [48.5847914481222]
We study the interplay between machine-(paraphrased) real news, machine-generated fake news, human-written fake news, and human-written real news.
Our experiments reveal an interesting pattern that detectors trained exclusively on human-written articles can indeed perform well at detecting machine-generated fake news, but not vice versa.
arXiv Detail & Related papers (2023-11-02T08:39:45Z) - Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News
Detection [50.07850264495737]
"Prompt-and-Align" (P&A) is a novel prompt-based paradigm for few-shot fake news detection.
We show that P&A sets new states-of-the-art for few-shot fake news detection performance by significant margins.
arXiv Detail & Related papers (2023-09-28T13:19:43Z) - fakenewsbr: A Fake News Detection Platform for Brazilian Portuguese [0.6775616141339018]
This paper presents a comprehensive study on detecting fake news in Brazilian Portuguese.
We propose a machine learning-based approach that leverages natural language processing techniques, including TF-IDF and Word2Vec.
We develop a user-friendly web platform, fakenewsbr.com, to facilitate the verification of news articles' veracity.
arXiv Detail & Related papers (2023-09-20T04:10:03Z) - American Stories: A Large-Scale Structured Text Dataset of Historical
U.S. Newspapers [7.161822501147275]
This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images.
It applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection.
The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes.
arXiv Detail & Related papers (2023-08-24T00:24:42Z) - Identifying Informational Sources in News Articles [109.70475599552523]
We build the largest and widest-ranging annotated dataset of informational sources used in news writing.
We introduce a novel task, source prediction, to study the compositionality of sources in news articles.
arXiv Detail & Related papers (2023-05-24T08:56:35Z) - ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.<n>To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.<n>Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z) - The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z) - NewsEdits: A News Article Revision Dataset and a Document-Level
Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.