Related papers: The Pushshift Reddit Dataset

URL: http://arxiv.org/abs/2001.08435v1
Date: Thu, 23 Jan 2020 10:31:29 GMT
Title: The Pushshift Reddit Dataset
Authors: Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn
Abstract summary: Pushshift is a social media data collection, analysis, and archiving platform that has collected Reddit data since 2015. Pushshift's Reddit dataset is updated in real time and includes historical data back to Reddit's inception.
Score: 1.5661920010658625
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social media data has become crucial to the advancement of scientific understanding. However, even though it has become ubiquitous, just collecting large-scale social media data requires a high degree of engineering skill and computational resources. In fact, research is often gated by data engineering problems that must be overcome before analysis can proceed. This has resulted in the recognition of datasets as meaningful research contributions in and of themselves. Reddit, the so-called "front page of the Internet," in particular has been the subject of numerous scientific studies. Although Reddit is relatively open to data acquisition compared to social media platforms like Facebook and Twitter, technical barriers to acquisition remain. Thus, Reddit's millions of subreddits, hundreds of millions of users, and hundreds of billions of comments are relatively accessible yet time-consuming to collect and analyze systematically. In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that since 2015 has collected Reddit data and made it available to researchers. Pushshift's Reddit dataset is updated in real time and includes historical data back to Reddit's inception. In addition to monthly dumps, Pushshift provides computational tools to aid in searching, aggregating, and performing exploratory analysis on the entirety of the dataset. The Pushshift Reddit dataset makes it possible for social media researchers to reduce time spent in the data collection, cleaning, and storage phases of their projects.

Related papers

A Datalake for Data-driven Social Science Research [2.285735909183272]
We present a Datalake infrastructure tailored to the needs of interdisciplinary social science research. Our system supports ingestion and integration of diverse data types, automatic provenance and version tracking, role-based access control, and built-in tools for visualization and analysis. We argue that such infrastructure can democratize access to advanced data science practices, especially for NGOs, students, and grassroots organizations.
arXiv Detail & Related papers (2025-12-02T06:40:47Z)
WikiReddit: Tracing Information and Attention Flows Between Online Platforms [0.0]
This dataset captures all Wikipedia mentions and links shared in posts and comments on Reddit 2020-2023, excluding those from private and NSFW subreddits. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
arXiv Detail & Related papers (2025-02-07T14:03:46Z)
Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users. The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z)
Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data). The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z)
"I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data [0.18416014644193066]
We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts.
arXiv Detail & Related papers (2024-04-29T16:43:39Z)
Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future [59.78608958395464]
We build a Social AI Data Infrastructure, which consists of a comprehensive social AI taxonomy and a data library of 480 NLP datasets. Our infrastructure allows us to analyze existing dataset efforts, and also evaluate language models' performance in different social intelligence aspects. We show there is a need for multifaceted datasets, increased diversity in language and culture, more long-tailed social situations, and more interactive data in future social intelligence data efforts.
arXiv Detail & Related papers (2024-02-28T00:22:42Z)
ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information. To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles. Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
Data+Shift: Supporting visual investigation of data distribution shifts by data scientists [1.6311150636417262]
Data+Shift is a visual analytics tool to support data scientists in the task of investigating the underlying factors of shift in data features. We validated our approach with a think-aloud experiment where a data scientist used the tool for a fraud detection use case.
arXiv Detail & Related papers (2022-04-29T11:50:25Z)
Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z)
Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community. Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
Reliable and Efficient Long-Term Social Media Monitoring [4.389610557232119]
This technical report presents a cloud-based data collection, pre-processing, and archiving infrastructure. We show how this approach works in different cloud computing architectures, and how to adapt the method to collect streaming data from other social media platforms.
arXiv Detail & Related papers (2020-05-05T19:04:56Z)
Echo Chambers on Social Media: A comparative analysis [64.2256216637683]
We introduce an operational definition of echo chambers and perform a massive comparative analysis on 1B pieces of content produced by 1M users on four social media platforms. We infer the leaning of users about controversial topics and reconstruct their interaction networks by analyzing different features. We find support for the hypothesis that platforms implementing news feed algorithms like Facebook may elicit the emergence of echo chambers.
arXiv Detail & Related papers (2020-04-20T20:00:27Z)
Curating Social Media Data [0.0]
We propose a data curation pipeline, namely CrowdCorrect, to enable analysts to cleanse and curate social data. Our pipeline provides automatic feature extraction from a corpus of social media data using existing in-house tools. The implementation of this pipeline also includes a set of tools for automatically creating micro-tasks to facilitate the contribution of crowd users in curating the raw data.
arXiv Detail & Related papers (2020-02-21T10:07:15Z)
The Pushshift Telegram Dataset [1.7109522466982476]
We present a dataset from one such mobile messaging platform: Telegram. Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users.
arXiv Detail & Related papers (2020-01-23T10:37:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.