The Pushshift Reddit Dataset
- URL:
- Date: Thu, 23 Jan 2020 10:31:29 GMT
- Title: The Pushshift Reddit Dataset
- Authors: Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn
- Abstract summary: Pushshift is a social media data collection, analysis, and archiving platform that since 2015 has collected Reddit data.
Pushshift's Reddit dataset is updated in real-time, and includes historical data back to Reddit's inception.
- Score: 1.5661920010658625
- License:
- Abstract: Social media data has become crucial to the advancement of scientific
understanding. However, even though it has become ubiquitous, just collecting
large-scale social media data involves a high degree of engineering skill
and computational resources. In fact, research is oftentimes gated by data
engineering problems that must be overcome before analysis can proceed. This
has resulted in recognition of datasets as meaningful research contributions in
and of themselves. Reddit, the so-called "front page of the Internet," in
particular has been the subject of numerous scientific studies. Although Reddit
is relatively open to data acquisition compared to social media platforms like
Facebook and Twitter, the technical barriers to acquisition still remain. Thus,
Reddit's millions of subreddits, hundreds of millions of users, and hundreds of
billions of comments are at the same time relatively accessible, but time-consuming
to collect and analyze systematically. In this paper, we present the
Pushshift Reddit dataset. Pushshift is a social media data collection,
analysis, and archiving platform that since 2015 has collected Reddit data and
made it available to researchers. Pushshift's Reddit dataset is updated in
real-time, and includes historical data back to Reddit's inception. In addition
to monthly dumps, Pushshift provides computational tools to aid in searching,
aggregating, and performing exploratory analysis on the entirety of the
dataset. The Pushshift Reddit dataset makes it possible for social media
researchers to reduce time spent in the data collection, cleaning, and storage
phases of their projects.
Related papers
- Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data).
The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z)
- "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data [0.18416014644193066]
We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts.
arXiv Detail & Related papers (2024-04-29T16:43:39Z)
- Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future [59.78608958395464]
We build a Social AI Data Infrastructure, which consists of a comprehensive social AI taxonomy and a data library of 480 NLP datasets.
Our infrastructure allows us to analyze existing dataset efforts, and also evaluate language models' performance in different social intelligence aspects.
We show there is a need for multifaceted datasets, increased diversity in language and culture, more long-tailed social situations, and more interactive data in future social intelligence data efforts.
arXiv Detail & Related papers (2024-02-28T00:22:42Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- Data+Shift: Supporting visual investigation of data distribution shifts by data scientists [1.6311150636417262]
Data+Shift is a visual analytics tool to support data scientists in the task of investigating the underlying factors of shift in data features.
We validated our approach with a think-aloud experiment where a data scientist used the tool for a fraud detection use case.
arXiv Detail & Related papers (2022-04-29T11:50:25Z)
- Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions.
We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity.
Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z)
- Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data.
Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community.
Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
- Reliable and Efficient Long-Term Social Media Monitoring [4.389610557232119]
This technical report presents a cloud-based data collection, pre-processing, and archiving infrastructure.
We show how this approach works in different cloud computing architectures, and how to adapt the method to collect streaming data from other social media platforms.
arXiv Detail & Related papers (2020-05-05T19:04:56Z)
- Echo Chambers on Social Media: A comparative analysis [64.2256216637683]
We introduce an operational definition of echo chambers and perform a massive comparative analysis on 1B pieces of content produced by 1M users on four social media platforms.
We infer the leaning of users about controversial topics and reconstruct their interaction networks by analyzing different features.
We find support for the hypothesis that platforms implementing news feed algorithms, such as Facebook, may elicit the emergence of echo chambers.
arXiv Detail & Related papers (2020-04-20T20:00:27Z)
- Curating Social Media Data [0.0]
We propose a data curation pipeline, namely CrowdCorrect, to enable analysts to cleanse and curate social data.
Our pipeline provides an automatic feature extraction from a corpus of social media data using existing in-house tools.
The implementation of this pipeline also includes a set of tools for automatically creating micro-tasks to facilitate the contribution of crowd users in curating the raw data.
arXiv Detail & Related papers (2020-02-21T10:07:15Z)
- The Pushshift Telegram Dataset [1.7109522466982476]
We present a dataset from one such mobile messaging platform: Telegram.
Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users.
arXiv Detail & Related papers (2020-01-23T10:37:33Z)