Related papers: iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023

URL: http://arxiv.org/abs/2405.10233v1
Date: Thu, 16 May 2024 16:34:03 GMT
Title: iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023
Authors: Jay Patel, Pujan Paudel, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn
Abstract summary: We release a large-scale dataset from Scored, an alternative Reddit platform. The dataset includes at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. We provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model.
Score: 22.685953309889825
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Online web communities often face bans for violating platform policies, encouraging their migration to alternative platforms. This migration, however, can result in increased toxicity and unforeseen consequences on the new platform. In recent years, researchers have collected data from many alternative platforms, indicating coordinated efforts leading to offline events, conspiracy movements, hate speech propagation, and harassment. Thus, it becomes crucial to characterize and understand these alternative platforms. To advance research in this direction, we collect and release a large-scale dataset from Scored -- an alternative Reddit platform that sheltered banned fringe communities, for example, c/TheDonald (a prominent right-wing community) and c/GreatAwakening (a conspiratorial community). Over four years, we collected approximately 57M posts from Scored, with at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. Furthermore, we provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model, to further advance the field in characterizing the discussions within these communities. We aim to provide these resources to facilitate their investigations without the need for extensive data collection and processing efforts.

Related papers

Reddit Deplatforming and Toxicity Dynamics on Generalist Voat Communities [73.88859384645264]
Deplatforming, the permanent banning of entire communities, is a primary tool for content moderation on mainstream platforms. We analyze four major Reddit ban waves (2015--2020) and their effects on generalist communities on Voat.
arXiv Detail & Related papers (2025-12-26T19:13:45Z)
AI Didn't Start the Fire: Examining the Stack Exchange Moderator and Contributor Strike [6.538542549579634]
We investigate a conflict between the Stack Exchange platform and community that occurred in 2023 around an emergency arising from the release of large language models (LLMs). We show how the 2023 conflict was preceded by a long-term deterioration in the community-platform relationship driven in particular by the platform's disregard for the community's highly-valued participatory role in governance. We recommend ways that platforms and communities can institute participatory governance to be durable and effective.
arXiv Detail & Related papers (2025-12-09T18:19:42Z)
Community Moderation and the New Epistemology of Fact Checking on Social Media [124.26693978503339]
Social media platforms have traditionally relied on independent fact-checking organizations to identify and flag misleading content. X (formerly Twitter) and Meta have shifted towards community-driven content moderation by launching their own versions of crowd-sourced fact-checking. We examine the current approaches to misinformation detection across major platforms, explore the emerging role of community-driven moderation, and critically evaluate both the promises and challenges of crowd-checking at scale.
arXiv Detail & Related papers (2025-05-26T14:50:18Z)
Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users. The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z)
Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election [49.35115948941981]
We present Uchaguzi-2022, a dataset of 14k categorized and geotagged citizen reports related to the 2022 Kenyan General Election. We use this dataset to investigate whether language models can assist in scalably categorizing and geotagging reports, thus highlighting its potential application in the AI for Social Good space.
arXiv Detail & Related papers (2024-12-17T17:08:35Z)
Characterizing the Fragmentation of the Social Media Ecosystem [39.58317527488534]
We use a dataset of 126M URLs posted by nearly 6M users on nine social media platforms. We find a clear separation between mainstream and alt-tech platforms. These findings outline the main dimensions defining the fragmentation and polarization of the social media ecosystem.
arXiv Detail & Related papers (2024-11-25T18:45:03Z)
Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data). The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z)
On the Use of Proxies in Political Ad Targeting [49.61009579554272]
We show that major political advertisers circumvented mitigations by targeting proxy attributes. Our findings have crucial implications for the ongoing discussion on the regulation of political advertising.
arXiv Detail & Related papers (2024-10-18T17:15:13Z)
MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection [2.433983268807517]
Hate speech poses significant social, psychological, and occasionally physical threats to targeted individuals and communities. Current computational linguistic approaches for tackling this phenomenon rely on labelled social media datasets for training. We scrutinized over 60 datasets, selectively integrating those pertinent into MetaHate. Our findings contribute to a deeper understanding of the existing datasets, paving the way for training more robust and adaptable models.
arXiv Detail & Related papers (2024-01-12T11:54:53Z)
Design and analysis of tweet-based election models for the 2021 Mexican legislative election [55.41644538483948]
We use a dataset of 15 million election-related tweets in the six months preceding election day. We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods.
arXiv Detail & Related papers (2023-01-02T12:40:05Z)
Understanding Online Migration Decisions Following the Banning of Radical Communities [0.2752817022620644]
We study how factors associated with the RECRO radicalization framework relate to users' migration decisions. Our results show that individual-level factors, those relating to the behavior of users, are associated with the decision to post on the fringe platform.
arXiv Detail & Related papers (2022-12-09T10:43:15Z)
"I Can't Keep It Up." A Dataset from the Defunct Voat.co News Aggregator [0.0]
Voat.co was a news aggregator website that shut down on December 25, 2020. This paper presents a dataset with over 2.3M submissions and 16.2M comments posted from 113K users in 7.1K subverses.
arXiv Detail & Related papers (2022-01-15T23:25:53Z)
This Must Be the Place: Predicting Engagement of Online Communities in a Large-scale Distributed Campaign [70.69387048368849]
We study the behavior of communities with millions of active members. We develop a hybrid model, combining textual cues, community meta-data, and structural properties. We demonstrate the applicability of our model through Reddit's r/place, a large-scale online experiment.
arXiv Detail & Related papers (2022-01-14T08:23:16Z)
News consumption and social media regulations policy [70.31753171707005]
We analyze two social media platforms that enforced opposite moderation methods, Twitter and Gab, to assess the interplay between news consumption and content regulation. Our results show that the presence of moderation pursued by Twitter produces a significant reduction of questionable content. The lack of clear regulation on Gab results in the tendency of the user to engage with both types of content, showing a slight preference for the questionable ones which may account for a dissing/endorsement behavior.
arXiv Detail & Related papers (2021-06-07T19:26:32Z)
Do Platform Migrations Compromise Content Moderation? Evidence from r/The_Donald and r/Incels [20.41491269475746]
We report the results of a large-scale observational study of how problematic online communities progress following community-level moderation measures. Our results suggest that, in both cases, moderation measures significantly decreased posting activity on the new platform. In spite of that, users in one of the studied communities showed increases in signals associated with toxicity and radicalization.
arXiv Detail & Related papers (2020-10-20T16:03:06Z)
An Iterative Approach for Identifying Complaint Based Tweets in Social Media Platforms [76.9570531352697]
We propose an iterative methodology which aims to identify complaint based posts pertaining to the transport domain. We perform comprehensive evaluations along with releasing a novel dataset for the research purposes.
arXiv Detail & Related papers (2020-01-24T22:23:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.