iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023
- URL: http://arxiv.org/abs/2405.10233v1
- Date: Thu, 16 May 2024 16:34:03 GMT
- Title: iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023
- Authors: Jay Patel, Pujan Paudel, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn
- Abstract summary: We release a large-scale dataset from Scored, an alternative Reddit platform.
The dataset includes at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception.
We provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model.
- Score: 22.685953309889825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online web communities often face bans for violating platform policies, encouraging their migration to alternative platforms. This migration, however, can result in increased toxicity and unforeseen consequences on the new platform. In recent years, researchers have collected data from many alternative platforms, indicating coordinated efforts leading to offline events, conspiracy movements, hate speech propagation, and harassment. Thus, it becomes crucial to characterize and understand these alternative platforms. To advance research in this direction, we collect and release a large-scale dataset from Scored -- an alternative Reddit platform that sheltered banned fringe communities, for example, c/TheDonald (a prominent right-wing community) and c/GreatAwakening (a conspiratorial community). Over four years, we collected approximately 57M posts from Scored, with at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. Furthermore, we provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model, to further advance the field in characterizing the discussions within these communities. We aim to provide these resources to facilitate their investigations without the need for extensive data collection and processing efforts.
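As an illustration of how precomputed sentence embeddings of posts might be used to characterize discussions, the following is a minimal sketch of cosine-similarity search over a set of vectors. It is not the authors' pipeline: the abstract does not specify the embedding model, vector dimensionality, or file format, so the toy 4-dimensional vectors and the helper names `cosine_similarity` and `top_k_similar` are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3):
    """Return (index, similarity) pairs for the k vectors most similar to `query`."""
    scored = [(cosine_similarity(query, vec), i) for i, vec in enumerate(embeddings)]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Hypothetical toy vectors standing in for real post embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Find the two vectors closest to the first one (which includes itself).
result = top_k_similar(toy_embeddings[0], toy_embeddings, k=2)
print(result)
```

With real embeddings, the same ranking would typically be computed with a vectorized library or an approximate nearest-neighbor index, since a pure-Python loop would be slow at the scale of tens of millions of posts.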
Related papers
- Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users.
The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z)
- Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election [49.35115948941981]
We present Uchaguzi-2022, a dataset of 14k categorized and geotagged citizen reports related to the 2022 Kenyan General Election.
We use this dataset to investigate whether language models can assist in scalably categorizing and geotagging reports, thus highlighting its potential application in the AI for Social Good space.
arXiv Detail & Related papers (2024-12-17T17:08:35Z)
- Characterizing the Fragmentation of the Social Media Ecosystem [39.58317527488534]
We use a dataset of 126M URLs posted by nearly 6M users on nine social media platforms.
We find a clear separation between mainstream and alt-tech platforms.
These findings outline the main dimensions defining the fragmentation and polarization of the social media ecosystem.
arXiv Detail & Related papers (2024-11-25T18:45:03Z)
- Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data).
The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z)
- On the Use of Proxies in Political Ad Targeting [49.61009579554272]
We show that major political advertisers circumvented mitigations by targeting proxy attributes.
Our findings have crucial implications for the ongoing discussion on the regulation of political advertising.
arXiv Detail & Related papers (2024-10-18T17:15:13Z)
- MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection [2.433983268807517]
Hate speech poses significant social, psychological, and occasionally physical threats to targeted individuals and communities.
Current computational linguistic approaches for tackling this phenomenon rely on labelled social media datasets for training.
We scrutinized over 60 datasets, selectively integrating those pertinent into MetaHate.
Our findings contribute to a deeper understanding of the existing datasets, paving the way for training more robust and adaptable models.
arXiv Detail & Related papers (2024-01-12T11:54:53Z)
- Understanding Online Migration Decisions Following the Banning of Radical Communities [0.2752817022620644]
We study how factors associated with the RECRO radicalization framework relate to users' migration decisions.
Our results show that individual-level factors, those relating to the behavior of users, are associated with the decision to post on the fringe platform.
arXiv Detail & Related papers (2022-12-09T10:43:15Z)
- This Must Be the Place: Predicting Engagement of Online Communities in a Large-scale Distributed Campaign [70.69387048368849]
We study the behavior of communities with millions of active members.
We develop a hybrid model, combining textual cues, community meta-data, and structural properties.
We demonstrate the applicability of our model through Reddit's r/place, a large-scale online experiment.
arXiv Detail & Related papers (2022-01-14T08:23:16Z)
- Do Platform Migrations Compromise Content Moderation? Evidence from r/The_Donald and r/Incels [20.41491269475746]
We report the results of a large-scale observational study of how problematic online communities progress following community-level moderation measures.
Our results suggest that, in both cases, moderation measures significantly decreased posting activity on the new platform.
In spite of that, users in one of the studied communities showed increases in signals associated with toxicity and radicalization.
arXiv Detail & Related papers (2020-10-20T16:03:06Z)
- An Iterative Approach for Identifying Complaint Based Tweets in Social Media Platforms [76.9570531352697]
We propose an iterative methodology that aims to identify complaint-based posts pertaining to the transport domain.
We perform comprehensive evaluations and release a novel dataset for research purposes.
arXiv Detail & Related papers (2020-01-24T22:23:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.