Related papers: Multi-Platform Aggregated Dataset of Online Communities (MADOC)

URL: http://arxiv.org/abs/2501.12886v1
Date: Wed, 22 Jan 2025 14:02:11 GMT
Title: Multi-Platform Aggregated Dataset of Online Communities (MADOC)
Authors: Marija Mitrović Dankulov, Aleksandar Tomašević, Slobodan Maletić, Miroslav Anđelković, Ana Vranić, Darja Cvetković, Boris Stupovski, Dušan Vudragović, Sara Major, Aleksandar Bogojević
Abstract summary: MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users. The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
Score: 64.45797970830233
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Multi-platform Aggregated Dataset of Online Communities (MADOC) is a comprehensive dataset that facilitates computational social science research by providing FAIR-compliant standardized access to cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users. The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis. By providing UUID-anonymized user histories and temporal alignment of banned communities' activity patterns, MADOC supports research on content moderation impacts and platform migration trends. Distributed via Zenodo with persistent identifiers and Python/R toolkits, the dataset adheres to FAIR principles while addressing post-API-era research challenges through ethical aggregation of public social media archives.

Related papers

Community Moderation and the New Epistemology of Fact Checking on Social Media [124.26693978503339]
Social media platforms have traditionally relied on independent fact-checking organizations to identify and flag misleading content. X (formerly Twitter) and Meta have shifted towards community-driven content moderation by launching their own versions of crowd-sourced fact-checking. We examine the current approaches to misinformation detection across major platforms, explore the emerging role of community-driven moderation, and critically evaluate both the promises and challenges of crowd-checking at scale.
arXiv Detail & Related papers (2025-05-26T14:50:18Z)
Bridging the Narrative Divide: Cross-Platform Discourse Networks in Fragmented Ecosystems [9.119607936530038]
Political discourse is increasingly fragmented across different social networks. We offer a structural lens for anticipating how narratives traverse fragmented ecosystems. These findings offer implications for cross-platform governance, content moderation, and policy interventions.
arXiv Detail & Related papers (2025-05-22T16:53:52Z)
Post-Post-API Age: Studying Digital Platforms in Scant Data Access Times [5.997153455641738]
The "post-API age" has sparked optimism about increased platform transparency and renewed opportunities for comprehensive research on digital platforms. However, it remains unclear whether platforms provide adequate data access in practice. Our findings reveal significant challenges in accessing social media data. These challenges have exacerbated existing institutional, regional, and financial inequities in data access.
arXiv Detail & Related papers (2025-05-15T00:47:06Z)
Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data). The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z)
Modeling offensive content detection for TikTok [0.0]
This research undertakes the collection and analysis of TikTok data containing offensive content. It builds a series of machine learning and deep learning models for offensive content detection.
arXiv Detail & Related papers (2024-08-29T18:47:41Z)
Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research [0.0]
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. Access to these datasets is often restricted due to costs and platform regulations. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
arXiv Detail & Related papers (2024-07-11T09:12:39Z)
The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition [64.5207572897806]
The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems. In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals. The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor dataset.
arXiv Detail & Related papers (2024-06-11T22:26:20Z)
Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration [60.535793237063885]
The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet. The impact of this surge in AIGC on Information Retrieval systems remains an open question. We introduce Cocktail, a benchmark tailored for evaluating IR models in this mixed-sourced data landscape.
arXiv Detail & Related papers (2024-05-26T12:30:20Z)
iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023 [22.685953309889825]
We release a large-scale dataset from Scored, an alternative Reddit platform. The dataset covers at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. We provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model.
arXiv Detail & Related papers (2024-05-16T16:34:03Z)
The DSA Transparency Database: Auditing Self-reported Moderation Actions by Social Media [0.4597131601929317]
We analyze all 353.12M records submitted by the eight largest social media platforms in the EU during the first 100 days of the database. Our findings have far-reaching implications for policymakers and scholars across diverse disciplines.
arXiv Detail & Related papers (2023-12-16T00:02:49Z)
Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets. We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
Analyzing User Engagement with TikTok's Short Format Video Recommendations using Data Donations [31.764672446151412]
We analyze user engagement on TikTok using data we collect via a data donation system. We find that the average daily usage time increases over the users' lifetime while the user attention remains stable at around 45%. We also find that users like more videos uploaded by people they follow than those recommended by people they do not follow.
arXiv Detail & Related papers (2023-01-12T11:34:45Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.