Multi-Platform Aggregated Dataset of Online Communities (MADOC)
- URL: http://arxiv.org/abs/2501.12886v1
- Date: Wed, 22 Jan 2025 14:02:11 GMT
- Title: Multi-Platform Aggregated Dataset of Online Communities (MADOC)
- Authors: Marija Mitrović Dankulov, Aleksandar Tomašević, Slobodan Maletić, Miroslav Anđelković, Ana Vranić, Darja Cvetković, Boris Stupovski, Dušan Vudragović, Sara Major, Aleksandar Bogojević,
- Abstract summary: MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users.
The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
- Score: 64.45797970830233
- License:
- Abstract: The Multi-platform Aggregated Dataset of Online Communities (MADOC) is a comprehensive dataset that facilitates computational social science research by providing FAIR-compliant standardized access to cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users. The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis. By providing UUID-anonymized user histories and temporal alignment of banned communities' activity patterns, MADOC supports research on content moderation impacts and platform migration trends. Distributed via Zenodo with persistent identifiers and Python/R toolkits, the dataset adheres to FAIR principles while addressing post-API-era research challenges through ethical aggregation of public social media archives.
Related papers
- Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data)
The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z) - Modeling offensive content detection for TikTok [0.0]
This research undertakes the collection and analysis of TikTok data containing offensive content.
It builds a series of machine learning and deep learning models for offensive content detection.
arXiv Detail & Related papers (2024-08-29T18:47:41Z) - Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research [0.0]
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics.
Access to these datasets is often restricted due to costs and platform regulations.
This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
arXiv Detail & Related papers (2024-07-11T09:12:39Z) - The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition [64.5207572897806]
The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems.
In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals.
The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor dataset.
arXiv Detail & Related papers (2024-06-11T22:26:20Z) - Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration [60.535793237063885]
The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet.
The impact of this surge in AIGC on Information Retrieval systems remains an open question.
We introduce Cocktail, a benchmark tailored for evaluating IR models in this mixed-sourced data landscape.
arXiv Detail & Related papers (2024-05-26T12:30:20Z) - iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023 [22.685953309889825]
We release a large-scale dataset from Scored, an alternative Reddit platform.
At least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception.
We provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model.
arXiv Detail & Related papers (2024-05-16T16:34:03Z) - The DSA Transparency Database: Auditing Self-reported Moderation Actions by Social Media [0.4597131601929317]
We analyze all 353.12M records submitted by the eight largest social media platforms in the EU during the first 100 days of the database.
Our findings have far-reaching implications for policymakers and scholars across diverse disciplines.
arXiv Detail & Related papers (2023-12-16T00:02:49Z) - Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.