Related papers: Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

URL: http://arxiv.org/abs/2001.07487v2
Date: Wed, 1 Apr 2020 13:57:35 GMT
Title: Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board
Authors: Antonis Papasavva, Savvas Zannettou, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn
Abstract summary: This paper presents a dataset with over 3.3M threads and 134.5M posts from the imageboard forum 4chan. To the best of our knowledge, this represents the largest publicly available 4chan dataset. We hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing.
Score: 12.14455026524814
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.

Related papers

3DLNews: A Three-decade Dataset of US Local News Articles [49.1574468325115]
3DLNews is a novel dataset with local news articles from the United States spanning the period from 1996 to 2024. It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV, and radio stations across all 50 states.
arXiv Detail & Related papers (2024-08-08T18:33:37Z)
iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023 [22.685953309889825]
We release a large-scale dataset from Scored, an alternative Reddit platform. The dataset covers at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. We provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model.
arXiv Detail & Related papers (2024-05-16T16:34:03Z)
Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B. We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively. We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
Wiki-based Communities of Interest: Demographics and Outliers [18.953455338226103]
Identified from Wiki-based sources, the data covers 7.5k communities, such as members of the White House Coronavirus Task Force. We release subject-centric and group-centric datasets, as well as a browsing interface.
arXiv Detail & Related papers (2023-03-16T09:58:11Z)
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation [72.18912216025029]
We present DisinfoMeme to help detect disinformation memes. The dataset contains memes mined from Reddit covering three current topics: the COVID-19 pandemic, the Black Lives Matter movement, and veganism/vegetarianism.
arXiv Detail & Related papers (2022-05-25T09:54:59Z)
"I Can't Keep It Up." A Dataset from the Defunct Voat.co News Aggregator [0.0]
Voat.co was a news aggregator website that shut down on December 25, 2020. This paper presents a dataset with over 2.3M submissions and 16.2M comments posted from 113K users in 7.1K subverses.
arXiv Detail & Related papers (2022-01-15T23:25:53Z)
Reducing Target Group Bias in Hate Speech Detectors [56.94616390740415]
We show that text classification models trained on large publicly available datasets may significantly under-perform on several protected groups. We propose to perform token-level hate sense disambiguation and to utilize tokens' hate sense representations for detection.
arXiv Detail & Related papers (2021-12-07T17:49:34Z)
News consumption and social media regulations policy [70.31753171707005]
We analyze two social media platforms that enforced opposite moderation methods, Twitter and Gab, to assess the interplay between news consumption and content regulation. Our results show that the moderation pursued by Twitter produces a significant reduction of questionable content. The lack of clear regulation on Gab results in a tendency of users to engage with both types of content, showing a slight preference for the questionable ones, which may account for a dissing/endorsement behavior.
arXiv Detail & Related papers (2021-06-07T19:26:32Z)
Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content. The dataset has 12,490 samples, split across 5 classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
Measuring and Characterizing Hate Speech on News Websites [13.289076063197466]
We analyze 125M comments posted on 412K news articles over the course of 19 months. We find statistically significant increases in hateful commenting activity around real-world divisive events like the "Unite the Right" rally in Charlottesville. We find that articles that attract a substantial number of hateful comments have different linguistic characteristics when compared to articles that do not attract hateful comments.
arXiv Detail & Related papers (2020-05-16T09:59:01Z)
Echo Chambers on Social Media: A comparative analysis [64.2256216637683]
We introduce an operational definition of echo chambers and perform a massive comparative analysis on 1B pieces of content produced by 1M users on four social media platforms. We infer the leaning of users about controversial topics and reconstruct their interaction networks by analyzing different features. We find support for the hypothesis that platforms implementing news feed algorithms, like Facebook, may elicit the emergence of echo chambers.
arXiv Detail & Related papers (2020-04-20T20:00:27Z)
The Pushshift Reddit Dataset [1.5661920010658625]
Pushshift is a social media data collection, analysis, and archiving platform that has collected Reddit data since 2015. Pushshift's Reddit dataset is updated in real time and includes historical data back to Reddit's inception.
arXiv Detail & Related papers (2020-01-23T10:31:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.