The Pushshift Telegram Dataset
- URL: http://arxiv.org/abs/2001.08438v1
- Date: Thu, 23 Jan 2020 10:37:33 GMT
- Title: The Pushshift Telegram Dataset
- Authors: Jason Baumgartner, Savvas Zannettou, Megan Squire, Jeremy Blackburn
- Abstract summary: We present a dataset from one such mobile messaging platform: Telegram.
Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users.
- Score: 1.7109522466982476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Messaging platforms, especially those with a mobile focus, have become
increasingly ubiquitous in society. These mobile messaging platforms can have
deceivingly large user bases, and in addition to being a way for people to stay
in touch, are often used to organize social movements, as well as a place for
extremists and other ne'er-do-wells to congregate. In this paper, we present a
dataset from one such mobile messaging platform: Telegram. Our dataset is made
up of over 27.8K channels and 317M messages from 2.2M unique users. To the best
of our knowledge, our dataset is the largest and most complete of its
kind. In addition to the raw data, we also provide the source code used to
collect it, allowing researchers to run their own data collection instance. We
believe the Pushshift Telegram dataset can help researchers from a variety of
disciplines interested in studying online social movements, protests, political
extremism, and disinformation.
Related papers
- pytopicgram: A library for data extraction and topic modeling from Telegram channels [0.0]
pytopicgram is a Python library that helps researchers collect, organize, and analyze these Telegram messages.
pytopicgram allows users to understand how content spreads and how audiences interact on Telegram.
arXiv Detail & Related papers (2025-02-07T12:41:47Z)
- Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users.
The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z)
- TelegramScrap: A comprehensive tool for scraping Telegram data [0.0]
TelegramScrap is a tool for extracting and analyzing data from Telegram channels and groups.
This white paper outlines the tool's development, capabilities, and applications in academic and scientific research.
arXiv Detail & Related papers (2024-12-21T21:46:56Z)
- Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data).
The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z)
- WildChat: 1M ChatGPT Interaction Logs in the Wild [88.05964311416717]
WildChat is a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns.
In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses.
arXiv Detail & Related papers (2024-05-02T17:00:02Z)
- An Exploratory Analysis of COVID Bot vs Human Disinformation Dissemination stemming from the Disinformation Dozen on Telegram [5.494111035517598]
The COVID-19 pandemic of 2021 led to a worldwide health crisis that was accompanied by an infodemic.
A group of 12 social media personalities, dubbed the "Disinformation Dozen", were identified as key in spreading disinformation regarding the COVID-19 virus, treatments, and vaccines.
This study focuses on the spread of disinformation propagated by this group on Telegram, a mobile messaging and social media platform.
arXiv Detail & Related papers (2024-02-22T01:10:11Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- TGDataset: Collecting and Exploring the Largest Telegram Channels Dataset [57.2282378772772]
This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages.
We analyze the languages spoken within our dataset and the topics covered by English channels.
In addition to the raw dataset, we released the scripts we used to analyze the dataset and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.
arXiv Detail & Related papers (2023-03-09T15:42:38Z)
- Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
- PoliWAM: An Exploration of a Large Scale Corpus of Political Discussions on WhatsApp Messenger [1.2301855531996841]
WhatsApp Messenger is one of the most popular channels for spreading information with a current reach of more than 180 countries and 2 billion people.
In the recent past, several countries have witnessed its effectiveness and influence in political and social campaigns.
We observe a high surge in information and propaganda flow during election campaigning.
arXiv Detail & Related papers (2020-10-26T00:35:57Z)