Related papers: The Pushshift Telegram Dataset

The Pushshift Telegram Dataset

URL: http://arxiv.org/abs/2001.08438v1
Date: Thu, 23 Jan 2020 10:37:33 GMT
Title: The Pushshift Telegram Dataset
Authors: Jason Baumgartner, Savvas Zannettou, Megan Squire, Jeremy Blackburn
Abstract summary: We present a dataset from one such mobile messaging platform: Telegram. Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users.
Score: 1.7109522466982476
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Messaging platforms, especially those with a mobile focus, have become increasingly ubiquitous in society. These mobile messaging platforms can have deceivingly large user bases, and in addition to being a way for people to stay in touch, are often used to organize social movements, as well as a place for extremists and other ne'er-do-well to congregate. In this paper, we present a dataset from one such mobile messaging platform: Telegram. Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users. To the best of our knowledge, our dataset comprises the largest and most complete of its kind. In addition to the raw data, we also provide the source code used to collect it, allowing researchers to run their own data collection instance. We believe the Pushshift Telegram dataset can help researchers from a variety of disciplines interested in studying online social movements, protests, political extremism, and disinformation.

Related papers

pytopicgram: A library for data extraction and topic modeling from Telegram channels [0.0]
pytopicgram is a Python library that helps researchers collect, organize, and analyze these Telegram messages. pytopicgram allows users to understand how content spreads and how audiences interact on Telegram.
arXiv Detail & Related papers (2025-02-07T12:41:47Z)
Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users. The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z)
TelegramScrap: A comprehensive tool for scraping Telegram data [0.0]
TelegramScrap is a tool for extracting and analyzing data from Telegram channels and groups. This white paper outlines the tool's development, capabilities, and applications in academic and scientific research.
arXiv Detail & Related papers (2024-12-21T21:46:56Z)
Labeled Datasets for Research on Information Operations [71.34999856621306]
We present new labeled datasets about 26 campaigns, which contain both IO posts verified by a social media platform and over 13M posts by 303k accounts that discussed similar topics in the same time frames (control data) The datasets will facilitate the study of narratives, network interactions, and engagement strategies employed by coordinated accounts across various campaigns and countries.
arXiv Detail & Related papers (2024-11-15T22:15:01Z)
WildChat: 1M ChatGPT Interaction Logs in the Wild [88.05964311416717]
WildChat is a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses.
arXiv Detail & Related papers (2024-05-02T17:00:02Z)
WhatsApp Explorer: A Data Donation Tool To Facilitate Research on WhatsApp [1.2507543279181124]
This paper introduces WhatsApp Explorer, a tool designed to enable WhatsApp data collection on a large scale. We discuss protocols for data collection, including potential sampling approaches, and explain why our tool (and adjoining protocol) arguably allow researchers to collect WhatsApp data in an ethical and legal manner, at scale.
arXiv Detail & Related papers (2024-03-29T13:30:29Z)
An Exploratory Analysis of COVID Bot vs Human Disinformation Dissemination stemming from the Disinformation Dozen on Telegram [5.494111035517598]
The COVID-19 pandemic of 2021 led to a worldwide health crisis that was accompanied by an infodemic. A group of 12 social media personalities, dubbed the Disinformation Dozen", were identified as key in spreading disinformation regarding the COVID-19 virus, treatments, and vaccines. This study focuses on the spread of disinformation propagated by this group on Telegram, a mobile messaging and social media platform.
arXiv Detail & Related papers (2024-02-22T01:10:11Z)
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs) This dataset is collected from 210K IP addresses in the wild on our Vicuna demo and Arena website. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
arXiv Detail & Related papers (2023-09-21T12:13:55Z)
ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information. To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles. Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
TGDataset: a Collection of Over One Hundred Thousand Telegram Channels [69.22187804798162]
This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages. We analyze the languages spoken within our dataset and the topic covered by English channels. In addition to the raw dataset, we released the scripts we used to analyze the dataset and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.
arXiv Detail & Related papers (2023-03-09T15:42:38Z)
A Hierarchical Network-Oriented Analysis of User Participation in Misinformation Spread on WhatsApp [0.9774299772405469]
We present a hierarchical network-oriented characterization of the users engaged in misinformation spread on WhatsApp. Our study offers valuable insights into how WhatsApp users leverage the underlying network connecting different groups to gain large reach in the spread of misinformation on the platform.
arXiv Detail & Related papers (2021-09-22T00:00:02Z)
Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts. We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
PoliWAM: An Exploration of a Large Scale Corpus of Political Discussions on WhatsApp Messenger [1.2301855531996841]
WhatsApp Messenger is one of the most popular channels for spreading information with a current reach of more than 180 countries and 2 billion people. In the recent past, several countries have witnessed its effectiveness and influence in political and social campaigns. We observe a high surge in information and propaganda flow during election campaigning.
arXiv Detail & Related papers (2020-10-26T00:35:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.