The Pushshift Telegram Dataset
- URL: http://arxiv.org/abs/2001.08438v1
- Date: Thu, 23 Jan 2020 10:37:33 GMT
- Title: The Pushshift Telegram Dataset
- Authors: Jason Baumgartner, Savvas Zannettou, Megan Squire, Jeremy Blackburn
- Abstract summary: We present a dataset from one such mobile messaging platform: Telegram.
Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users.
- Score: 1.7109522466982476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Messaging platforms, especially those with a mobile focus, have become
increasingly ubiquitous in society. These mobile messaging platforms can have
deceivingly large user bases, and in addition to being a way for people to stay
in touch, are often used to organize social movements, as well as a place for
extremists and other ne'er-do-wells to congregate. In this paper, we present a
dataset from one such mobile messaging platform: Telegram. Our dataset is made
up of over 27.8K channels and 317M messages from 2.2M unique users. To the best
of our knowledge, our dataset is the largest and most complete of its
kind. In addition to the raw data, we also provide the source code used to
collect it, allowing researchers to run their own data collection instance. We
believe the Pushshift Telegram dataset can help researchers from a variety of
disciplines interested in studying online social movements, protests, political
extremism, and disinformation.
Related papers
- WildChat: 1M ChatGPT Interaction Logs in the Wild [88.05964311416717]
WildChat is a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns.
In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses.
arXiv Detail & Related papers (2024-05-02T17:00:02Z)
- WhatsApp Explorer: A Data Donation Tool To Facilitate Research on WhatsApp [1.2507543279181124]
This paper introduces WhatsApp Explorer, a tool designed to enable WhatsApp data collection on a large scale.
We discuss protocols for data collection, including potential sampling approaches, and explain why our tool (and adjoining protocol) arguably allow researchers to collect WhatsApp data in an ethical and legal manner, at scale.
arXiv Detail & Related papers (2024-03-29T13:30:29Z)
- An Exploratory Analysis of COVID Bot vs Human Disinformation Dissemination stemming from the Disinformation Dozen on Telegram [5.494111035517598]
The COVID-19 pandemic of 2021 led to a worldwide health crisis that was accompanied by an infodemic.
A group of 12 social media personalities, dubbed the "Disinformation Dozen", were identified as key in spreading disinformation regarding the COVID-19 virus, treatments, and vaccines.
This study focuses on the spread of disinformation propagated by this group on Telegram, a mobile messaging and social media platform.
arXiv Detail & Related papers (2024-02-22T01:10:11Z)
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs).
This dataset is collected from 210K IP addresses in the wild on our Vicuna demo and Arena website.
We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
arXiv Detail & Related papers (2023-09-21T12:13:55Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- TGDataset: a Collection of Over One Hundred Thousand Telegram Channels [69.22187804798162]
This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages.
We analyze the languages spoken within our dataset and the topics covered by English channels.
In addition to the raw dataset, we released the scripts we used to analyze the dataset and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.
arXiv Detail & Related papers (2023-03-09T15:42:38Z)
- A Hierarchical Network-Oriented Analysis of User Participation in Misinformation Spread on WhatsApp [0.9774299772405469]
We present a hierarchical network-oriented characterization of the users engaged in misinformation spread on WhatsApp.
Our study offers valuable insights into how WhatsApp users leverage the underlying network connecting different groups to gain large reach in the spread of misinformation on the platform.
arXiv Detail & Related papers (2021-09-22T00:00:02Z)
- JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection [54.696819174421584]
We introduce JRDB-Act, a multi-modal dataset that reflects a real distribution of human daily life actions in a university campus environment.
JRDB-Act has been densely annotated with atomic actions and comprises over 2.8M action labels.
JRDB-Act comes with social group identification annotations conducive to the task of grouping individuals based on their interactions in the scene.
arXiv Detail & Related papers (2021-06-16T14:43:46Z)
- Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
- PoliWAM: An Exploration of a Large Scale Corpus of Political Discussions on WhatsApp Messenger [1.2301855531996841]
WhatsApp Messenger is one of the most popular channels for spreading information with a current reach of more than 180 countries and 2 billion people.
In the recent past, several countries have witnessed its effectiveness and influence in political and social campaigns.
We observe a high surge in information and propaganda flow during election campaigning.
arXiv Detail & Related papers (2020-10-26T00:35:57Z)