TGDataset: a Collection of Over One Hundred Thousand Telegram Channels
- URL: http://arxiv.org/abs/2303.05345v1
- Date: Thu, 9 Mar 2023 15:42:38 GMT
- Title: TGDataset: a Collection of Over One Hundred Thousand Telegram Channels
- Authors: Massimo La Morgia, Alessandro Mei, Alberto Maria Mongardini
- Abstract summary: This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages.
We analyze the languages spoken within our dataset and the topics covered by English channels.
In addition to the raw dataset, we released the scripts we used to analyze the dataset and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.
- Score: 69.22187804798162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Telegram is one of the most popular instant messaging apps in today's digital
age. In addition to providing a private messaging service, Telegram, with its
channels, represents a valid medium for rapidly broadcasting content to a large
audience (COVID-19 announcements), but, unfortunately, also for disseminating
radical ideologies and coordinating attacks (Capitol Hill riot). This paper
presents the TGDataset, a new dataset that includes 120,979 Telegram channels
and over 400 million messages, making it the largest collection of Telegram
channels to the best of our knowledge. After a brief introduction to the data
collection process, we analyze the languages spoken within our dataset and the
topics covered by English channels. Finally, we discuss some use cases in which
our dataset can be extremely useful for better understanding the Telegram
ecosystem, as well as to study the diffusion of questionable news. In addition
to the raw dataset, we released the scripts we used to analyze the dataset and
the list of channels belonging to the network of a new conspiracy theory called
Sabmyk.
Related papers
- Characterizing and Detecting Propaganda-Spreading Accounts on Telegram [7.759087666892532]
Information-based attacks on social media, such as disinformation campaigns and propaganda, are emerging cybersecurity threats.
We propose a novel mechanism for detecting propaganda that capitalizes on the relationship between legitimate user messages and propaganda replies.
Our method is faster, cheaper, and has a detection rate (97.6%) 11.6 percentage points higher than human moderators after seeing only one message from an account.
arXiv Detail & Related papers (2024-06-12T11:07:27Z)
- YODAS: Youtube-Oriented Dataset for Audio and Speech [47.60574092241447]
YODAS is a large-scale, multilingual dataset comprising over 500k hours of speech data in more than 100 languages.
The labeled subsets, including manual or automatic subtitles, facilitate supervised model training.
YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license.
arXiv Detail & Related papers (2024-06-02T23:43:27Z)
- Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram [5.161088104035108]
We study how 16 Russian media outlets interacted with and utilized 732 Telegram channels throughout 2022.
We show that news outlets not only propagate existing narratives through Telegram but that they source material from the messaging platform.
For example, across the websites in our study, between 2.3% (ura.news) and 26.7% (ukraina.ru) of articles discussed content that originated/resulted from activity on Telegram.
arXiv Detail & Related papers (2023-01-25T22:27:40Z)
- Uncovering the Dark Side of Telegram: Fakes, Clones, Scams, and Conspiracy Movements [67.39353554498636]
We perform a large-scale analysis of Telegram by collecting 35,382 different channels and over 130,000,000 messages.
We find some of the infamous activities also present on privacy-preserving services of the Dark Web, such as carding.
We propose a machine learning model that is able to identify fake channels with an accuracy of 86%.
arXiv Detail & Related papers (2021-11-26T14:53:31Z)
- Cross-lingual COVID-19 Fake News Detection [54.125563009333995]
We make the first attempt to detect COVID-19 misinformation in a low-resource language (Chinese) using only the fact-checked news in a high-resource language (English).
We propose a deep learning framework named CrossFake to jointly encode the cross-lingual news body texts and capture the news content.
Empirical results on our dataset demonstrate the effectiveness of CrossFake under the cross-lingual setting.
arXiv Detail & Related papers (2021-10-13T04:44:02Z)
- Introducing an Abusive Language Classification Framework for Telegram to Investigate the German Hater Community [0.6459215652021234]
We develop a framework that consists of (i) an abusive language classification model for German Telegram messages and (ii) a classification model for the hatefulness of Telegram channels.
For the channel classification model, we develop a method that combines channel specific content information coming from a topic model with a social graph to predict the hatefulness of channels.
As an additional output of the study, we release an annotated abusive language dataset containing 1,149 annotated Telegram messages.
arXiv Detail & Related papers (2021-09-15T14:58:46Z)
- MTVR: Multilingual Moment Retrieval in Videos [89.24431389933703]
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.
The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles.
We propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages.
arXiv Detail & Related papers (2021-07-30T20:01:03Z)
- Spot the conversation: speaker diarisation in the wild [108.61222789195209]
We propose an automatic audio-visual diarisation method for YouTube videos.
Second, we integrate our method into a semi-automatic dataset creation pipeline.
Third, we use this pipeline to create a large-scale diarisation dataset called VoxConverse.
arXiv Detail & Related papers (2020-07-02T15:55:54Z)
- The Pushshift Telegram Dataset [1.7109522466982476]
We present a dataset from one such mobile messaging platform: Telegram.
Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users.
arXiv Detail & Related papers (2020-01-23T10:37:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.