BLM-17m: A Large-Scale Dataset for Black Lives Matter Topic Detection on
Twitter
- URL: http://arxiv.org/abs/2105.01331v3
- Date: Tue, 17 Oct 2023 07:30:40 GMT
- Title: BLM-17m: A Large-Scale Dataset for Black Lives Matter Topic Detection on
Twitter
- Authors: Hasan Kemik, Nusret Özateş, Meysam Asgari-Chenaghlu, Yang Li,
Erik Cambria
- Abstract summary: We propose a labeled dataset for topic detection that contains 17 million tweets.
The tweets were collected from 25 May 2020 to 21 August 2020, covering the 89 days from the start of the incident.
- Score: 25.881740515679393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protection of human rights is one of the most important problems of our
world. In this paper, our aim is to provide a dataset covering one of the most
significant human rights controversies of recent months that affected the whole
world: the George Floyd incident. We propose a labeled dataset for topic detection
that contains 17 million tweets. The tweets were collected from 25 May 2020 to
21 August 2020, covering the 89 days from the start of the incident. We labeled the
dataset by monitoring the most trending news topics from global and local
newspapers. In addition, we present two baselines, TF-IDF and LDA. We
evaluated the results of these two methods at three different k values using
precision, recall, and F1-score. The collected dataset is available
at https://github.com/MeysamAsgariC/BLMT.
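The abstract reports precision, recall, and F1-score at three k values but does not spell out the computation. The following is a minimal sketch of one common reading, assuming each method produces a ranked list of topic terms that is cut off at k and compared with a set of labeled terms. The function name, the example terms, and the k values (1, 3, 5) are hypothetical and are not taken from the paper.

```python
def precision_recall_f1_at_k(ranked, relevant, k):
    """Score the top-k items of a ranked list against a set of relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    # Count distinct top-k items that also appear in the labeled set.
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical ranked output of a baseline and hypothetical labeled terms.
    ranked_terms = ["protest", "police", "weather", "floyd", "football"]
    labeled_terms = {"protest", "floyd", "justice"}
    for k in (1, 3, 5):
        p, r, f = precision_recall_f1_at_k(ranked_terms, labeled_terms, k)
        print(f"k={k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this reading, a larger k tends to raise recall, because more labeled terms can be found, while precision may fall. Reporting several k values shows that trade-off.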
Related papers
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- A New Task and Dataset on Detecting Attacks on Human Rights Defenders [68.45906430323156]
We propose a new dataset for detecting Attacks on Human Rights Defenders (HRDsAttack) consisting of crowdsourced annotations on 500 online news articles.
The annotations include fine-grained information about the type and location of the attacks, as well as information about the victim(s).
We demonstrate the usefulness of the dataset by using it to train and evaluate baseline models on several sub-tasks to predict the annotated characteristics.
arXiv Detail & Related papers (2023-06-30T14:20:06Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection [4.799822253865053]
We construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation labels.
The dataset includes user engagements with the tweets in terms of likes, replies, retweets, and quotes.
arXiv Detail & Related papers (2022-10-11T12:25:26Z)
- CovidMis20: COVID-19 Misinformation Detection System on Twitter Tweets
using Deep Learning Models [1.4085013201980032]
This research presents the CovidMis20 dataset (COVID-19 Misinformation 2020 dataset), which consists of 1,375,592 tweets collected from February to July 2020.
This research was conducted using Bi-LSTM deep learning and an ensemble CNN+Bi-GRU for fake news detection.
arXiv Detail & Related papers (2022-09-13T00:43:44Z)
- DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally
Spreading Out Disinformation [72.18912216025029]
We present DisinfoMeme to help detect disinformation memes.
The dataset contains memes mined from Reddit covering three current topics: the COVID-19 pandemic, the Black Lives Matter movement, and veganism/vegetarianism.
arXiv Detail & Related papers (2022-05-25T09:54:59Z)
- Twitter Dataset on the Russo-Ukrainian War [68.713984286035]
We have initiated an ongoing dataset acquisition from the Twitter API.
The dataset has reached the amount of 57.3 million tweets, originating from 7.7 million users.
We apply an initial volume and sentiment analysis, while the dataset can be used to further exploratory investigation towards topic analysis, hate speech, propaganda recognition, or even show potential malicious entities like botnets.
arXiv Detail & Related papers (2022-04-07T12:33:06Z)
- Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal
Misinformation [83.2079454464572]
This paper describes our approach to the Image-Text Inconsistency Detection challenge of the DARPA Semantic Forensics (SemaFor) Program.
We collect Twitter-COMMs, a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles.
We train our approach, based on the state-of-the-art CLIP model, leveraging automatically generated random and hard negatives.
arXiv Detail & Related papers (2021-12-16T03:37:20Z)
- Extracting Feelings of People Regarding COVID-19 by Social Network
Mining [0.0]
A dataset of COVID-related tweets in English is collected.
More than two million tweets from March 23 to June 23 of 2020 are analyzed.
arXiv Detail & Related papers (2021-10-12T16:45:33Z)
- High-level Approaches to Detect Malicious Political Activity on Twitter [0.0]
We investigate a data snapshot taken in May 2020, with around 5 million accounts and over 120 million tweets.
The analyzed time period stretches from August 2019 to May 2020, with a focus on the Portuguese elections of October 6th, 2019.
We learn that Twitter's suspension patterns are not adequate to the type of political trolling found in the Portuguese Twittersphere.
arXiv Detail & Related papers (2021-02-04T22:54:44Z)
- Large Arabic Twitter Dataset on COVID-19 [0.7734726150561088]
The coronavirus disease (COVID-19), which emerged in late December 2019 in China, is now rapidly spreading across the globe.
The number of global confirmed cases has passed two and a half million, with over 180,000 fatalities.
This work describes the first Arabic tweets dataset on COVID-19 that we have been collecting since January 1st, 2020.
arXiv Detail & Related papers (2020-04-09T01:07:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.