HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep
Learning Benchmarks
- URL: http://arxiv.org/abs/2104.03090v2
- Date: Thu, 8 Apr 2021 09:12:11 GMT
- Title: HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep
Learning Benchmarks
- Authors: Firoj Alam, Umair Qazi, Muhammad Imran, Ferda Ofli
- Abstract summary: Social media content is often too noisy for direct use in any application.
It is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making.
We present a new large-scale dataset with 77K human-labeled tweets, sampled from a pool of 24 million tweets across 19 disaster events.
- Score: 5.937482215664902
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Social networks are widely used for information consumption and
dissemination, especially during time-critical events such as natural
disasters. Despite its significantly large volume, social media content is
often too noisy for direct use in any application. Therefore, it is important
to filter, categorize, and concisely summarize the available content to
facilitate effective consumption and decision-making. To address such issues,
automatic classification systems have been developed using supervised modeling
approaches, thanks to the earlier efforts on creating labeled datasets.
However, existing datasets are limited in different aspects (e.g., limited size,
duplicate content) and are less suitable to support more advanced and data-hungry
deep learning models. In this paper, we present a new large-scale dataset with
~77K human-labeled tweets, sampled from a pool of ~24 million tweets across 19
disaster events that happened between 2016 and 2019. Moreover, we propose a
data collection and sampling pipeline, which is important for social media data
sampling for human annotation. We report multiclass classification results
using classic and deep learning (fastText and transformer) based models to set
the ground for future studies. The dataset and associated resources are
publicly available at https://crisisnlp.qcri.org/humaid_dataset.html
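The abstract mentions multiclass classification with "classic" models but does not name them here. As a rough illustration of what such a baseline can look like, the sketch below trains a bag-of-words multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes, the function names, the label names, and the toy tweets are all assumptions made for this example; none of them come from the paper or the HumAID dataset.

```python
import math
from collections import Counter, defaultdict

# Toy, invented examples (NOT from HumAID); label names are hypothetical.
TOY_TWEETS = [
    ("bridge collapsed and roads are flooded", "infrastructure_damage"),
    ("power lines down roads blocked by debris", "infrastructure_damage"),
    ("please donate blankets and food to the shelter", "donation"),
    ("volunteers needed to donate water and food", "donation"),
    ("praying for everyone affected stay safe", "sympathy"),
    ("thoughts and prayers stay strong everyone", "sympathy"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_nb(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior for the class.
        score = math.log(label_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TOY_TWEETS)
print(predict_nb(model, "roads flooded and bridge damaged"))  # prints: infrastructure_damage
```

A real baseline on a dataset of this size would more likely use an established library and a held-out test split; the point here is only to show the shape of a multiclass text classifier.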
Related papers
- Active Learning for Identifying Disaster-Related Tweets: A Comparison with Keyword Filtering and Generic Fine-Tuning [0.25602836891933073]
It is difficult to identify the disaster-related posts among the large amounts of unstructured data available.
Previous methods often use keyword filtering, topic modelling or classification-based techniques to identify such posts.
This study investigates the potential of Active Learning (AL) for identifying disaster-related Tweets.
(2024-08-19T11:40:20Z)
- From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning [38.30983556062276]
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning.
Crowdsourcing introduces issues related to the annotator's experience, consistency, and biases.
This paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning.
(2024-01-24T04:57:32Z)
- CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster Tweet Classification [51.58605842457186]
We present a fine-grained disaster tweet classification model under the semi-supervised, few-shot learning setting.
Our model, CrisisMatch, effectively classifies tweets into fine-grained classes of interest using few labeled data and large amounts of unlabeled data.
(2023-10-23T07:01:09Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
(2021-12-15T18:56:39Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
(2021-10-29T18:41:15Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
(2021-05-29T21:05:28Z)
- The Surprising Performance of Simple Baselines for Misinformation Detection [4.060731229044571]
We examine the performance of a broad set of modern transformer-based language models.
We present our framework as a baseline for creating and evaluating new methods for misinformation detection.
(2021-04-14T16:25:22Z)
- Event-Related Bias Removal for Real-time Disaster Events [67.2965372987723]
Social media has become an important tool to share information about crisis events such as natural disasters and mass attacks.
Detecting actionable posts that contain useful information requires rapid analysis of huge volumes of data in real time.
We train an adversarial neural model to remove latent event-specific biases and improve the performance on tweet importance classification.
(2020-11-02T02:03:07Z)
- I-AID: Identifying Actionable Information from Disaster-related Tweets [0.0]
Social media plays a significant role in disaster management by providing valuable data about affected people, donations and help requests.
We propose I-AID, a multimodel approach to automatically categorize tweets into multi-label information types.
Our results indicate that I-AID outperforms state-of-the-art approaches in terms of weighted average F1 score by +6% and +4% on the TREC-IS dataset and COVID-19 Tweets, respectively.
(2020-08-04T19:07:50Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
(2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.