CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts
- URL: http://arxiv.org/abs/2309.05494v3
- Date: Thu, 11 Apr 2024 05:25:17 GMT
- Title: CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts
- Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera
- Abstract summary: Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature.
This study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets.
- Score: 3.690904966341072
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to the effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. The models are publicly available at: https://huggingface.co/crisistransformers
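For concreteness, a minimal sketch of loading one of these checkpoints through the standard transformers API; the model id below is an assumption, so consult the hub page above for the exact checkpoint names published by the authors:

```python
# Minimal sketch of loading a CrisisTransformers checkpoint from the
# Hugging Face Hub (https://huggingface.co/crisistransformers).
from transformers import AutoTokenizer, AutoModel

model_id = "crisistransformers/CT-M1-Complete"  # assumed id; verify on the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

tweets = ["Flood waters rising fast near the bridge, avoid Main St!"]
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```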
Related papers
- CrisisSense-LLM: Instruction Fine-Tuned Large Language Model for Multi-label Social Media Text Classification in Disaster Informatics [49.2719253711215]
This study introduces a novel approach to disaster text classification by enhancing a pre-trained Large Language Model (LLM).
Our methodology involves creating a comprehensive instruction dataset from disaster-related tweets, which is then used to fine-tune an open-source LLM.
This fine-tuned model can classify multiple aspects of disaster-related information simultaneously, such as the type of event, informativeness, and involvement of human aid.
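As an illustration of what one such instruction record might look like (the field names and label values below are assumptions, not the paper's actual schema):

```python
# Hypothetical instruction record for multi-label disaster tweet
# classification, covering the three aspects the summary names:
# event type, informativeness, and human aid involvement.
record = {
    "instruction": (
        "Classify the tweet along three aspects: event_type, "
        "informativeness, and human_aid_involvement."
    ),
    "input": "Power lines down across Elm Ave after the storm, crews on site.",
    "output": {
        "event_type": "storm",              # assumed label values
        "informativeness": "informative",
        "human_aid_involvement": "rescue_or_utilities",
    },
}
```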
arXiv Detail & Related papers (2024-06-16T23:01:10Z)
- Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts [3.690904966341072]
Tasks such as semantic search and clustering on crisis-related social media texts enhance our comprehension of crisis discourse.
Pre-trained language models have advanced performance in crisis informatics, but their contextual embeddings lack semantic meaningfulness.
We propose multi-lingual sentence encoders that embed crisis-related social media texts for over 50 languages.
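A minimal cross-lingual usage sketch via the sentence-transformers API; the model id is a placeholder, so substitute the encoder actually released with the paper:

```python
# Cross-lingual semantic similarity with a multilingual sentence encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("crisistransformers/CT-XLMR-SE")  # assumed id
texts = [
    "Earthquake felt downtown, buildings evacuated.",      # English
    "Se sintió un terremoto en el centro de la ciudad.",   # Spanish
]
embeddings = encoder.encode(texts, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))  # high for cross-lingual paraphrases
```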
arXiv Detail & Related papers (2024-03-25T10:44:38Z)
- CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster Tweet Classification [51.58605842457186]
We present a fine-grained disaster tweet classification model under the semi-supervised, few-shot learning setting.
Our model, CrisisMatch, effectively classifies tweets into fine-grained classes of interest using few labeled data and large amounts of unlabeled data.
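CrisisMatch builds on confidence-thresholded pseudo-labeling; a generic FixMatch-style training step under that assumption (not the paper's exact recipe) might look like:

```python
# One semi-supervised step: supervised loss on labeled tweets plus a
# consistency loss on confidently pseudo-labeled unlabeled tweets.
import torch
import torch.nn.functional as F

def semi_supervised_step(model, x_lab, y_lab, x_weak, x_strong, tau=0.95):
    # Supervised loss on the few labeled tweets.
    sup_loss = F.cross_entropy(model(x_lab), y_lab)
    # Pseudo-label unlabeled tweets from their weak augmentations.
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = conf >= tau  # keep only confident pseudo-labels
    # Consistency: strong augmentation must match the pseudo-label.
    unsup = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    unsup_loss = (unsup * mask.float()).mean()
    return sup_loss + unsup_loss
```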
arXiv Detail & Related papers (2023-10-23T07:01:09Z)
- DeCrisisMB: Debiased Semi-Supervised Learning for Crisis Tweet Classification via Memory Bank [52.20298962359658]
In crisis events, people often use social media platforms such as Twitter to disseminate information about the situation, warnings, advice, and support.
Fully-supervised approaches require annotating vast amounts of data and are impractical due to limited response time.
Semi-supervised models can be biased, performing moderately well for certain classes while performing extremely poorly for others.
We propose a simple but effective debiasing method, DeCrisisMB, that uses a Memory Bank to store generated pseudo-labels and sample them equally from each class at every training iteration.
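A minimal sketch of that memory-bank mechanism, assuming a per-class buffer with equal sampling; the capacity and eviction policy here are illustrative, not the authors' implementation:

```python
# Per-class memory bank that draws the same number of pseudo-labeled
# examples from every class, so no class dominates training.
import random
from collections import defaultdict

class MemoryBank:
    def __init__(self, capacity_per_class=256):
        self.capacity = capacity_per_class
        self.bank = defaultdict(list)  # class id -> stored examples

    def add(self, example, pseudo_label):
        slot = self.bank[pseudo_label]
        slot.append(example)
        if len(slot) > self.capacity:
            slot.pop(0)  # evict the oldest entry

    def sample_balanced(self, per_class):
        # Equal sampling: the same number of examples from every class.
        batch = []
        for label, slot in self.bank.items():
            picks = random.sample(slot, min(per_class, len(slot)))
            batch.extend((x, label) for x in picks)
        return batch
```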
arXiv Detail & Related papers (2023-10-23T05:25:51Z)
- Coping with low data availability for social media crisis message categorisation [3.0255457622022495]
This thesis focuses on addressing the challenge of low data availability when categorising crisis messages for emergency response.
It first presents domain adaptation as a solution for this problem, which involves learning a categorisation model from annotated data from past crisis events.
In many-to-many adaptation, where the model is trained on multiple past events and adapted to multiple ongoing events, a multi-task learning approach is proposed.
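A minimal sketch of such a many-to-many multi-task setup, assuming a shared Hugging Face-style encoder and one linear head per past crisis event (names and dimensions are illustrative):

```python
# Shared encoder with event-specific classification heads.
import torch.nn as nn

class MultiEventClassifier(nn.Module):
    def __init__(self, encoder, hidden_size, event_names, num_classes):
        super().__init__()
        self.encoder = encoder  # shared across all events
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_size, num_classes) for name in event_names
        })

    def forward(self, inputs, event):
        features = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return self.heads[event](features)  # event-specific logits
```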
arXiv Detail & Related papers (2023-05-26T19:08:24Z)
- CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization [62.77066949111921]
This paper presents CrisisLTLSum, the largest dataset of local crisis event timelines available to date.
CrisisLTLSum contains 1,000 crisis event timelines across four domains: wildfires, local fires, traffic, and storms.
Our initial experiments indicate a significant gap between strong baselines and human performance on both tasks.
arXiv Detail & Related papers (2022-10-25T17:32:40Z)
- Cross-Lingual and Cross-Domain Crisis Classification for Low-Resource Scenarios [4.147346416230273]
We study the task of automatically classifying messages related to crisis events by leveraging cross-language and cross-domain labeled data.
Our goal is to make use of labeled data from high-resource languages to classify messages from other (low-resource) languages and/or about new (previously unseen) types of crisis situations.
Our empirical findings show that it is indeed possible to leverage data from crisis events in English to classify the same type of event in other languages, such as Spanish and Italian.
arXiv Detail & Related papers (2022-09-05T20:57:23Z)
- Event-Related Bias Removal for Real-time Disaster Events [67.2965372987723]
Social media has become an important tool to share information about crisis events such as natural disasters and mass attacks.
Detecting actionable posts that contain useful information requires rapid analysis of huge volumes of data in real-time.
We train an adversarial neural model to remove latent event-specific biases and improve the performance on tweet importance classification.
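One standard way to realize such adversarial bias removal is a gradient reversal layer between the encoder and an event discriminator; a generic sketch under that assumption, not necessarily this paper's exact architecture:

```python
# Gradient reversal (as in domain-adversarial training): the event
# discriminator learns event identity, while reversed gradients push
# the encoder to discard event-specific features.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient sign so the encoder unlearns event identity.
        return -ctx.lam * grad_output, None

def adversarial_losses(encoder, clf_head, event_head, x, y_task, y_event, lam=1.0):
    features = encoder(x)
    task_loss = torch.nn.functional.cross_entropy(clf_head(features), y_task)
    # The discriminator sees reversed gradients w.r.t. the encoder.
    event_logits = event_head(GradReverse.apply(features, lam))
    event_loss = torch.nn.functional.cross_entropy(event_logits, y_event)
    return task_loss + event_loss
```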
arXiv Detail & Related papers (2020-11-02T02:03:07Z)
- CrisisBERT: a Robust Transformer for Crisis Classification and Contextual Crisis Embedding [2.7718973516070684]
We propose an end-to-end transformer-based model for two crisis classification tasks, namely crisis detection and crisis recognition.
We also propose Crisis2Vec, an attention-based, document-level contextual embedding architecture for crisis embedding.
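A minimal sketch of attention-based pooling of token states into a single document vector, in the spirit of Crisis2Vec as summarized above (the actual architecture may differ):

```python
# Learned attention weights over token states produce one document embedding.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, token_states, attention_mask):
        # token_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        scores = self.scorer(token_states).squeeze(-1)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        return (weights * token_states).sum(dim=1)  # (batch, hidden)
```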
arXiv Detail & Related papers (2020-05-11T09:57:24Z)
- CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing [13.11283003017537]
We consolidate eight human-annotated datasets and provide 166.1k and 141.5k tweets for informativeness and humanitarian classification tasks, respectively.
We provide benchmarks for both binary and multiclass classification tasks using several deep learning architectures, including CNN, fastText, and transformers.
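As an illustration, one of the listed baselines (fastText) can be trained on a file of `__label__<class> <tweet>` lines; the file name below is a placeholder:

```python
# Supervised fastText baseline for tweet classification.
import fasttext

model = fasttext.train_supervised(input="crisisbench_train.txt", epoch=10)
labels, probs = model.predict("Volunteers needed at the shelter on 5th Street")
```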
arXiv Detail & Related papers (2020-04-14T19:51:04Z)
- PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation [92.7366819044397]
Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation.
This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus.
An extensive set of experiments show that PALM achieves new state-of-the-art results on a variety of language generation benchmarks.
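A minimal sketch of a joint objective combining an autoencoding (masked-LM) loss with an autoregressive (next-token) loss, as the summary describes; the weighting and label conventions here are assumptions, not PALM's actual formulation:

```python
# Joint pre-training loss: recover masked tokens (autoencoding) while
# also predicting each target token from its prefix (autoregressive).
import torch.nn.functional as F

def joint_pretraining_loss(mlm_logits, mlm_labels, ar_logits, ar_labels, alpha=0.5):
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1),
        ignore_index=-100,  # positions that were not masked
    )
    ar_loss = F.cross_entropy(
        ar_logits.view(-1, ar_logits.size(-1)), ar_labels.view(-1),
        ignore_index=-100,  # padding positions
    )
    return alpha * mlm_loss + (1 - alpha) * ar_loss
```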
arXiv Detail & Related papers (2020-04-14T06:25:36Z)