HumVI: A Multilingual Dataset for Detecting Violent Incidents Impacting Humanitarian Aid
- URL: http://arxiv.org/abs/2410.06370v2
- Date: Tue, 15 Oct 2024 20:23:13 GMT
- Title: HumVI: A Multilingual Dataset for Detecting Violent Incidents Impacting Humanitarian Aid
- Authors: Hemank Lamba, Anton Abilov, Ke Zhang, Elizabeth M. Olson, Henry K. Dambanemuya, João C. Bárcia, David S. Batista, Christina Wille, Aoife Cahill, Joel Tetreault, Alex Jaimes
- Abstract summary: HumVI is a dataset of news articles in three languages (English, French, Arabic) that contain instances of violent incidents, categorized by the humanitarian sector they impact.
We provide benchmarks for the dataset, employing various deep learning architectures and techniques, including data augmentation and mask loss.
- Score: 6.0520837495927315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humanitarian organizations can enhance their effectiveness by analyzing data to discover trends, gather aggregated insights, manage their security risks, support decision-making, and inform advocacy and funding proposals. However, data about violent incidents with direct impact and relevance for humanitarian aid operations is not readily available. An automatic data collection and NLP-backed classification framework aligned with humanitarian perspectives can help bridge this gap. In this paper, we present HumVI - a dataset comprising news articles in three languages (English, French, Arabic) containing instances of different types of violent incidents categorized by the humanitarian sector they impact, e.g., aid security, education, food security, health, and protection. Reliable labels were obtained for the dataset by partnering with a data-backed humanitarian organization, Insecurity Insight. We provide multiple benchmarks for the dataset, employing various deep learning architectures and techniques, including data augmentation and mask loss, to address different task-related challenges, e.g., domain expansion. The dataset is publicly available at https://github.com/dataminr-ai/humvi-dataset.
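The abstract mentions mask loss as one technique for domain expansion but does not define it here. A minimal sketch of one common reading follows, assuming a multi-label setup in which some articles have no annotation for newly added sectors, so those (article, sector) pairs are excluded from a binary cross-entropy loss. The function name `masked_bce`, the toy probabilities, and the mask layout are all hypothetical illustrations, not the authors' implementation.

```python
import math

# Sector names are taken from the abstract; everything else is illustrative.
SECTORS = ["aid security", "education", "food security", "health", "protection"]


def masked_bce(probs, labels, mask):
    """Mean binary cross-entropy over the (example, label) pairs whose mask is 1.

    probs  -- predicted probabilities, one row per article, one column per sector
    labels -- 0/1 gold labels with the same shape
    mask   -- 1 where the gold label is known, 0 where it should be ignored
    """
    eps = 1e-12  # clamp to avoid log(0)
    total, count = 0.0, 0
    for p_row, y_row, m_row in zip(probs, labels, mask):
        for p, y, m in zip(p_row, y_row, m_row):
            if not m:
                continue  # unknown label: contributes nothing to the loss
            p = min(max(p, eps), 1.0 - eps)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0


# Toy batch: the second article is assumed to predate the last three sectors,
# so its labels for those sectors are masked out.
probs = [[0.9, 0.2, 0.1, 0.7, 0.3],
         [0.1, 0.8, 0.5, 0.5, 0.5]]
labels = [[1, 0, 0, 1, 0],
          [0, 1, 0, 0, 0]]
mask = [[1, 1, 1, 1, 1],
        [1, 1, 0, 0, 0]]

loss = masked_bce(probs, labels, mask)
print(f"masked loss over {sum(map(sum, mask))} known labels: {loss:.4f}")
```

Under this assumption, predictions for masked sectors can take any value without changing the loss, which lets older articles be reused when new sectors are added without treating their missing labels as negatives.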
Related papers
- A Social Context-aware Graph-based Multimodal Attentive Learning Framework for Disaster Content Classification during Emergencies [0.0]
CrisisSpot is a method that captures complex relationships between textual and visual modalities.
IDEA captures both harmonious and contrasting patterns within the data to enhance multimodal interactions.
CrisisSpot achieved an average F1-score gain of 9.45% and 5.01% compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-10-11T13:51:46Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- CrisisSense-LLM: Instruction Fine-Tuned Large Language Model for Multi-label Social Media Text Classification in Disaster Informatics [49.2719253711215]
This study introduces a novel approach to disaster text classification by enhancing a pre-trained Large Language Model (LLM).
Our methodology involves creating a comprehensive instruction dataset from disaster-related tweets, which is then used to fine-tune an open-source LLM.
This fine-tuned model can classify multiple aspects of disaster-related information simultaneously, such as the type of event, informativeness, and involvement of human aid.
arXiv Detail & Related papers (2024-06-16T23:01:10Z)
- When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content [0.0]
This article revisits a pseudo-labeling approach for sensitive data, using Ukrainian tweets covering the Russian-Ukrainian war as an example.
We provide a fundamental statistical analysis of the obtained data, an evaluation of the models used for pseudo-labeling, and guidelines on how researchers can use the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z)
- A New Task and Dataset on Detecting Attacks on Human Rights Defenders [68.45906430323156]
We propose a new dataset for detecting Attacks on Human Rights Defenders (HRDsAttack) consisting of crowdsourced annotations on 500 online news articles.
The annotations include fine-grained information about the type and location of the attacks, as well as information about the victim(s).
We demonstrate the usefulness of the dataset by using it to train and evaluate baseline models on several sub-tasks to predict the annotated characteristics.
arXiv Detail & Related papers (2023-06-30T14:20:06Z)
- Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results on the effect of data augmentation for three popular computer vision tasks: image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
- HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response [5.057850174013127]
HumSet is a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community.
The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe.
HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks.
arXiv Detail & Related papers (2022-10-10T11:28:07Z)
- Data Poisoning Attacks and Defenses to Crowdsourcing Systems [26.147716118854614]
We show that crowdsourcing is vulnerable to data poisoning attacks.
Malicious clients provide carefully crafted data to corrupt the aggregated data.
We propose two defenses to reduce the impact of malicious clients.
arXiv Detail & Related papers (2021-02-18T06:03:48Z) - Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks,
and Defenses [150.64470864162556]
This work systematically categorizes and discusses a wide range of dataset vulnerabilities and exploits.
In addition to describing various poisoning and backdoor threat models and the relationships among them, we develop their unified taxonomy.
arXiv Detail & Related papers (2020-12-18T22:38:47Z)