HumSet: Dataset of Multilingual Information Extraction and
Classification for Humanitarian Crisis Response
- URL: http://arxiv.org/abs/2210.04573v1
- Date: Mon, 10 Oct 2022 11:28:07 GMT
- Authors: Selim Fekih, Nicolò Tamagnone, Benjamin Minixhofer, Ranjan Shrestha,
Ximena Contla, Ewan Oglethorpe, Navid Rekabsaz
- Abstract summary: HumSet is a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community.
The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe.
HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks.
- Score: 5.057850174013127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Timely and effective response to humanitarian crises requires quick and
accurate analysis of large amounts of text data - a process that can highly
benefit from expert-assisted NLP systems trained on validated and annotated
data in the humanitarian response domain. To enable creation of such NLP
systems, we introduce and release HumSet, a novel and rich multilingual dataset
of humanitarian response documents annotated by experts in the humanitarian
response community. The dataset provides documents in three languages (English,
French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021
across the globe. For each document, HumSet provides selected snippets
(entries) as well as assigned classes to each entry annotated using common
humanitarian information analysis frameworks. HumSet also provides novel and
challenging entry extraction and multi-label entry classification tasks. In
this paper, we take a first step towards approaching these tasks and conduct a
set of experiments on Pre-trained Language Models (PLM) to establish strong
baselines for future research in this domain. The dataset is available at
https://blog.thedeep.io/humset/.
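As a rough illustration of the multi-label entry classification task described in the abstract, the sketch below encodes each entry's classes as a multi-hot vector and computes a micro-averaged F1 score. The label names, the example entries, and the helper functions `multi_hot` and `micro_f1` are hypothetical; they are not taken from HumSet, its analysis frameworks, or the paper's evaluation code.

```python
from typing import List, Sequence

# Hypothetical label set; HumSet's real classes come from its analysis frameworks.
LABELS: List[str] = ["health", "shelter", "food_security", "protection"]


def multi_hot(tags: Sequence[str], labels: Sequence[str] = LABELS) -> List[int]:
    """Encode an entry's set of classes as a 0/1 vector over `labels`."""
    tag_set = set(tags)
    unknown = tag_set - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in tag_set else 0 for label in labels]


def micro_f1(gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN counts over all entries and labels."""
    tp = fp = fn = 0
    for gold_row, pred_row in zip(gold, pred):
        for g, p in zip(gold_row, pred_row):
            if g == 1 and p == 1:
                tp += 1
            elif g == 0 and p == 1:
                fp += 1
            elif g == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two made-up entries with expert-assigned (gold) and predicted classes.
gold = [multi_hot(["health", "shelter"]), multi_hot(["food_security"])]
pred = [multi_hot(["health"]), multi_hot(["food_security", "protection"])]
score = micro_f1(gold, pred)
```

Here one predicted class is missing and one is spurious, so the pooled counts are 2 true positives, 1 false positive, and 1 false negative. Micro-averaging is only one possible choice of metric for this kind of task.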
Related papers
- Capturing research literature attitude towards Sustainable Development Goals: an LLM-based topic modeling approach [0.7806050661713976]
The Sustainable Development Goals were formulated by the United Nations in 2015 to address these global challenges by 2030.
Natural language processing techniques can help uncover discussions on SDGs within research literature.
We propose a completely automated pipeline to fetch content from the Scopus database and prepare datasets dedicated to five groups of SDGs.
arXiv Detail & Related papers (2024-11-05T09:37:23Z)
- HumVI: A Multilingual Dataset for Detecting Violent Incidents Impacting Humanitarian Aid [6.0520837495927315]
HumVI is a dataset of news articles in three languages (English, French, Arabic) containing instances of violent incidents categorized by the humanitarian sector they impact.
We provide benchmarks for the dataset, employing various deep learning architectures and techniques, including data augmentation and mask loss.
arXiv Detail & Related papers (2024-10-08T21:08:13Z)
- HR-MultiWOZ: A Task Oriented Dialogue (TOD) Dataset for HR LLM Agent [6.764665650605542]
We introduce HR-MultiWOZ, a fully-labeled dataset of 550 conversations spanning 10 HR domains.
It is the first labeled open-sourced conversation dataset in the HR domain for NLP research.
It provides a detailed recipe for the data generation procedure along with data analysis and human evaluations.
arXiv Detail & Related papers (2024-02-01T21:10:44Z)
- U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts [9.76730765089929]
U-DIADS-Bib is a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in the fields of computer vision and humanities.
We propose a novel, computer-aided, segmentation pipeline in order to alleviate the burden represented by the time-consuming process of manual annotation.
arXiv Detail & Related papers (2024-01-16T15:11:18Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining [75.1086193340286]
It is desirable to have a general pretrain model for versatile human-centric downstream tasks.
We propose HumanBench, based on existing datasets, to evaluate on common ground the generalization abilities of different pretraining methods.
Our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets.
arXiv Detail & Related papers (2023-03-10T02:57:07Z)
- Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval [19.000263567641817]
We present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR).
The dataset is collected from three different domains, including E-commerce, Entertainment video and Medical.
We find that the performance of retrieval models trained on general-domain datasets inevitably decreases on specific domains.
arXiv Detail & Related papers (2022-03-07T13:20:46Z)
- Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains [67.99403521976058]
We demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19.
Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable.
We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting.
arXiv Detail & Related papers (2022-01-26T19:27:32Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks and smart phone.
We make quantitative experiments and deep analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)