MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction
- URL: http://arxiv.org/abs/2405.18015v2
- Date: Fri, 22 Nov 2024 21:40:35 GMT
- Title: MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction
- Authors: Xiang Dai, Sarvnaz Karimi, Abeed Sarker, Ben Hachey, Cecile Paris
- Abstract summary: Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources.
Most datasets or shared tasks focus on extracting ADEs from a particular type of text.
Domain generalisation - the ability of a machine learning model to perform well on new, unseen domains (text types) - is under-explored.
We build a benchmark for adverse drug event extraction, which we named MultiADE.
- Score: 11.458594744457521
- Abstract: Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources, such as electronic health records, medical literature, social media and search engine logs. Over the years, many datasets have been created, and shared tasks have been organised to facilitate active adverse event surveillance. However, most - if not all - datasets or shared tasks focus on extracting ADEs from a particular type of text. Domain generalisation - the ability of a machine learning model to perform well on new, unseen domains (text types) - is under-explored. Given the rapid advancements in natural language processing, one unanswered question is how far we are from having a single ADE extraction model that is effective on various types of text, such as scientific literature and social media posts. We contribute to answering this question by building a multi-domain benchmark for adverse drug event extraction, which we named MultiADE. The new benchmark comprises several existing datasets sampled from different text types and our newly created dataset - CADECv2, which is an extension of CADEC, covering online posts regarding more diverse drugs than CADEC. Our new dataset is carefully annotated by human annotators following detailed annotation guidelines. Our benchmark results show that the generalisation of the trained models is far from perfect, making it infeasible to be deployed to process different types of text. In addition, although intermediate transfer learning is a promising approach to utilising existing resources, further investigation is needed on methods of domain adaptation, particularly cost-effective methods to select useful training instances. The newly created CADECv2 and the scripts for building the benchmark are publicly available at CSIRO's Data Portal.
Related papers
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z)
- All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks [4.06061049778407]
Event data, or structured records of "who did what to whom" that are automatically extracted from text, is an important source of data for scholars of international politics.
This paper describes a "bag of tricks" for efficient, custom event data production, drawing on recent advances in natural language processing (NLP).
We describe how these techniques produced the new POLECAT global event dataset that is intended to replace ICEWS.
arXiv Detail & Related papers (2023-04-03T19:51:00Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-commerce Title and Review Summarization [14.414693156937782]
We propose a novel domain-specific generative pre-training (DS-GPT) method for text generation.
We apply it to the product title and review summarization problems on E-commerce mobile display.
arXiv Detail & Related papers (2021-12-15T19:02:49Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- A Span Extraction Approach for Information Extraction on Visually-Rich Documents [2.3131309703965135]
We present a new approach to improve the capability of language model pre-training on visually-rich documents (VRDs).
Firstly, we introduce a new IE model that is query-based and employs the span extraction formulation instead of the commonly used sequence labelling approach.
We also propose a new training task which focuses on modelling the relationships between semantic entities within a document.
arXiv Detail & Related papers (2021-06-02T06:50:04Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Detecting Ongoing Events Using Contextual Word and Sentence Embeddings [110.83289076967895]
This paper introduces the Ongoing Event Detection (OED) task.
The goal is to detect ongoing event mentions only, as opposed to historical, future, hypothetical, or other forms of events that are neither fresh nor current.
Any application that needs to extract structured information about ongoing events from unstructured texts can take advantage of an OED system.
arXiv Detail & Related papers (2020-07-02T20:44:05Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high from those with a low likelihood of gaining citations.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)