A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
- URL: http://arxiv.org/abs/2403.18336v1
- Date: Wed, 27 Mar 2024 08:21:01 GMT
- Title: A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
- Authors: Lisa Raithel, Hui-Syuan Yeh, Shuntaro Yada, Cyril Grouin, Thomas Lavergne, Aurélie Névéol, Patrick Paroubek, Philippe Thomas, Tomohiro Nishiyama, Sebastian Möller, Eiji Aramaki, Yuji Matsumoto, Roland Roller, Pierre Zweigenbaum
- Abstract summary: This work presents a multilingual corpus of texts concerning Adverse Drug Reactions gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese.
It contributes to the development of real-world multilingual language models for healthcare.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.
Related papers
- Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
- Multilingual Clinical NER: Translation or Cross-lingual Transfer? [4.4924444466378555]
We show that translation-based methods can achieve similar performance to cross-lingual transfer.
We release MedNERF, a medical NER test set extracted from French drug prescriptions and annotated with the same guidelines as an English dataset.
arXiv Detail & Related papers (2023-06-07T12:31:07Z)
- Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias [38.26934474189853]
Unifying Cross-Lingual Medical Vision-Language Pre-Training (Med-UniC) is designed to integrate multimodal medical data from English and Spanish.
Med-UniC reaches superior performance across 5 medical image tasks and 10 datasets encompassing over 30 diseases.
arXiv Detail & Related papers (2023-05-31T14:28:19Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective [3.8233498951276403]
We present the first corpus for German Adverse Drug Reaction detection in patient-generated content.
The data consists of 4,169 binary annotated documents from a German patient forum.
arXiv Detail & Related papers (2022-08-03T12:52:01Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- A Multilingual Neural Machine Translation Model for Biomedical Data [84.17747489525794]
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain.
The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English.
It is trained with large amounts of generic and biomedical data, using domain tags.
arXiv Detail & Related papers (2020-08-06T21:26:43Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
- SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks [0.7311642662742726]
SemClinBr is a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations.
arXiv Detail & Related papers (2020-01-27T20:39:32Z)