GGPONC: A Corpus of German Medical Text with Rich Metadata Based on
Clinical Practice Guidelines
- URL: http://arxiv.org/abs/2007.06400v2
- Date: Mon, 16 Nov 2020 09:22:02 GMT
- Title: GGPONC: A Corpus of German Medical Text with Rich Metadata Based on
Clinical Practice Guidelines
- Authors: Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer,
Markus Follmann, Jan Philipp Sachs, Udo Hahn and Matthieu-P. Schapranow
- Abstract summary: GGPONC is a freely distributable German language corpus based on clinical practice guidelines for oncology.
GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield.
By applying and evaluating existing medical information extraction pipelines for German text, we are able to draw comparisons for the use of medical language to other corpora, medical and non-medical ones.
- Score: 4.370297546680015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The lack of publicly accessible text corpora is a major obstacle for progress
in natural language processing. For medical applications, unfortunately, all
language communities other than English are low-resourced. In this work, we
present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely
distributable German language corpus based on clinical practice guidelines for
oncology. This corpus is one of the largest ever built from German medical
documents. Unlike clinical documents, clinical guidelines do not contain any
patient-related information and can therefore be used without data protection
restrictions. Moreover, GGPONC is the first corpus for the German language
covering diverse conditions in a large medical subfield and provides a variety
of metadata, such as literature references and evidence levels. By applying and
evaluating existing medical information extraction pipelines for German text,
we are able to draw comparisons for the use of medical language to other
corpora, medical and non-medical ones.
Related papers
- ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish [39.81302995670643]
This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking.
It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
arXiv Detail & Related papers (2024-04-09T15:04:27Z)
- Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z)
- Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective [3.8233498951276403]
We present the first corpus for German Adverse Drug Reaction detection in patient-generated content.
The data consists of 4,169 binary annotated documents from a German patient forum.
arXiv Detail & Related papers (2022-08-03T12:52:01Z)
- A Medical Information Extraction Workbench to Process German Clinical Text [5.519657218427976]
We introduce a workbench: a collection of German clinical text processing models.
The models are trained on a de-identified corpus of German nephrology reports.
Our workbench is made publicly available so it can be used out of the box, as a benchmark or transferred to related problems.
arXiv Detail & Related papers (2022-07-08T13:19:19Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method, based on a language model trained on medical forum data, generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results with 11 current pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
- A Multilingual Neural Machine Translation Model for Biomedical Data [84.17747489525794]
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain.
The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English.
It is trained with large amounts of generic and biomedical data, using domain tags.
arXiv Detail & Related papers (2020-08-06T21:26:43Z)
- SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks [0.7311642662742726]
SemClinBr is a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations.
arXiv Detail & Related papers (2020-01-27T20:39:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.