A Medical Information Extraction Workbench to Process German Clinical Text
- URL: http://arxiv.org/abs/2207.03885v1
- Date: Fri, 8 Jul 2022 13:19:19 GMT
- Title: A Medical Information Extraction Workbench to Process German Clinical Text
- Authors: Roland Roller, Laura Seiffe, Ammer Ayach, Sebastian Möller, Oliver
Marten, Michael Mikhailov, Christoph Alt, Danilo Schmidt, Fabian Halleck,
Marcel Naik, Wiebke Duettmann and Klemens Budde
- Abstract summary: We introduce a workbench: a collection of German clinical text processing models.
The models are trained on a de-identified corpus of German nephrology reports.
Our workbench is made publicly available so it can be used out of the box, as a benchmark or transferred to related problems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: In the information extraction and natural language processing
domain, accessible datasets are crucial to reproduce and compare results.
Publicly available implementations and tools can serve as benchmark and
facilitate the development of more complex applications. However, in the
context of clinical text processing the number of accessible datasets is scarce
-- and so is the number of existing tools. One of the main reasons is the
sensitivity of the data. This problem is even more evident for non-English
languages.
Approach: In order to address this situation, we introduce a workbench: a
collection of German clinical text processing models. The models are trained on
a de-identified corpus of German nephrology reports.
Result: The presented models provide promising results on in-domain data.
Moreover, we show that our models can also be applied successfully to other
biomedical text in German. Our workbench is made publicly available so it can
be used out of the box, as a benchmark or transferred to related problems.
Related papers
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Factuality Detection using Machine Translation -- a Use Case for German Clinical Text [45.875111164923545]
This work presents a simple solution using machine translation to translate English data to German to train a transformer-based factuality detection model.
Factuality can play an important role when automatically processing clinical text, as it makes a difference if particular symptoms are explicitly not present, possibly present, not mentioned, or affirmed.
arXiv Detail & Related papers (2023-08-17T07:24:06Z)
- Cross-lingual Argument Mining in the Medical Domain [6.0158981171030685]
We show how to perform Argument Mining (AM) in medical texts for which no annotated data is available.
Our work shows that automatically translating and projecting annotations (data-transfer) from English to a given target language is an effective way to generate annotated data.
We also show how the automatically generated data in Spanish can also be used to improve results in the original English monolingual setting.
arXiv Detail & Related papers (2023-01-25T11:21:12Z)
- RuMedBench: A Russian Medical Language Understanding Benchmark [58.99199480170909]
The paper describes the open Russian medical language understanding benchmark covering several task types.
We prepare the unified format labeling, data split, and evaluation metrics for new tasks.
A single-number metric expresses a model's ability to cope with the benchmark.
arXiv Detail & Related papers (2022-01-17T16:23:33Z)
- GERNERMED -- An Open German Medical NER Model [0.7310043452300736]
Data mining in medical data analysis often has to rely solely on unstructured text to retrieve relevant information.
In this work, we present GERNERMED, the first open neural NLP model for NER tasks dedicated to detecting medical entity types in German text data.
arXiv Detail & Related papers (2021-09-24T17:53:47Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models, and the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- A Practical Approach towards Causality Mining in Clinical Text using Active Transfer Learning [2.6125458645126907]
Causality mining is an active research area, which requires the application of state-of-the-art natural language processing techniques.
This work creates a framework that converts clinical text into causal knowledge.
arXiv Detail & Related papers (2020-12-10T06:51:13Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
- Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.