Advancing Italian Biomedical Information Extraction with
Transformers-based Models: Methodological Insights and Multicenter Practical
Application
- URL: http://arxiv.org/abs/2306.05323v2
- Date: Mon, 15 Jan 2024 11:05:23 GMT
- Title: Advancing Italian Biomedical Information Extraction with
Transformers-based Models: Methodological Insights and Multicenter Practical
Application
- Authors: Claudio Crema, Tommaso Mario Buonocore, Silvia Fostinelli, Enea
Parimbelli, Federico Verde, Cira Fundarò, Marina Manera, Matteo Cotta
Ramusino, Marco Capelli, Alfredo Costa, Giuliano Binetti, Riccardo Bellazzi
and Alberto Redolfi
- Abstract summary: Information Extraction can help clinical practitioners overcome the underutilization of unstructured medical records by using automated text-mining pipelines.
We created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Transformers-based model.
The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) a fine-tuning strategy that combines classical methods with a "low-resource" approach.
- Score: 0.27027468002793437
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The introduction of computerized medical records in hospitals has reduced
burdensome activities like manual writing and information fetching. However,
the data contained in medical records are still far underutilized, primarily
because extracting data from unstructured textual medical records takes time
and effort. Information Extraction, a subfield of Natural Language Processing,
can help clinical practitioners overcome this limitation by using automated
text-mining pipelines. In this work, we created the first Italian
neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to
develop a Transformers-based model. Moreover, we collected and leveraged three
external independent datasets to implement an effective multicenter model, with
an overall F1-score of 84.77%, a Precision of 83.16%, and a Recall of 86.44%. The lessons learned
are: (i) the crucial role of a consistent annotation process and (ii) a
fine-tuning strategy that combines classical methods with a "low-resource"
approach. This allowed us to establish methodological guidelines that pave the
way for Natural Language Processing studies in less-resourced languages.
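The abstract describes fine-tuning a Transformer encoder for Italian clinical Named Entity Recognition and evaluating it across centers. As a rough illustration of that workflow, the sketch below fine-tunes a token-classification head with the Hugging Face transformers library; the checkpoint name, label set, and data files are illustrative assumptions, not the authors' actual PsyNIT setup or their specific low-resource fine-tuning recipe.

    # Minimal sketch, assuming a generic Hugging Face token-classification setup;
    # the checkpoint, labels, and data files below are illustrative, not the paper's.
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              DataCollatorForTokenClassification, Trainer,
                              TrainingArguments)

    CHECKPOINT = "dbmdz/bert-base-italian-xxl-cased"  # assumption: any Italian BERT-like encoder
    LABELS = ["O", "B-DIAGNOSIS", "I-DIAGNOSIS", "B-DRUG", "I-DRUG"]  # hypothetical tag set

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForTokenClassification.from_pretrained(CHECKPOINT, num_labels=len(LABELS))

    def tokenize_and_align(batch):
        # Tokenize pre-split words and propagate word-level tags to sub-tokens;
        # special tokens get -100 so the loss ignores them.
        enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
        enc["labels"] = [
            [-100 if w is None else tags[w] for w in enc.word_ids(batch_index=i)]
            for i, tags in enumerate(batch["ner_tags"])
        ]
        return enc

    data = load_dataset("json", data_files={"train": "train.json", "validation": "dev.json"})
    data = data.map(tokenize_and_align, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ner-ita", num_train_epochs=3,
                               per_device_train_batch_size=16, learning_rate=3e-5),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()

The entity-level Precision, Recall, and F1 reported above are typically computed on held-out predictions with a sequence-labelling scorer such as seqeval.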
Related papers
- GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models [1.123722364748134]
This paper introduces GAMedX, a Named Entity Recognition (NER) approach utilizing Large Language Models (LLMs).
The methodology integrates open-source LLMs for NER, using chained prompts and Pydantic schemas for structured output to navigate the complexities of specialized medical jargon (a generic sketch of this schema-constrained output pattern appears after this list).
The findings reveal a significant ROUGE F1 score on one of the evaluation datasets, with an accuracy of 98%.
arXiv Detail & Related papers (2024-05-31T02:53:22Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- A Unified Framework of Medical Information Annotation and Extraction for Chinese Clinical Text [1.4841452489515765]
Current state-of-the-art (SOTA) NLP models are highly integrated with deep learning techniques.
This study presents an engineering framework of medical entity recognition, relation extraction and attribute extraction.
arXiv Detail & Related papers (2022-03-08T03:19:16Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations [5.825190876052148]
This paper investigates automating evidence table generation by decomposing the problem across two language processing tasks.
We focus on the automatic tabulation of sentences from published RCT abstracts that report trial outcomes.
To train and test these models, a new gold-standard corpus was developed, comprising almost 600 result sentences from six disease areas.
arXiv Detail & Related papers (2021-12-10T15:26:43Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Med7: a transferable clinical natural language processing model for electronic health records [6.935142529928062]
We introduce a named-entity recognition model for clinical natural language processing.
The model is trained to recognise seven categories: drug names, route, frequency, dosage, strength, form, and duration.
We evaluate the transferability of the developed model from Intensive Care Unit data in the US to secondary care mental health records (CRIS) in the UK.
arXiv Detail & Related papers (2020-03-03T00:55:43Z)
- Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records [4.339510167603376]
We construct a varied dataset consisting of the medical records of 1260 patients by sampling data from nine institutes and three domains of Dutch healthcare.
We test the generalizability of three de-identification methods across languages and domains.
arXiv Detail & Related papers (2020-01-16T09:42:29Z)
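As referenced in the GAMedX entry above, one common way to obtain structured NER output from an LLM is to validate its response against a Pydantic schema and chain a corrective prompt on failure. The sketch below shows that generic pattern only; the entity fields, prompt, and stand-in model response are illustrative assumptions, not the published GAMedX pipeline.

    # Generic sketch of schema-constrained LLM output with Pydantic; the schema,
    # prompt, and stand-in model response below are illustrative assumptions.
    import json
    from typing import List
    from pydantic import BaseModel

    class MedicalEntity(BaseModel):
        text: str   # surface form as it appears in the report
        label: str  # hypothetical tag, e.g. "DRUG" or "DIAGNOSIS"

    class ExtractionResult(BaseModel):
        entities: List[MedicalEntity]

    PROMPT = ("Extract the medical entities from the report below and return JSON "
              "matching this schema: {schema}\n\nReport:\n{report}")

    def parse_llm_output(raw: str) -> ExtractionResult:
        # Raises pydantic.ValidationError when the response does not match the
        # schema; a real pipeline would catch that and chain a corrective prompt.
        return ExtractionResult.model_validate_json(raw)

    if __name__ == "__main__":
        # Show the prompt a client would send, then parse a stand-in response.
        print(PROMPT.format(schema=ExtractionResult.model_json_schema(),
                            report="Patient started on donepezil 5 mg daily."))
        fake_response = json.dumps({"entities": [{"text": "donepezil", "label": "DRUG"}]})
        print(parse_llm_output(fake_response))

Validating against a fixed schema keeps the downstream processing deterministic even when the LLM's free-text output varies.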