Detecting automatically the layout of clinical documents to enhance the
performances of downstream natural language processing
- URL: http://arxiv.org/abs/2305.13817v1
- Date: Tue, 23 May 2023 08:38:33 GMT
- Title: Detecting automatically the layout of clinical documents to enhance the
performances of downstream natural language processing
- Authors: Christel Gérardin, Perceval Wajsbürt, Basile Dura, Alice Calliger,
Alexandre Moucher, Xavier Tannier and Romain Bey
- Abstract summary: We designed an algorithm to process clinical PDF documents and extract only clinically relevant text.
The algorithm consists of several steps: initial text extraction using a PDF parser, followed by classification into categories such as body text, left notes, and footers.
Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections.
- Score: 53.797797404164946
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Objective: Develop and validate an algorithm for analyzing the layout of PDF
clinical documents to improve the performance of downstream natural language
processing tasks. Materials and Methods: We designed an algorithm to process
clinical PDF documents and extract only clinically relevant text. The algorithm
consists of several steps: initial text extraction using a PDF parser, followed
by classification into categories such as body text, left notes, and footers
using a Transformer deep neural network architecture, and finally an
aggregation step to compile the lines of a given label in the text. We
evaluated the technical performance of the body text extraction algorithm by
applying it to a random sample of documents that were annotated. Medical
performance was evaluated by examining the extraction of medical concepts of
interest from the text in their respective sections. Finally, we tested an
end-to-end system on a medical use case of automatic detection of acute
infection described in the hospital report. Results: Our algorithm achieved
per-line precision, recall, and F1 score of 98.4, 97.0, and 97.7, respectively,
for body line extraction. Using the output of the advanced body extraction
algorithm, the per-document precision, recall, and F1 score of the acute
infection detection algorithm were 82.54 (95% CI 72.86-91.60), 85.24
(95% CI 76.61-93.70), and 83.87 (95% CI 76.92-90.08), respectively. Conclusion: We have
developed and validated a system for extracting body text from clinical
documents in PDF format by identifying their layout. We were able to
demonstrate that this preprocessing allowed us to obtain better performances
for a common downstream task, i.e., the extraction of medical concepts in their
respective sections, thus proving the interest of this method on a clinical use
case.
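The aggregation step and the per-line evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label names (`body`, `left_note`, `footer`), the function names, and the sample lines are hypothetical, and the per-line labels are supplied by hand rather than predicted by the Transformer classifier mentioned above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical label names; the abstract only mentions categories
# "such as body text, left notes, and footers".
BODY, LEFT_NOTE, FOOTER = "body", "left_note", "footer"


def aggregate_by_label(lines: List[str], labels: List[str]) -> Dict[str, str]:
    """Compile the lines that share a label into one text per label,
    preserving the original reading order."""
    grouped = defaultdict(list)
    for line, label in zip(lines, labels):
        grouped[label].append(line)
    return {label: "\n".join(chunk) for label, chunk in grouped.items()}


def per_line_prf(gold: List[str], pred: List[str],
                 target: str = BODY) -> Tuple[float, float, float]:
    """Per-line precision, recall, and F1 for one target label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: five extracted lines with hand-written labels.
lines = [
    "Hospital X - Cardiology",       # left note
    "Patient admitted for fever.",   # body
    "Dr. A",                         # left note
    "No sign of acute infection.",   # body
    "Page 1/2",                      # footer
]
gold = [LEFT_NOTE, BODY, LEFT_NOTE, BODY, FOOTER]
pred = [LEFT_NOTE, BODY, BODY, BODY, FOOTER]  # one left note mislabeled as body

texts = aggregate_by_label(lines, pred)
precision, recall, f1 = per_line_prf(gold, pred)
print(texts[BODY])
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, which is consistent with the figures reported above: 2 x 98.4 x 97.0 / (98.4 + 97.0) is approximately 97.7.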
Related papers
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text
Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- Investigating Deep-Learning NLP for Automating the Extraction of
Oncology Efficacy Endpoints from Scientific Literature [0.0]
We have developed and optimised a framework to extract efficacy endpoints from text in scientific papers.
Our machine learning model predicts 25 classes associated with efficacy endpoints and leads to high F1 scores.
arXiv Detail & Related papers (2023-11-03T14:01:54Z)
- Towards Unifying Anatomy Segmentation: Automated Generation of a
Full-body CT Dataset via Knowledge Aggregation and Anatomical Guidelines [113.08940153125616]
We generate a dataset of whole-body CT scans with 142 voxel-level labels for 533 volumes providing comprehensive anatomical coverage.
Our proposed procedure does not rely on manual annotation during the label aggregation stage.
We release our trained unified anatomical segmentation model capable of predicting 142 anatomical structures on CT data.
arXiv Detail & Related papers (2023-07-25T09:48:13Z)
- Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- A Unified Framework of Medical Information Annotation and Extraction for
Chinese Clinical Text [1.4841452489515765]
Current state-of-the-art (SOTA) NLP models are highly integrated with deep learning techniques.
This study presents an engineering framework of medical entity recognition, relation extraction and attribute extraction.
arXiv Detail & Related papers (2022-03-08T03:19:16Z)
- Automated tabulation of clinical trial results: A joint entity and
relation extraction approach with transformer-based language representations [5.825190876052148]
This paper investigates automating evidence table generation by decomposing the problem across two language processing tasks.
We focus on the automatic tabulation of sentences from published RCT abstracts that report the practice outcomes.
To train and test these models, a new gold-standard corpus was developed, comprising almost 600 result sentences from six disease areas.
arXiv Detail & Related papers (2021-12-10T15:26:43Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or
Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
- Machine Learning Based on Natural Language Processing to Detect Cardiac
Failure in Clinical Narratives [0.2936007114555107]
The purpose of the study is to develop a machine learning algorithm that automatically detects whether a patient has a cardiac failure or a healthy condition.
Word representations were learned using bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), and neural word embeddings (word2vec).
The proposed framework yielded an overall classification performance with accuracy, precision, recall, and F1 of 84%, 82%, 85%, and 83%, respectively.
arXiv Detail & Related papers (2021-04-08T17:28:43Z)
- Detecting of a Patient's Condition From Clinical Narratives Using
Natural Language Representation [0.3149883354098941]
This paper proposes a joint clinical natural language representation learning and supervised classification framework.
The novel framework jointly discovers distributional syntactic and latent semantic (representation learning) from contextual clinical narrative inputs.
The proposed framework yields an overall classification performance with accuracy, recall, and precision of 89%, 88%, and 89%, respectively.
arXiv Detail & Related papers (2021-04-08T17:16:04Z)
- VerSe: A Vertebrae Labelling and Segmentation Benchmark for
Multi-detector CT Images [121.31355003451152]
The Large Scale Vertebrae Challenge (VerSe) was organised in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) in 2019 and 2020.
We present the results of this evaluation and further investigate the performance variation at vertebra level, scan level, and at different fields of view.
arXiv Detail & Related papers (2020-01-24T21:09:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.