Validation of a Zero-Shot Learning Natural Language Processing Tool for
Data Abstraction from Unstructured Healthcare Data
- URL: http://arxiv.org/abs/2308.00107v1
- Date: Sun, 23 Jul 2023 17:52:28 GMT
- Title: Validation of a Zero-Shot Learning Natural Language Processing Tool for
Data Abstraction from Unstructured Healthcare Data
- Authors: Basil Kaufmann, Dallin Busby, Chandan Krushna Das, Neeraja Tillu, Mani
Menon, Ashutosh K. Tewari, Michael A. Gorin
- Abstract summary: A data abstraction tool was developed based on the GPT-3.5 model from OpenAI.
It was compared to three human abstractors in terms of time to task completion and accuracy for abstracting data.
The tool was assessed for superiority for data abstraction speed and non-inferiority for accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Objectives: To describe the development and validation of a zero-shot
learning natural language processing (NLP) tool for abstracting data from
unstructured text contained within PDF documents, such as those found within
electronic health records. Materials and Methods: A data abstraction tool based
on the GPT-3.5 model from OpenAI was developed and compared to three physician
human abstractors in terms of time to task completion and accuracy for
abstracting data on 14 unique variables from a set of 199 de-identified radical
prostatectomy pathology reports. The reports were processed by the software
tool in vectorized and scanned formats to establish the impact of optical
character recognition on data abstraction. The tool was assessed for
superiority for data abstraction speed and non-inferiority for accuracy.
Results: The human abstractors required a mean of 101 s per report for data
abstraction, with times varying from 15 to 284 s. In comparison, the software
tool required a mean of 12.8 s to process the vectorized reports and a mean of
15.8 s to process the scanned reports (P < 0.001). The overall accuracies of the
three human abstractors were 94.7%, 97.8%, and 96.4% for the combined set of
2786 datapoints. The software tool had an overall accuracy of 94.2% for the
vectorized reports, proving to be non-inferior to the human abstractors at a
margin of -10% ($\alpha$=0.025). The tool had a slightly lower accuracy of
88.7% using the scanned reports, proving to be non-inferior to 2 out of 3
human abstractors. Conclusion: The developed zero-shot learning NLP tool
affords researchers comparable levels of accuracy to that of human abstractors,
with significant time savings benefits. Because of the lack of need for
task-specific model training, the developed tool is highly generalizable and
can be used for a wide variety of data abstraction tasks, even outside the
field of medicine.
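The non-inferiority comparison described in the abstract can be illustrated with a short sketch. The abstract does not state which statistical procedure the authors used, so the code below assumes a simple one-sided Wald interval for the difference of two independent proportions, which ignores the fact that the tool and the humans abstracted the same datapoints. The function name `noninferiority_lower_bound` and the variable names are hypothetical; the accuracies, the margin of -10%, alpha = 0.025, and the 14 x 199 datapoint count come from the abstract.

```python
import math
from statistics import NormalDist


def noninferiority_lower_bound(acc_tool, acc_human, n, alpha=0.025):
    """Lower bound of the one-sided (1 - alpha) Wald interval for acc_tool - acc_human.

    Assumption: both accuracies are treated as independent proportions
    measured on n datapoints each (a simplification of the paired design).
    """
    diff = acc_tool - acc_human
    se = math.sqrt(acc_tool * (1 - acc_tool) / n + acc_human * (1 - acc_human) / n)
    z = NormalDist().inv_cdf(1 - alpha)
    return diff - z * se


N_DATAPOINTS = 14 * 199  # 14 variables x 199 reports = 2786 datapoints
MARGIN = -0.10           # non-inferiority margin from the abstract
HUMAN_ACCURACIES = [0.947, 0.978, 0.964]
TOOL_ACCURACIES = {"vectorized": 0.942, "scanned": 0.887}

for report_format, acc_tool in TOOL_ACCURACIES.items():
    for acc_human in HUMAN_ACCURACIES:
        lower = noninferiority_lower_bound(acc_tool, acc_human, N_DATAPOINTS)
        # Non-inferiority is concluded when the lower bound stays above the margin.
        verdict = "non-inferior" if lower > MARGIN else "not shown"
        print(f"{report_format} vs human {acc_human:.3f}: "
              f"lower bound {lower:+.4f} -> {verdict}")
```

Under these simplifying assumptions, the vectorized accuracy clears the margin against all three abstractors and the scanned accuracy clears it against two of the three, which matches the pattern reported in the abstract. The authors' actual test may differ.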
Related papers
- Leveraging large language models for structured information extraction from pathology reports [0.0]
We evaluate large language models' accuracy in extracting structured information from breast cancer histopathology reports.
Open-source tool for structured information extraction can be customized by non-programmers using natural language.
arXiv Detail & Related papers (2025-02-14T21:46:02Z)
- Developing an efficient corpus using Ensemble Data cleaning approach [0.0]
This research aims to clean a medical dataset using ensemble techniques and to develop a corpus.
The data cleaning method in this research suggests that the ensemble technique provides the highest accuracy (94%) compared to the single process.
It underscores the importance of NLP in the medical field, where accurate and timely information extraction can be a matter of life and death.
arXiv Detail & Related papers (2024-06-02T16:03:31Z)
- Investigating Deep-Learning NLP for Automating the Extraction of
Oncology Efficacy Endpoints from Scientific Literature [0.0]
We have developed and optimised a framework to extract efficacy endpoints from text in scientific papers.
Our machine learning model predicts 25 classes associated with efficacy endpoints and leads to high F1 scores.
arXiv Detail & Related papers (2023-11-03T14:01:54Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text
Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Detecting automatically the layout of clinical documents to enhance the
performances of downstream natural language processing [53.797797404164946]
We designed an algorithm to process clinical PDF documents and extract only clinically relevant text.
The algorithm consists of several steps: initial text extraction using a PDF parser, followed by classification into such categories as body text, left notes, and footers.
Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections.
arXiv Detail & Related papers (2023-05-23T08:38:33Z)
- Optimising Human-Machine Collaboration for Efficient High-Precision
Information Extraction from Text Documents [23.278525774427607]
We consider the benefits and drawbacks of various human-only, human-machine, and machine-only information extraction approaches.
We present a framework and an accompanying tool for information extraction using weak-supervision labelling with human validation.
We find that the combination of computer speed and human understanding yields precision comparable to manual annotation while requiring only a fraction of time.
arXiv Detail & Related papers (2023-02-18T13:07:22Z)
- Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
- A Unified Framework of Medical Information Annotation and Extraction for
Chinese Clinical Text [1.4841452489515765]
Current state-of-the-art (SOTA) NLP models are highly integrated with deep learning techniques.
This study presents an engineering framework of medical entity recognition, relation extraction and attribute extraction.
arXiv Detail & Related papers (2022-03-08T03:19:16Z)
- Chest x-ray automated triage: a semiologic approach designed for
clinical implementation, exploiting different types of labels through a
combination of four Deep Learning architectures [83.48996461770017]
This work presents a Deep Learning method based on the late fusion of different convolutional architectures.
We built four training datasets combining images from public chest x-ray datasets and our institutional archive.
We trained four different Deep Learning architectures and combined their outputs with a late fusion strategy, obtaining a unified tool.
arXiv Detail & Related papers (2020-12-23T14:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.