Towards Structuring Real-World Data at Scale: Deep Learning for
Extracting Key Oncology Information from Clinical Text with Patient-Level
Supervision
- URL: http://arxiv.org/abs/2203.10442v1
- Date: Sun, 20 Mar 2022 03:42:03 GMT
- Authors: Sam Preston, Mu Wei, Rajesh Rao, Robert Tinn, Naoto Usuyama, Michael
Lucas, Roshanthi Weerasinghe, Soohee Lee, Brian Piening, Paul Tittel, Naveen
Valluri, Tristan Naumann, Carlo Bifulco, Hoifung Poon
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objective: The majority of detailed patient information in real-world data
(RWD) is only consistently available in free-text clinical documents. Manual
curation is expensive and time-consuming. Developing natural language
processing (NLP) methods for structuring RWD is thus essential for scaling
real-world evidence generation.
Materials and Methods: Traditional rule-based systems are vulnerable to the
prevalent linguistic variations and ambiguities in clinical text, and prior
applications of machine-learning methods typically require sentence-level or
report-level labeled examples that are hard to produce at scale. We propose
leveraging patient-level supervision from medical registries, which are often
readily available and capture key patient information, for general RWD
applications. To combat the lack of sentence-level or report-level annotations,
we explore advanced deep-learning methods by combining domain-specific
pretraining, recurrent neural networks, and hierarchical attention.
Results: We conduct an extensive study on 135,107 patients from the cancer
registry of a large integrated delivery network (IDN) comprising healthcare
systems in five western US states. Our deep learning methods attain test AUROC
of 94-99% for key tumor attributes and comparable performance on held-out data
from separate health systems and states.
Discussion and Conclusion: Ablation results demonstrate clear superiority of
these advanced deep-learning methods over prior approaches. Error analysis
shows that our NLP system sometimes even corrects errors in registrar labels.
We also conduct a preliminary investigation into accelerating registry curation
and general RWD structuring via assisted curation for over 1.2 million cancer
patients in this healthcare network.
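The methods section describes pooling many free-text reports per patient into a single prediction supervised only by a patient-level registry label, using hierarchical attention. As a rough illustration (not the authors' implementation, and with all dimensions, weights, and function names assumed for the sketch), attention can first pool token embeddings into one vector per report, then pool report vectors into one patient vector that feeds a classifier trained against the registry label:

```python
# Minimal NumPy sketch of patient-level prediction via hierarchical attention.
# This is an illustrative assumption-laden toy, not the paper's architecture:
# the embedding dimension D, random "pretrained" embeddings, and the simple
# dot-product attention are all placeholders.
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (assumed)

def attention_pool(vectors, w):
    """Weight each row by softmax(vectors @ w) and sum into one vector."""
    scores = vectors @ w
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vectors

# One patient with three clinical reports, each a sequence of token embeddings
# (stand-ins for contextual encodings from a pretrained clinical LM / RNN).
reports = [rng.standard_normal((n_tokens, D)) for n_tokens in (5, 12, 7)]

w_word = rng.standard_normal(D)    # word-level attention parameters
w_report = rng.standard_normal(D)  # report-level attention parameters
w_clf = rng.standard_normal(D)     # classifier weights

# Level 1: pool token embeddings into one vector per report.
report_vecs = np.stack([attention_pool(r, w_word) for r in reports])
# Level 2: pool report vectors into one patient-level vector.
patient_vec = attention_pool(report_vecs, w_report)
# Patient-level probability, trainable against a single registry label (0/1),
# so no sentence- or report-level annotations are needed.
prob = 1.0 / (1.0 + np.exp(-patient_vec @ w_clf))
```

Because the loss attaches only to `prob`, gradients flow through both attention levels, letting the model learn which reports, and which tokens within them, carry the tumor attribute, which is the appeal of patient-level supervision.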
Related papers
- Assertion Detection Large Language Model In-context Learning LoRA Fine-tuning
We introduce a novel methodology that utilizes Large Language Models (LLMs) pre-trained on a vast array of medical data for assertion detection.
Our approach achieved an F-1 of 0.74, which is 0.31 higher than the previous method.
arXiv Detail & Related papers (2024-01-31T05:11:00Z)
- README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP
We introduce a new task of automatically generating lay definitions, aiming to simplify medical terms into patient-friendly lay language.
We first created the dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions.
We have also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality.
arXiv Detail & Related papers (2023-12-24T23:01:00Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
However, LLMs still struggle with accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- Dissecting Self-Supervised Learning Methods for Surgical Computer Vision
Self-Supervised Learning (SSL) methods have begun to gain traction in the general computer vision community.
The effectiveness of SSL methods in more complex and impactful domains, such as medicine and surgery, remains limited and unexplored.
We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental and popular tasks in surgical context understanding, phase recognition and tool presence detection.
arXiv Detail & Related papers (2022-07-01T14:17:11Z)
- TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing
We focus on the core task of the traditional Chinese medicine (TCM) diagnosis and treatment system: syndrome differentiation (SD).
Our dataset contains 54,152 real-world clinical records covering 148 syndromes.
We propose a domain-specific pre-trained language model, called ZY-BERT.
arXiv Detail & Related papers (2022-03-21T09:59:54Z)
- Federated Cycling (FedCy): Semi-supervised Federated Learning of Surgical Phases
FedCy is a federated semi-supervised learning (FSSL) method that combines federated learning (FL) and self-supervised learning to exploit a decentralized dataset of both labeled and unlabeled videos.
We demonstrate significant performance gains over state-of-the-art FSSL methods on the task of automatic recognition of surgical phases.
arXiv Detail & Related papers (2022-03-14T17:44:53Z)
- A Systematic Review of Natural Language Processing Applied to Radiology Reports
This study systematically assesses recent literature in NLP applied to radiology reports.
Our analysis is based on 21 variables, including radiology characteristics, NLP methodology, performance, and study and clinical application characteristics.
arXiv Detail & Related papers (2021-02-18T18:54:41Z)
- Uncovering the structure of clinical EEG signals with self-supervised learning
Supervised learning paradigms are often limited by the amount of labeled data that is available.
This phenomenon is particularly problematic in clinically-relevant data, such as electroencephalography (EEG).
By extracting information from unlabeled data, it might be possible to reach competitive performance with deep neural networks.
arXiv Detail & Related papers (2020-07-31T14:34:47Z)
- Natural Language Processing with Deep Learning for Medical Adverse Event Detection from Free-Text Medical Narratives: A Case Study of Detecting Total Hip Replacement Dislocation
We propose deep learning based NLP (DL-NLP) models for efficient and accurate hip dislocation AE detection following total hip replacement.
We benchmarked these proposed models against a wide variety of traditional machine learning based NLP (ML-NLP) models.
All DL-NLP models outperformed all of the ML-NLP models, with a convolutional neural network (CNN) model achieving the best overall performance.
arXiv Detail & Related papers (2020-04-17T16:25:36Z)
- DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment.
DeepEnroll is a cross-modal inference learning model that jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference.
arXiv Detail & Related papers (2020-01-22T17:51:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.