Medical Scientific Table-to-Text Generation with Human-in-the-Loop under
the Data Sparsity Constraint
- URL: http://arxiv.org/abs/2205.12368v1
- Date: Tue, 24 May 2022 21:10:57 GMT
- Title: Medical Scientific Table-to-Text Generation with Human-in-the-Loop under
the Data Sparsity Constraint
- Authors: Heng-Yi Wu, Jingqing Zhang, Julia Ive, Tong Li, Narges Tabari,
Bingyuan Chen, Vibhor Gupta, Yike Guo
- Abstract summary: An efficient table-to-text summarization system can drastically reduce manual efforts to condense this data into reports.
However, in practice, the problem is heavily impeded by the data paucity, data sparsity and inability of the state-of-the-art natural language generation models to produce accurate and reliable outputs.
We propose a novel table-to-text approach and tackle these problems with a novel two-step architecture which is enhanced by auto-correction, copy mechanism and synthetic data augmentation.
- Score: 11.720364723821993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured (tabular) data in the preclinical and clinical domains contains
valuable information about individuals and an efficient table-to-text
summarization system can drastically reduce manual efforts to condense this
data into reports. However, in practice, the problem is heavily impeded by the
data paucity, data sparsity and inability of the state-of-the-art natural
language generation models (including T5, PEGASUS and GPT-Neo) to produce
accurate and reliable outputs. In this paper, we propose a novel table-to-text
approach and tackle these problems with a novel two-step architecture which is
enhanced by auto-correction, copy mechanism and synthetic data augmentation.
The study shows that the proposed approach selects salient biomedical entities
and values from structured data with improved precision (up to 0.13 absolute
increase) of copying the tabular values to generate coherent and accurate text
for assay validation reports and toxicology reports. Moreover, we also
demonstrate a light-weight adaptation of the proposed system to new datasets by
fine-tuning with as little as 40% of the training examples. The outputs of our model
are validated by human experts in the Human-in-the-Loop scenario.
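The abstract reports improved precision when copying tabular values into the generated text. As a rough illustration only, the sketch below computes one possible "copy precision": the fraction of numbers in a generated sentence that also occur in the source table. This definition, the function name `copy_precision`, and the sample values are assumptions made for illustration; the paper's actual metric may differ.

```python
import re

# Matches integers and decimals, e.g. "12", "98.2", "-0.5".
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def copy_precision(table_values, generated_text):
    """Fraction of numbers in the generated text that also appear in the table.

    Hypothetical metric for illustration; not taken from the paper.
    Returns None when the text mentions no numbers at all.
    """
    table = {float(v) for v in table_values}
    mentioned = [float(m) for m in NUMBER.findall(generated_text)]
    if not mentioned:
        return None
    copied = sum(1 for value in mentioned if value in table)
    return copied / len(mentioned)


# Made-up example: two of the three numbers in the text come from the table.
table_values = ["98.2", "1.5", "12"]
text = "Accuracy was 98.2% with a CV of 1.5% across 10 runs."
score = copy_precision(table_values, text)
print(score)
```

A score below 1.0 flags numbers in the text that have no counterpart in the table, which is the kind of output a human expert would want to review first.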
Related papers
- Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports [2.932283627137903]
The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports for isocitrate dehydrogenase (IDH) mutation status.
arXiv Detail & Related papers (2024-09-15T15:21:45Z)
- Enhancing Clinical Documentation with Synthetic Data: Leveraging Generative Models for Improved Accuracy [0.0]
This paper proposes a novel approach to augment clinical documentation by leveraging synthetic data generation techniques.
We present a methodology that combines state-of-the-art generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).
We demonstrate the effectiveness of our approach in generating high-quality synthetic transcripts that closely resemble real-world data.
arXiv Detail & Related papers (2024-06-03T15:49:03Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images.
Our approach fuses image and textual data to enhance the generation process.
We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z)
- Leveraging text data for causal inference using electronic health records [1.4182510510164876]
This paper presents a unified framework for leveraging text data to support causal inference with electronic health data.
We show how incorporating text data in a traditional matching analysis can help strengthen the validity of an estimated treatment effect.
We believe these methods have the potential to expand the scope of secondary analysis of clinical data to domains where structured EHR data is limited.
arXiv Detail & Related papers (2023-06-09T16:06:02Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- P-Transformer: A Prompt-based Multimodal Transformer Architecture For Medical Tabular Data [2.6487114372147182]
We propose P-Transformer, a Prompt-based multimodal Transformer architecture designed specifically for medical tabular data.
The framework efficiently encodes diverse modalities from both structured and unstructured data into a harmonized language semantic space.
P-Transformer improved predictability over state-of-the-art (SOTA) baselines by 10.9%/11.0% on RMSE/MAE, 0.5%/2.2% on RMSE/MAE, and 1.6%/0.8% on BACC/AUROC.
arXiv Detail & Related papers (2023-03-30T14:25:44Z)
- Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records.
We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data.
We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
- Estimating Redundancy in Clinical Text [6.245180523143739]
Clinicians populate new documents by duplicating existing notes, then updating accordingly.
Quantifying information redundancy can play an essential role in evaluating innovations that operate on clinical narratives.
We present and evaluate two strategies to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model.
arXiv Detail & Related papers (2021-05-25T11:01:45Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.