On the diminishing return of labeling clinical reports
- URL: http://arxiv.org/abs/2010.14587v1
- Date: Tue, 27 Oct 2020 19:51:04 GMT
- Title: On the diminishing return of labeling clinical reports
- Authors: Jean-Baptiste Lamare, Tobi Olatunji, Li Yao
- Abstract summary: We show that performant medical NLP models may be obtained with a small amount of labeled data.
We show quantitatively the effect of training data size on a fixed test set composed of two of the largest public chest x-ray radiology report datasets.
- Score: 2.1431637042179683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ample evidence suggests that, for natural language processing (NLP)
problems in non-medical domains, better machine learning models can be
obtained steadily by training on increasingly large datasets. Whether the same
holds true for medical NLP has not yet been thoroughly investigated. This work
shows that this is not always the case. We report the somewhat
counter-intuitive observation that performant medical NLP models may be
obtained with a small amount of labeled data, contrary to common belief, most
likely due to the domain specificity of the problem. We show quantitatively
the effect of training data size on a fixed test set composed of two of the
largest public chest x-ray radiology report datasets on the task of
abnormality classification. The trained models not only make use of the
training data efficiently, but also outperform the current state-of-the-art
rule-based systems by a significant margin.
Related papers
- Weakly supervised deep learning model with size constraint for prostate cancer detection in multiparametric MRI and generalization to unseen domains [0.90668179713299]
We show that the model achieves on-par performance with strong fully supervised baseline models.
We also observe a performance decrease for both fully supervised and weakly supervised models when tested on unseen data domains.
arXiv Detail & Related papers (2024-11-04T12:24:33Z)
- How Does Pruning Impact Long-Tailed Multi-Label Medical Image Classifiers? [49.35105290167996]
Pruning has emerged as a powerful technique for compressing deep neural networks, reducing memory usage and inference time without significantly affecting overall performance.
This work represents a first step toward understanding the impact of pruning on model behavior in deep long-tailed, multi-label medical image classification.
arXiv Detail & Related papers (2023-08-17T20:40:30Z)
- DIAGNOSE: Avoiding Out-of-distribution Data using Submodular Information Measures [13.492292022589918]
We propose Diagnose, a novel active learning framework that can jointly model similarity and dissimilarity.
Our experiments verify the superiority of Diagnose over the state-of-the-art AL methods across multiple domains of medical imaging.
arXiv Detail & Related papers (2022-10-04T11:07:48Z)
- Potential sources of dataset bias complicate investigation of underdiagnosis by machine learning algorithms [20.50071537200745]
Seyyed-Kalantari et al. find that models trained on three chest X-ray datasets yield disparities in false-positive rates.
The study concludes that the models exhibit and potentially even amplify systematic underdiagnosis.
arXiv Detail & Related papers (2022-01-19T20:51:38Z)
- Neural Medication Extraction: A Comparison of Recent Models in Supervised and Semi-supervised Learning Settings [0.751289645756884]
Drug prescriptions are essential information that must be encoded in electronic medical records.
This is why the medication extraction task has emerged.
We present an independent and comprehensive evaluation of state-of-the-art neural architectures on the I2B2 medical prescription extraction task.
arXiv Detail & Related papers (2021-10-19T19:23:38Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance guided stochastic gradient descent (IGSGD) method to train models that perform inference on inputs containing missing values without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- Relational Subsets Knowledge Distillation for Long-tailed Retinal Diseases Recognition [65.77962788209103]
We propose class subset learning by dividing the long-tailed data into multiple class subsets according to prior knowledge.
It forces the model to focus on learning subset-specific knowledge.
The proposed framework proved to be effective for the long-tailed retinal diseases recognition task.
arXiv Detail & Related papers (2021-04-22T13:39:33Z)
- A Hamiltonian Monte Carlo Model for Imputation and Augmentation of Healthcare Data [0.6719751155411076]
Missing values exist in nearly all clinical studies because data for a variable or question are not collected or not available.
Existing models usually do not consider privacy concerns or do not utilise the inherent correlations across multiple features to impute the missing values.
A Bayesian approach to impute missing values and create augmented samples in high-dimensional healthcare data is proposed in this work.
arXiv Detail & Related papers (2021-03-03T11:57:42Z)
- Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z)
- Deep Mining External Imperfect Data for Chest X-ray Disease Screening [57.40329813850719]
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
arXiv Detail & Related papers (2020-06-06T06:48:40Z)
- Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that, using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)