Efficient human-in-loop deep learning model training with iterative
refinement and statistical result validation
- URL: http://arxiv.org/abs/2304.00990v1
- Date: Mon, 3 Apr 2023 13:56:01 GMT
- Title: Efficient human-in-loop deep learning model training with iterative
refinement and statistical result validation
- Authors: Manuel Zahn, Douglas P. Perrin
- Abstract summary: We demonstrate a method for creating segmentations, a necessary part of data cleaning for ultrasound imaging machine learning pipelines.
We propose a four-step method to leverage automatically generated training data and fast human visual checks to improve model accuracy while keeping the time/effort and cost low.
The method is demonstrated on a cardiac ultrasound segmentation task, removing background data, including static PHI (protected health information).
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Annotation and labeling of images are some of the biggest challenges in
applying deep learning to medical data. Current processes are time and
cost-intensive and, therefore, a limiting factor for the wide adoption of the
technology. Additionally, validating that measured performance improvements are
statistically significant is important for selecting the best model. In this
paper, we demonstrate a method for creating segmentations, a necessary part of
data cleaning for ultrasound imaging machine learning pipelines. We propose a
four-step method to leverage automatically generated training data and fast
human visual checks to improve model accuracy while keeping the time/effort and
cost low. We also show how running each experiment multiple times enables
statistical analysis of the results. Poor-quality automated ground truth data and
quick visual inspections efficiently train an initial base model, which is
refined using a small set of more expensive human-generated ground truth data.
The method is demonstrated on a cardiac ultrasound segmentation task, removing
background data, including static PHI. Significance is shown by running the
experiments multiple times and applying Student's t-test to the performance
distributions. The initial 92% segmentation accuracy of a simple thresholding
algorithm was improved to 98%. The performance of models trained on data from
complicated labeling algorithms can be matched or beaten by pre-training with
the poorer-performing algorithms and a small quantity of high-quality data.
Introducing statistical significance analysis for deep learning models helps
validate the measured performance improvements. The method offers a fast,
cost-effective approach to achieving high-accuracy models while minimizing
the effort of acquiring high-quality training data.
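The low-cost initial labels come from a simple thresholding algorithm. The paper's exact algorithm is not reproduced here; the following is a minimal illustrative sketch, assuming grayscale ultrasound clips as NumPy arrays of shape (T, H, W), of how a brightness threshold and a static-pixel check for burned-in overlays such as PHI text might be combined:

```python
import numpy as np

def threshold_mask(frame: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Mark pixels brighter than `thresh` as foreground.
    `thresh` is an illustrative value, not taken from the paper."""
    return (frame > thresh).astype(np.uint8)

def moving_pixel_mask(frames: np.ndarray) -> np.ndarray:
    """Mark pixels whose intensity varies across a clip of shape (T, H, W).
    Pixels that never change are likely burned-in overlays such as
    static PHI text and are treated as background."""
    return (frames.std(axis=0) > 0).astype(np.uint8)

# Foreground = bright in a given frame AND temporally varying:
# mask = threshold_mask(frames[0]) & moving_pixel_mask(frames)
```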
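The validation step repeats each training experiment several times and compares the resulting performance distributions with Student's t-test. A minimal sketch with SciPy follows; the per-run scores are hypothetical placeholders, not numbers from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-run segmentation scores from five independent
# training runs of each model (the paper reports 92% -> 98% overall
# but does not publish per-run arrays).
baseline = np.array([0.921, 0.918, 0.925, 0.919, 0.922])
refined = np.array([0.979, 0.981, 0.977, 0.982, 0.980])

# Two-sample Student's t-test on the two performance distributions.
t_stat, p_value = stats.ttest_ind(refined, baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("Improvement is statistically significant at alpha = 0.05.")
```

With only a handful of runs per model, Welch's variant (`equal_var=False`) is a common, safer default; the plain Student form is shown because it is what the abstract names.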
Related papers
- SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training [12.745160748376794]
We propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness.
Central to our approach is the concept of "data commonness", a metric we introduce to quantify the degree of duplication.
Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps. (A toy sketch of this commonness-based reweighting appears after this list.)
arXiv Detail & Related papers (2024-07-09T08:26:39Z)
- CE-SSL: Computation-Efficient Semi-Supervised Learning for ECG-based Cardiovascular Diseases Detection [16.34314710823127]
We propose CE-SSL, a computation-efficient semi-supervised learning paradigm for robust cardiovascular disease (CVD) detection using ECG.
It enables a robust adaptation of pre-trained models on downstream datasets with limited supervision and high computational efficiency.
CE-SSL not only outperforms the state-of-the-art methods in multi-label CVD detection but also requires a smaller GPU memory footprint, less training time, and less parameter storage space.
arXiv Detail & Related papers (2024-06-20T14:45:13Z)
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
This text quality metric provides a framework for identifying and eliminating low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z)
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
Incremental self-training (IST) is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbones, effectively improving recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (A deliberately simplified sketch of the class-wise distance appears after this list.)
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- Data Efficient Contrastive Learning in Histopathology using Active Sampling [0.0]
Deep learning algorithms can provide robust quantitative analysis in digital pathology.
These algorithms require large amounts of annotated training data.
Self-supervised methods have been proposed to learn features using ad-hoc pretext tasks.
We propose a new method for actively sampling informative members from the training set using a small proxy network.
arXiv Detail & Related papers (2023-03-28T18:51:22Z)
- Training Efficiency and Robustness in Deep Learning [2.6451769337566406]
We study approaches to improve the training efficiency and robustness of deep learning models.
We find that prioritizing learning on more informative training data increases convergence speed and improves generalization performance on test data.
We show that a redundancy-aware modification to the sampling of training data improves training speed, and we develop an efficient method for estimating the diversity of the training signal.
arXiv Detail & Related papers (2021-12-02T17:11:33Z)
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
Since training on the enlarged dataset is costly, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
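As referenced above, the SoftDedup entry describes reducing the sampling weight of highly duplicated data rather than deleting it. A toy sketch of that idea follows; the exact-match commonness estimate and all names here are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter
import numpy as np

def sampling_weights(docs: list[str]) -> np.ndarray:
    """Downweight common (duplicated) documents instead of deleting
    them: weight ~ 1 / commonness. Commonness is crudely approximated
    by exact-match frequency after whitespace/case normalization."""
    keys = [" ".join(d.lower().split()) for d in docs]
    counts = Counter(keys)
    weights = np.array([1.0 / counts[k] for k in keys])
    return weights / weights.sum()  # normalized sampling distribution

docs = ["the cat sat", "The cat  sat", "a unique sentence"]
print(sampling_weights(docs))  # duplicates share a reduced weight
```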
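Likewise, the LAVA entry scores a training set by a class-wise Wasserstein distance to the validation set. A deliberately simplified 1-D sketch of the concept (the real method operates on a richer feature/label geometry):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def classwise_w1(x_tr, y_tr, x_va, y_va) -> float:
    """Mean 1-D Wasserstein-1 distance between training and validation
    feature distributions, computed class by class; lower values
    suggest the training set better matches the validation data."""
    classes = np.intersect1d(np.unique(y_tr), np.unique(y_va))
    return float(np.mean([
        wasserstein_distance(x_tr[y_tr == c], x_va[y_va == c])
        for c in classes
    ]))

rng = np.random.default_rng(0)
y_tr = rng.integers(0, 2, 500)
y_va = rng.integers(0, 2, 500)
x_va = rng.normal(loc=y_va)        # validation features track labels
x_good = rng.normal(loc=y_tr)      # clean training features
x_bad = rng.normal(loc=1 - y_tr)   # label-flipped training features
print(classwise_w1(x_good, y_tr, x_va, y_va))  # small distance
print(classwise_w1(x_bad, y_tr, x_va, y_va))   # noticeably larger
```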