Automating Weak Label Generation for Data Programming with Clinicians in the Loop
- URL: http://arxiv.org/abs/2407.07982v1
- Date: Wed, 10 Jul 2024 18:29:22 GMT
- Title: Automating Weak Label Generation for Data Programming with Clinicians in the Loop
- Authors: Jean Park, Sydney Pugh, Kaustubh Sridhar, Mengyu Liu, Navish Yarna, Ramneet Kaur, Souradeep Dutta, Elena Bernardis, Oleg Sokolsky, Insup Lee
- Abstract summary: We propose an algorithm that queries an expert for labels of a few representative samples of the dataset.
The labels assigned by the expert induce a labeling on the full dataset, thereby generating weak labels to be used in the data programming pipeline.
In our medical time series case study, labeling a subset of 50 to 130 out of 3,265 samples showed 17-28% improvement in accuracy and 13-28% improvement in F1 over the baseline.
- Score: 5.729255216041754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Deep Neural Networks (DNNs) are often data hungry and need high-quality labeled data in copious amounts for learning to converge. This is a challenge in the field of medicine, since high-quality labeled data is often scarce. Data programming has been a ray of hope in this regard, since it allows us to label unlabeled data using multiple weak labeling functions. Such functions are often supplied by a domain expert. Data programming can combine multiple weak labeling functions and suggest labels better than simple majority voting over the different functions. However, it is not straightforward to express such weak labeling functions, especially in high-dimensional settings such as images and time-series data. In this paper, we propose a way to bypass this issue using distance functions: in high-dimensional spaces, it is easier to find meaningful distance metrics that generalize across different labeling tasks. We propose an algorithm that queries an expert for labels of a few representative samples of the dataset. These samples are carefully chosen by the algorithm to capture the distribution of the dataset. The labels assigned by the expert on the representative subset induce a labeling on the full dataset, thereby generating weak labels to be used in the data programming pipeline. In our medical time series case study, labeling a subset of 50 to 130 out of 3,265 samples showed a 17-28% improvement in accuracy and a 13-28% improvement in F1 over the baseline using clinician-defined labeling functions. In our medical image case study, labeling a subset of about 50 to 120 images from 6,293 unlabeled medical images showed a significant improvement over the baseline method, Snuba, with an increase of approximately 5-15% in accuracy and 12-19% in F1 score.
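The core loop described in the abstract is easy to sketch: pick a few samples that cover the dataset under a suitable distance metric, have a clinician label them, and let each expert label induce weak labels on the remaining data. Below is a minimal Python illustration, assuming farthest-point sampling for the query set and nearest-neighbor induction; both rules, and the raw Euclidean metric, are illustrative stand-ins for the paper's actual selection procedure and its domain-specific distance functions.

```python
import numpy as np

def farthest_point_sampling(X, n_queries, rng=None):
    """Greedily pick n_queries samples that spread out over the dataset
    (an illustrative stand-in for the paper's representative-sample
    selection)."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = [int(rng.integers(len(X)))]          # random seed point
    d = np.linalg.norm(X - X[idx[0]], axis=1)  # distance to chosen set
    for _ in range(n_queries - 1):
        idx.append(int(d.argmax()))            # farthest from chosen set
        d = np.minimum(d, np.linalg.norm(X - X[idx[-1]], axis=1))
    return np.array(idx)

def induce_weak_labels(X, query_idx, expert_labels):
    """Every sample inherits the expert label of its nearest queried
    sample, yielding one weak labeling over the full dataset."""
    d = np.linalg.norm(X[:, None, :] - X[query_idx][None, :, :], axis=2)
    return expert_labels[d.argmin(axis=1)]

# Toy usage: 3,265 samples in some feature space, ~50 expert queries.
X = np.random.default_rng(1).normal(size=(3265, 16))
queries = farthest_point_sampling(X, n_queries=50)
expert = np.random.default_rng(2).integers(0, 2, size=50)  # stand-in labels
weak = induce_weak_labels(X, queries, expert)              # shape (3265,)
```

Several such induced labelings (e.g., one per distance metric) would then act as the weak labeling functions consumed by the data-programming label model.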
Related papers
- You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling [60.27812493442062]
We show the importance of investigating labeled data quality to improve any pseudo-labeling method.
Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling.
We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world datasets.
arXiv Detail & Related papers (2024-06-19T17:58:40Z)
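The DIPS entry above centers on characterizing labeled-data quality before pseudo-labeling. DIPS's exact scores are not given here; as a rough illustration, one common data-centric recipe tracks per-sample confidence and variability across training checkpoints and filters out likely-mislabeled points:

```python
import numpy as np

def characterize(checkpoint_probs, labels):
    """checkpoint_probs: (n_checkpoints, n_samples, n_classes) softmax
    outputs saved during training; labels: (n_samples,) int labels.
    Returns per-sample confidence and variability."""
    p_true = checkpoint_probs[:, np.arange(len(labels)), labels]
    confidence = p_true.mean(axis=0)   # consistently high -> likely clean
    variability = p_true.std(axis=0)   # high -> ambiguous / hard sample
    return confidence, variability

def select_clean(confidence, variability, conf_min=0.7, var_max=0.2):
    """Keep samples the model learned consistently before letting the
    labeled set drive pseudo-labeling; thresholds are illustrative."""
    return np.where((confidence >= conf_min) & (variability <= var_max))[0]
```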
- Leveraging Fixed and Dynamic Pseudo-labels for Semi-supervised Medical Image Segmentation [7.9449756510822915]
Semi-supervised medical image segmentation has gained growing interest due to its ability to utilize unannotated data.
The current state-of-the-art methods mostly rely on pseudo-labeling within a co-training framework.
We propose a novel approach where multiple pseudo-labels for the same unannotated image are used to learn from the unlabeled data.
arXiv Detail & Related papers (2024-05-12T11:30:01Z)
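The entry above turns on using several pseudo-labels for one unannotated image. A minimal sketch of that fusion step, assuming one fixed pseudo-label from a frozen initial model and dynamic ones from co-trained peers (the weighting scheme is an illustrative assumption, not the paper's):

```python
import numpy as np

def fuse_pseudo_labels(fixed_probs, dynamic_probs_list, w_fixed=0.5):
    """Blend one fixed pseudo-label with several dynamic ones for the
    same unannotated image; the 50/50 weighting is illustrative."""
    dynamic = np.mean(dynamic_probs_list, axis=0)   # average the peers
    fused = w_fixed * fixed_probs + (1.0 - w_fixed) * dynamic
    return fused.argmax(axis=-1), fused.max(axis=-1)  # labels + confidence

# Toy usage on per-pixel class probabilities for one image:
rng = np.random.default_rng(0)
fixed = rng.dirichlet(np.ones(3), size=(8, 8))          # (H, W, C)
peers = [rng.dirichlet(np.ones(3), size=(8, 8)) for _ in range(2)]
labels, conf = fuse_pseudo_labels(fixed, peers)
```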
- INSITE: labelling medical images using submodular functions and semi-supervised data programming [19.88996560236578]
The need for large amounts of labeled data to train deep models creates an implementation bottleneck in resource-constrained settings.
We apply informed subset selection to identify a small number of most representative or diverse images from a huge pool of unlabelled data.
The newly annotated images are then used as exemplars to develop several data programming-driven labeling functions.
arXiv Detail & Related papers (2024-02-11T12:02:00Z)
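The INSITE entry above relies on informed subset selection. A standard way to realize "most representative" selection is greedy maximization of the submodular facility-location function; the sketch below uses that objective as an illustrative stand-in for whatever INSITE optimizes:

```python
import numpy as np

def facility_location_select(similarity, k):
    """Greedy maximization of F(S) = sum_i max_{j in S} sim(i, j),
    a classic submodular objective for picking representative
    exemplars. similarity: (n, n) pairwise similarity matrix."""
    n = similarity.shape[0]
    selected, best = [], np.zeros(n)
    for _ in range(k):
        # marginal gain of adding each candidate column j
        gains = np.maximum(similarity, best[:, None]).sum(axis=0) - best.sum()
        j = int(gains.argmax())
        selected.append(j)
        best = np.maximum(best, similarity[:, j])  # update coverage
    return selected
```

The greedy rule enjoys the usual (1 - 1/e) approximation guarantee for monotone submodular functions, which is why it is a common choice for exemplar selection.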
- Adaptive Anchor Label Propagation for Transductive Few-Shot Learning [18.29463308334406]
Few-shot learning addresses the issue of classifying images using limited labeled data.
We propose a novel algorithm that adapts the feature embeddings of the labeled data by minimizing a differentiable loss function.
Our algorithm outperforms the standard label propagation algorithm by as much as 7% and 2% in the 1-shot and 5-shot settings respectively.
arXiv Detail & Related papers (2023-10-30T20:29:31Z)
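For context on the entry above: classic label propagation spreads one-hot seed labels over a similarity graph until convergence. The adaptive-anchor method additionally refines the feature embeddings with a differentiable loss before propagating, which this Zhou et al.-style sketch omits:

```python
import numpy as np

def label_propagation(X, y_labeled, labeled_idx, n_classes,
                      sigma=1.0, alpha=0.9, n_iters=50):
    """Propagate seed labels over a Gaussian-kernel similarity graph;
    sigma, alpha and n_iters are illustrative defaults."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    Dinv = 1.0 / np.sqrt(W.sum(1))
    S = Dinv[:, None] * W * Dinv[None, :]      # normalized affinity
    Y = np.zeros((len(X), n_classes))
    Y[labeled_idx, y_labeled] = 1.0            # one-hot seeds
    F = Y.copy()
    for _ in range(n_iters):
        F = alpha * S @ F + (1 - alpha) * Y    # propagate, clamp seeds
    return F.argmax(1)
```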
- ScarceNet: Animal Pose Estimation with Scarce Annotations [74.48263583706712]
ScarceNet is a pseudo-label-based approach that generates artificial labels for unlabeled images.
We evaluate our approach on the challenging AP-10K dataset, where our approach outperforms existing semi-supervised approaches by a large margin.
arXiv Detail & Related papers (2023-03-27T09:15:53Z)
- Self-Supervised Learning as a Means To Reduce the Need for Labeled Data in Medical Image Analysis [64.4093648042484]
We use a dataset of chest X-ray images with bounding box labels for 13 different classes of anomalies.
We show that it is possible to achieve similar performance to a fully supervised model in terms of mean average precision and accuracy with only 60% of the labeled data.
arXiv Detail & Related papers (2022-06-01T09:20:30Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
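A minimal sketch of the decoupling idea in the Debiased entry above: one head generates pseudo labels and receives no gradient from the pseudo-label loss, while a second head trains on them. Layer sizes, the confidence threshold, and the loss wiring are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """One head only *generates* pseudo labels, the other is *trained*
    on them, reducing the confirmation bias of a single head feeding
    itself. Dimensions are placeholders."""
    def __init__(self, in_dim=128, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.head_generate = nn.Linear(256, n_classes)  # makes pseudo labels
        self.head_learn = nn.Linear(256, n_classes)     # trained on them

    def forward(self, x):
        z = self.backbone(x)
        return self.head_generate(z), self.head_learn(z)

def pseudo_label_loss(model, x_unlabeled, threshold=0.95):
    """The generating head is queried without gradients, so this loss
    never updates it (threshold is illustrative)."""
    logits_gen, logits_learn = model(x_unlabeled)
    with torch.no_grad():
        conf, pseudo = logits_gen.softmax(-1).max(-1)
    mask = conf >= threshold                  # keep confident labels only
    if not mask.any():
        return logits_learn.sum() * 0.0       # no confident pseudo labels
    return nn.functional.cross_entropy(logits_learn[mask], pseudo[mask])
```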
- Unsupervised Selective Labeling for More Effective Semi-Supervised Learning [46.414510522978425]
Unsupervised selective labeling consistently improves SSL methods over state-of-the-art active learning given labeled data.
Our work sets a new standard for practical and efficient SSL.
arXiv Detail & Related papers (2021-10-06T18:25:50Z)
- A Study on the Autoregressive and non-Autoregressive Multi-label Learning [77.11075863067131]
We propose a self-attention-based variational encoder model to extract the label-label and label-feature dependencies jointly.
Our model can therefore be used to predict all labels in parallel while still including both label-label and label-feature dependencies.
arXiv Detail & Related papers (2020-12-03T05:41:44Z)
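To make the entry above concrete: a non-autoregressive multi-label head can score every label in one pass by letting learned label queries attend to each other (label-label dependencies) and to the input features (label-feature dependencies). The sketch below omits the paper's variational component and uses placeholder sizes:

```python
import torch
import torch.nn as nn

class ParallelMultiLabelHead(nn.Module):
    """Non-autoregressive multi-label scoring: all labels are predicted
    in parallel from learned label queries (illustrative sketch)."""
    def __init__(self, n_labels=20, d=64):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(n_labels, d))
        self.label_label = nn.MultiheadAttention(d, 4, batch_first=True)
        self.label_feature = nn.MultiheadAttention(d, 4, batch_first=True)
        self.score = nn.Linear(d, 1)

    def forward(self, feats):                       # feats: (B, T, d)
        q = self.label_queries.expand(feats.size(0), -1, -1)
        q, _ = self.label_label(q, q, q)            # label-label deps
        q, _ = self.label_feature(q, feats, feats)  # label-feature deps
        return self.score(q).squeeze(-1)            # (B, n_labels) logits
```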
- Analysis of label noise in graph-based semi-supervised learning [2.4366811507669124]
In machine learning, one must acquire labels to help supervise a model that will be able to generalize to unseen data.
It is often the case that most of our data is unlabeled.
Semi-supervised learning (SSL) alleviates that by making strong assumptions about the relation between the labels and the input data distribution.
arXiv Detail & Related papers (2020-09-27T22:13:20Z)
- 3D medical image segmentation with labeled and unlabeled data using autoencoders at the example of liver segmentation in CT images [58.720142291102135]
This work investigates the potential of autoencoder-extracted features to improve segmentation with a convolutional neural network.
A convolutional autoencoder was used to extract features from unlabeled data and a multi-scale, fully convolutional CNN was used to perform the target task of 3D liver segmentation in CT images.
arXiv Detail & Related papers (2020-03-17T20:20:43Z)
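As a rough illustration of the last entry, the sketch below pretrains a toy 3D convolutional autoencoder on unlabeled volumes; its encoder features would then be supplied to the supervised segmentation CNN. Architecture and training details are placeholders, not the paper's:

```python
import torch
import torch.nn as nn

class TinyConvAE(nn.Module):
    """Toy 3D convolutional autoencoder; after unsupervised training
    with a reconstruction loss, the encoder provides features for the
    segmentation network. Channel counts are placeholders."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Unsupervised pretraining on unlabeled volumes (illustrative loop):
ae = TinyConvAE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
vol = torch.randn(2, 1, 32, 32, 32)          # stand-in CT patches
loss = nn.functional.mse_loss(ae(vol), vol)  # reconstruction objective
loss.backward()
opt.step()
# ae.encoder(vol) would then feed the supervised segmentation CNN.
```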
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all of its information) and is not responsible for any consequences of its use.