Segmented Harmonic Loss: Handling Class-Imbalanced Multi-Label Clinical
Data for Medical Coding with Large Language Models
- URL: http://arxiv.org/abs/2310.04595v1
- Date: Fri, 6 Oct 2023 21:20:28 GMT
- Authors: Surjya Ray, Pratik Mehta, Hongen Zhang, Ada Chaman, Jian Wang,
Chung-Jen Ho, Michael Chiou, Tashfeen Suleman
- Abstract summary: We evaluate the impact of Large Language Models (LLMs) on medical coding on real-life noisy data.
We develop Segmented Harmonic Loss, a new loss function to address the extreme class imbalance that we found to prevail in most medical data in a multi-label scenario.
Our experimental results show that when trained with the proposed loss, the LLMs achieve significant performance gains even on noisy long-tailed datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The precipitous rise and adoption of Large Language Models (LLMs) have
shattered expectations with the fastest adoption rate of any consumer-facing
technology in history. Healthcare, a field that traditionally uses NLP
techniques, was bound to be affected by this meteoric rise. In this paper, we
gauge the extent of the impact by evaluating the performance of LLMs for the
task of medical coding on real-life noisy data. We conducted several
experiments on MIMIC III and IV datasets with encoder-based LLMs, such as BERT.
Furthermore, we developed Segmented Harmonic Loss, a new loss function to
address the extreme class imbalance that we found to prevail in most medical
data in a multi-label scenario by segmenting and decoupling co-occurring
classes of the dataset with a new segmentation algorithm. We also devised a
technique based on embedding similarity to tackle noisy data. Our experimental
results show that when trained with the proposed loss, the LLMs achieve
significant performance gains even on noisy long-tailed datasets, outperforming
the F1 score of the state-of-the-art by over ten percentage points.
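The abstract names Segmented Harmonic Loss but does not give its formulation. As a minimal illustrative sketch only (an assumption, not the authors' definition): partition the label space into segments and minimize a harmonic-mean-style soft-F1 loss within each segment, averaging segments with equal weight so that head classes in one segment cannot drown out tail classes in another. The hand-supplied `segments` argument stands in for the paper's co-occurrence-based segmentation algorithm, which is not reproduced here.

```python
import numpy as np

def segmented_harmonic_loss(probs, targets, segments, eps=1e-8):
    """Illustrative segmented loss: within each class segment, compute a
    soft-F1 (harmonic mean of soft precision and recall) and minimize
    1 - soft-F1; segment losses are averaged with equal weight, so a
    rare-class segment counts as much as a frequent-class one.

    probs:    (n_samples, n_classes) predicted probabilities in [0, 1]
    targets:  (n_samples, n_classes) binary multi-label indicators
    segments: list of class-index lists partitioning the label space
    """
    losses = []
    for seg in segments:
        p, t = probs[:, seg], targets[:, seg]
        tp = (p * t).sum()                      # soft true positives
        precision = tp / (p.sum() + eps)
        recall = tp / (t.sum() + eps)
        f1 = 2 * precision * recall / (precision + recall + eps)
        losses.append(1.0 - f1)
    return float(np.mean(losses))
```

A gradient-friendly version would use the same arithmetic in an autodiff framework such as PyTorch so the loss can be backpropagated through the LLM head.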
Related papers
- Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z)
- Unveiling Incomplete Modality Brain Tumor Segmentation: Leveraging Masked Predicted Auto-Encoder and Divergence Learning
Brain tumor segmentation remains a significant challenge, particularly in the context of multi-modal magnetic resonance imaging (MRI).
We propose a novel strategy, which is called masked predicted pre-training, enabling robust feature learning from incomplete modality data.
In the fine-tuning phase, we utilize a knowledge distillation technique to align features between complete and missing modality data, simultaneously enhancing model robustness.
arXiv Detail & Related papers (2024-06-12T20:35:16Z)
- Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We also examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play significant roles.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Time Associated Meta Learning for Clinical Prediction
We propose a novel time associated meta learning (TAML) method to make effective predictions at multiple future time points.
To address the sparsity problem after task splitting, TAML employs a temporal information sharing strategy to augment the number of positive samples.
We demonstrate the effectiveness of TAML on multiple clinical datasets, where it consistently outperforms a range of strong baselines.
arXiv Detail & Related papers (2023-03-05T03:54:54Z)
- Long-term stable Electromyography classification using Canonical Correlation Analysis
Discrimination of hand gestures based on surface electromyography (sEMG) signals is a well-established approach for controlling prosthetic devices.
One of the most critical challenges is maintaining high EMG data classification performance across multiple days without retraining the decoding system.
Here we propose a novel statistical method that stabilizes EMG classification performance across multiple days for long-term control of prosthetic devices.
arXiv Detail & Related papers (2023-01-23T21:45:00Z)
- Machine Learning Performance Analysis to Predict Stroke Based on Imbalanced Medical Dataset
Cerebral stroke, the second most substantial cause of death universally, has been a primary public health concern over the last few years.
Medical datasets are frequently imbalanced in their class labels, which leads models to predict minority classes poorly.
In this paper, the potential risk factors for stroke are investigated.
Four distinctive approaches are applied to improve the classification of the minority class in the imbalanced stroke dataset.
arXiv Detail & Related papers (2022-11-14T17:36:46Z)
- MS Lesion Segmentation: Revisiting Weighting Mechanisms for Federated Learning
Federated learning (FL) has been widely employed for medical image analysis, but its performance is limited for multiple sclerosis (MS) lesion segmentation tasks.
We propose the first FL MS lesion segmentation framework via two effective re-weighting mechanisms.
arXiv Detail & Related papers (2022-05-03T14:06:03Z)
- Towards Robust Partially Supervised Multi-Structure Medical Image Segmentation on Small-Scale Data
We propose Vicinal Labels Under Uncertainty (VLUU) to bridge the methodological gaps in partially supervised learning (PSL) under data scarcity.
Motivated by multi-task learning and vicinal risk minimization, VLUU transforms the partially supervised problem into a fully supervised problem by generating vicinal labels.
Our research suggests a new research direction in label-efficient deep learning with partial supervision.
arXiv Detail & Related papers (2020-11-28T16:31:00Z)
- Combined Cleaning and Resampling Algorithm for Multi-Class Imbalanced Data with Label Noise
In this paper, we propose a novel oversampling technique, a Multi-Class Combined Cleaning and Resampling algorithm.
The proposed method uses an energy-based approach to model the regions suitable for oversampling, making it less affected by small disjuncts and outliers than SMOTE.
This is combined with a simultaneous cleaning operation that reduces the effect of overlapping class distributions on the performance of the learning algorithms.
arXiv Detail & Related papers (2020-04-07T13:59:35Z)
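The entry above gives only the outline of the combined cleaning and resampling algorithm; its energy-based region modeling is not detailed here. As a loose sketch of the general cleaning-plus-resampling pattern it describes (not the paper's algorithm), the following pairs a nearest-neighbor overlap cleanup with SMOTE-style interpolation for a binary problem. The function name, the assumption that label 1 marks the minority class, and all parameters are illustrative.

```python
import numpy as np

def clean_and_oversample(X, y, n_new, k=3, rng=None):
    """Sketch of combined cleaning and resampling (binary labels; 1 = minority).

    Cleaning: drop majority points whose nearest neighbor is a minority
    point (reduces class overlap, in the spirit of Tomek-link removal).
    Resampling: create n_new synthetic minority points by interpolating
    between a minority point and one of its k nearest minority neighbors
    (SMOTE-style).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    minority, majority = X[y == 1], X[y == 0]

    # cleaning: keep a majority point only if some other majority point
    # is at least as close to it as the nearest minority point
    d_to_min = np.linalg.norm(majority[:, None] - minority[None, :], axis=2).min(axis=1)
    d_maj = np.linalg.norm(majority[:, None] - majority[None, :], axis=2)
    np.fill_diagonal(d_maj, np.inf)
    majority = majority[d_maj.min(axis=1) <= d_to_min]

    # resampling: interpolate between nearby minority points
    d_min = np.linalg.norm(minority[:, None] - minority[None, :], axis=2)
    np.fill_diagonal(d_min, np.inf)
    neighbors = np.argsort(d_min, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = neighbors[i, rng.integers(k)]
        lam = rng.random()
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))

    X_out = np.vstack([majority, minority, np.array(synthetic)])
    y_out = np.concatenate([np.zeros(len(majority)),
                            np.ones(len(minority) + n_new)])
    return X_out, y_out
```

Majority points sitting inside the minority region are removed before oversampling, so synthetic samples are not generated around overlapping or noisy labels.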
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.