Segmented Harmonic Loss: Handling Class-Imbalanced Multi-Label Clinical
Data for Medical Coding with Large Language Models
- URL: http://arxiv.org/abs/2310.04595v1
- Date: Fri, 6 Oct 2023 21:20:28 GMT
- Authors: Surjya Ray, Pratik Mehta, Hongen Zhang, Ada Chaman, Jian Wang,
Chung-Jen Ho, Michael Chiou, Tashfeen Suleman
- Abstract summary: We evaluate the impact of Large Language Models (LLMs) on medical coding on real-life noisy data.
We develop Segmented Harmonic Loss, a new loss function to address the extreme class imbalance that we found to prevail in most medical data in a multi-label scenario.
Our experimental results show that when trained with the proposed loss, the LLMs achieve significant performance gains even on noisy long-tailed datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The precipitous rise and adoption of Large Language Models (LLMs) have
shattered expectations with the fastest adoption rate of any consumer-facing
technology in history. Healthcare, a field that traditionally uses NLP
techniques, was bound to be affected by this meteoric rise. In this paper, we
gauge the extent of the impact by evaluating the performance of LLMs for the
task of medical coding on real-life noisy data. We conducted several
experiments on MIMIC III and IV datasets with encoder-based LLMs, such as BERT.
Furthermore, we developed Segmented Harmonic Loss, a new loss function to
address the extreme class imbalance that we found to prevail in most medical
data in a multi-label scenario by segmenting and decoupling co-occurring
classes of the dataset with a new segmentation algorithm. We also devised a
technique based on embedding similarity to tackle noisy data. Our experimental
results show that when trained with the proposed loss, the LLMs achieve
significant performance gains even on noisy long-tailed datasets, outperforming
the F1 score of the state-of-the-art by over ten percentage points.
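The abstract names Segmented Harmonic Loss but does not give its formulation. As a minimal illustrative sketch only (an assumption, not the authors' definition): partition the label space into segments and minimize a harmonic-mean-style soft-F1 loss within each segment, averaging segments with equal weight so that head classes in one segment cannot drown out tail classes in another. The hand-supplied `segments` argument stands in for the paper's co-occurrence-based segmentation algorithm, which is not reproduced here.

```python
import numpy as np

def segmented_harmonic_loss(probs, targets, segments, eps=1e-8):
    """Illustrative segmented loss: within each class segment, compute a
    soft-F1 (harmonic mean of soft precision and recall) and minimize
    1 - soft-F1; segment losses are averaged with equal weight, so a
    rare-class segment counts as much as a frequent-class one.

    probs:    (n_samples, n_classes) predicted probabilities in [0, 1]
    targets:  (n_samples, n_classes) binary multi-label indicators
    segments: list of class-index lists partitioning the label space
    """
    losses = []
    for seg in segments:
        p, t = probs[:, seg], targets[:, seg]
        tp = (p * t).sum()                      # soft true positives
        precision = tp / (p.sum() + eps)
        recall = tp / (t.sum() + eps)
        f1 = 2 * precision * recall / (precision + recall + eps)
        losses.append(1.0 - f1)
    return float(np.mean(losses))
```

A gradient-friendly version would use the same arithmetic in an autodiff framework such as PyTorch so the loss can be backpropagated through the LLM head.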
Related papers
- Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z)
- Unveiling Incomplete Modality Brain Tumor Segmentation: Leveraging Masked Predicted Auto-Encoder and Divergence Learning
Brain tumor segmentation remains a significant challenge, particularly in the context of multi-modal magnetic resonance imaging (MRI).
We propose a novel strategy, which is called masked predicted pre-training, enabling robust feature learning from incomplete modality data.
In the fine-tuning phase, we utilize a knowledge distillation technique to align features between complete and missing modality data, simultaneously enhancing model robustness.
arXiv Detail & Related papers (2024-06-12T20:35:16Z)
- Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We also examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play significant roles.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Time Associated Meta Learning for Clinical Prediction
We propose a novel time associated meta learning (TAML) method to make effective predictions at multiple future time points.
To address the sparsity problem after task splitting, TAML employs a temporal information sharing strategy to augment the number of positive samples.
We demonstrate the effectiveness of TAML on multiple clinical datasets, where it consistently outperforms a range of strong baselines.
arXiv Detail & Related papers (2023-03-05T03:54:54Z)
- Long-term stable Electromyography classification using Canonical Correlation Analysis
Discrimination of hand gestures based on surface electromyography (sEMG) signals is a well-established approach for controlling prosthetic devices.
One of the most critical challenges is maintaining high EMG data classification performance across multiple days without retraining the decoding system.
Here we propose a novel statistical method that stabilizes EMG classification performance across multiple days for long-term control of prosthetic devices.
arXiv Detail & Related papers (2023-01-23T21:45:00Z)
- Machine Learning Performance Analysis to Predict Stroke Based on Imbalanced Medical Dataset
Cerebral stroke, the second most substantial cause of death universally, has been a primary public health concern over the last few years.
Medical datasets are frequently imbalanced in their class labels, which leads models to predict minority classes poorly.
In this paper, the potential risk factors for stroke are investigated.
Four distinctive approaches are applied to improve the classification of the minority class in the imbalanced stroke dataset.
arXiv Detail & Related papers (2022-11-14T17:36:46Z)
- MS Lesion Segmentation: Revisiting Weighting Mechanisms for Federated Learning
Federated learning (FL) has been widely employed for medical image analysis, but its performance is limited for multiple sclerosis (MS) lesion segmentation tasks.
We propose the first FL MS lesion segmentation framework via two effective re-weighting mechanisms.
arXiv Detail & Related papers (2022-05-03T14:06:03Z)
- Towards Robust Partially Supervised Multi-Structure Medical Image Segmentation on Small-Scale Data
We propose Vicinal Labels Under Uncertainty (VLUU) to bridge the methodological gaps in partially supervised learning (PSL) under data scarcity.
Motivated by multi-task learning and vicinal risk minimization, VLUU transforms the partially supervised problem into a fully supervised problem by generating vicinal labels.
Our research suggests a new research direction in label-efficient deep learning with partial supervision.
arXiv Detail & Related papers (2020-11-28T16:31:00Z)
- Combined Cleaning and Resampling Algorithm for Multi-Class Imbalanced Data with Label Noise
In this paper, we propose a novel oversampling technique, a Multi-Class Combined Cleaning and Resampling algorithm.
The proposed method uses an energy-based approach to model the regions suitable for oversampling, making it less affected by small disjuncts and outliers than SMOTE.
This is combined with a simultaneous cleaning operation that reduces the effect of overlapping class distributions on the performance of the learning algorithms.
arXiv Detail & Related papers (2020-04-07T13:59:35Z)
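The entry above gives only the outline of the combined cleaning and resampling algorithm; its energy-based region modeling is not detailed here. As a loose sketch of the general cleaning-plus-resampling pattern it describes (not the paper's algorithm), the following pairs a nearest-neighbor overlap cleanup with SMOTE-style interpolation for a binary problem. The function name, the assumption that label 1 marks the minority class, and all parameters are illustrative.

```python
import numpy as np

def clean_and_oversample(X, y, n_new, k=3, rng=None):
    """Sketch of combined cleaning and resampling (binary labels; 1 = minority).

    Cleaning: drop majority points whose nearest neighbor is a minority
    point (reduces class overlap, in the spirit of Tomek-link removal).
    Resampling: create n_new synthetic minority points by interpolating
    between a minority point and one of its k nearest minority neighbors
    (SMOTE-style).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    minority, majority = X[y == 1], X[y == 0]

    # cleaning: keep a majority point only if some other majority point
    # is at least as close to it as the nearest minority point
    d_to_min = np.linalg.norm(majority[:, None] - minority[None, :], axis=2).min(axis=1)
    d_maj = np.linalg.norm(majority[:, None] - majority[None, :], axis=2)
    np.fill_diagonal(d_maj, np.inf)
    majority = majority[d_maj.min(axis=1) <= d_to_min]

    # resampling: interpolate between nearby minority points
    d_min = np.linalg.norm(minority[:, None] - minority[None, :], axis=2)
    np.fill_diagonal(d_min, np.inf)
    neighbors = np.argsort(d_min, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = neighbors[i, rng.integers(k)]
        lam = rng.random()
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))

    X_out = np.vstack([majority, minority, np.array(synthetic)])
    y_out = np.concatenate([np.zeros(len(majority)),
                            np.ones(len(minority) + n_new)])
    return X_out, y_out
```

Majority points sitting inside the minority region are removed before oversampling, so synthetic samples are not generated around overlapping or noisy labels.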
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.