SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data
- URL: http://arxiv.org/abs/2601.09717v1
- Date: Thu, 25 Dec 2025 01:52:46 GMT
- Title: SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data
- Authors: Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu
- Abstract summary: This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We derived health-data classification and grading rules in accordance with GB/T 39725-2020.
- Score: 7.015777723337828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods for sensitivity classification of such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We derived health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity grading across diverse LLMs. On the MedDialog-CN benchmark, models yield robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate and enable re-identification when combined, while Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably classifies categories and grades sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at https://github.com/dommii1218/SALP-CG.
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context [82.32380418146656]
Health-ORSC-Bench is the first large-scale benchmark designed to measure Over-Refusal and Safe Completion quality in healthcare. Our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants.
arXiv Detail & Related papers (2026-01-25T01:28:52Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - Scaling behavior of large language models in emotional safety classification across sizes and tasks [0.0]
We investigate the scaling behavior of large language models (LLMs) on two key tasks: trinary classification of emotional safety and multi-label classification. We construct a novel dataset by merging several human-authored mental health datasets. We evaluate four LLaMA models (1B, 3B, 8B, 70B) across zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2025-09-02T20:53:03Z) - A Graph-Based Test-Harness for LLM Evaluation [0.8164433158925593]
We present a first known prototype of a dynamic, systematic benchmark of medical guidelines with 400+ questions. We transform the WHO IMCI handbook into a directed graph with 200+ nodes and generate questions that incorporate age-specific scenarios. We find models excel at symptom recognition but struggle with triaging severity, treatment protocols, and follow-up care.
arXiv Detail & Related papers (2025-08-28T14:10:59Z) - KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification [4.8342038441006805]
This research presents Knowledge Distillation for Healthcare Multi-Label Text Classification (KDH-MLTC). The proposed approach addresses conventional healthcare multi-label text classification challenges by integrating knowledge distillation and sequential fine-tuning. Experiments conducted on three medical literature datasets demonstrate that KDH-MLTC achieves superior performance compared to existing approaches.
arXiv Detail & Related papers (2025-05-12T00:58:25Z) - BingoGuard: LLM Content Moderation Tools with Risk Levels [67.53167973090356]
Malicious content generated by large language models (LLMs) can pose varying degrees of harm. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system.
arXiv Detail & Related papers (2025-03-09T10:43:09Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations [20.31796453890812]
HealthQ is a framework for evaluating the questioning capabilities of large language models (LLMs) in healthcare conversations. We integrate an LLM judge to evaluate generated questions across metrics such as specificity, relevance, and usefulness. We present the first systematic framework for assessing questioning capabilities in healthcare conversations, establish a model-agnostic evaluation methodology, and provide empirical evidence linking high-quality questions to improved patient information elicitation.
arXiv Detail & Related papers (2024-09-28T23:59:46Z) - Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement [10.611952462532908]
Multimodal ECG Representation Learning (MERL) is capable of performing zero-shot ECG classification with text prompts.
We propose the Clinical Knowledge Enhanced Prompt Engineering (CKEPE) approach to exploit external expert-verified clinical knowledge databases.
MERL achieves an average AUC score of 75.2% in zero-shot classification (without training data), 3.2% higher than linear probed eSSL methods with 10% annotated training data, averaged across all six datasets.
arXiv Detail & Related papers (2024-03-11T12:28:55Z) - A Review on Knowledge Graphs for Healthcare: Resources, Applications, and Promises [59.4999994297993]
This comprehensive review aims to provide an overview of the current state of Healthcare Knowledge Graphs (HKGs). We thoroughly analyzed existing literature on HKGs, covering their construction methodologies, utilization techniques, and applications. The review highlights the potential of HKGs to significantly impact biomedical research and clinical practice.
arXiv Detail & Related papers (2023-06-07T21:51:56Z) - Hierarchical Reinforcement Learning for Automatic Disease Diagnosis [52.111516253474285]
We propose to integrate a two-level hierarchical policy structure into the dialogue system for policy learning.
The proposed policy structure is capable of handling diagnosis problems involving large numbers of diseases and symptoms.
arXiv Detail & Related papers (2020-04-29T15:02:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.