Related papers: SDOH-NLI: a Dataset for Inferring Social Determinants of Health from Clinical Notes

URL: http://arxiv.org/abs/2310.18431v1
Date: Fri, 27 Oct 2023 19:09:30 GMT
Title: SDOH-NLI: a Dataset for Inferring Social Determinants of Health from Clinical Notes
Authors: Adam D. Lelkes, Eric Loreaux, Tal Schuster, Ming-Jun Chen, Alvin Rajkomar
Abstract summary: Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes. Progress on using NLP methods for this task has been hindered by the lack of high-quality publicly available labeled data. This paper introduces a new dataset, SDOH-NLI, that is based on publicly available notes and which we release publicly.
Score: 13.991819517682574
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes, and extracting these determinants from clinical notes is a first step to help healthcare providers systematically identify opportunities to provide appropriate care and address disparities. Progress on using NLP methods for this task has been hindered by the lack of high-quality publicly available labeled data, largely due to the privacy and regulatory constraints on the use of real patients' information. This paper introduces a new dataset, SDOH-NLI, that is based on publicly available notes and which we release publicly. We formulate SDOH extraction as a natural language inference (NLI) task, and provide binary textual entailment labels obtained from human raters for a cross product of a set of social history snippets as premises and SDOH factors as hypotheses. Our dataset differs from standard NLI benchmarks in that our premises and hypotheses are obtained independently. We evaluate both "off-the-shelf" entailment models as well as models fine-tuned on our data, and highlight the ways in which our dataset appears more challenging than commonly used NLI datasets.
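
The cross-product construction described in the abstract can be illustrated with a minimal sketch. It assumes nothing about the released data format: the example snippets, the hypothesis wording, the `build_pairs` helper, and the `entailed` field name are all hypothetical and are not taken from SDOH-NLI. The sketch only shows how independently collected premises and hypotheses are paired, leaving each label to be filled in by a human rater.

```python
from itertools import product

# Hypothetical social history snippets (premises); not from the dataset.
premises = [
    "Patient lives alone and smokes one pack per day.",
    "Denies alcohol or tobacco use. Works as a teacher.",
]

# Hypothetical SDOH factors phrased as hypotheses; not from the dataset.
hypotheses = [
    "The person currently smokes.",
    "The person is employed.",
]


def build_pairs(premises, hypotheses):
    """Pair every premise with every hypothesis, leaving the label unset."""
    return [
        {"premise": p, "hypothesis": h, "entailed": None}
        for p, h in product(premises, hypotheses)
    ]


pairs = build_pairs(premises, hypotheses)
for pair in pairs:
    print(pair["premise"], "|", pair["hypothesis"])
```

Because premises and hypotheses are collected independently, the number of pairs is the product of the two list lengths, and many pairs will be unrelated.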

Related papers

Privacy-Aware, Public-Aligned: Embedding Risk Detection and Public Values into Scalable Clinical Text De-Identification for Trusted Research Environments [0.0]
We show how direct and indirect identifiers vary by record type, clinical setting, and data flow, and show how changes in documentation practice can degrade model performance over time. Our findings highlight that privacy risk is context-dependent and cumulative, underscoring the need for adaptable, hybrid de-identification approaches.
arXiv Detail & Related papers (2025-06-01T17:45:57Z)
Integration of Large Language Models and Traditional Deep Learning for Social Determinants of Health Prediction [23.8766239221373]
Social Determinants of Health (SDoH) are economic, social and personal circumstances that affect or influence an individual's health status. We automatically extract SDoHs from clinical text using traditional deep learning and Large Language Models (LLMs). Our models outperform a previous reference point on a multilabel SDoH classification by 10 points.
arXiv Detail & Related papers (2025-05-06T23:11:59Z)
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage [77.83757117924995]
We propose a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information can be used to infer sensitive attributes like age or substance use history from sanitized data.
arXiv Detail & Related papers (2025-04-28T01:16:27Z)
A data-driven approach to discover and quantify systemic lupus erythematosus etiological heterogeneity from electronic health records [4.167173990365707]
Systemic lupus erythematosus (SLE) is a complex disease with many manifestational facets. We propose a data-driven approach to discover probabilistic independent sources from multimodal imperfect EHR data.
arXiv Detail & Related papers (2025-01-13T11:00:31Z)
LLM-Forest for Health Tabular Data Imputation [37.14344322899091]
Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation. We propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot learning LLM "trees" with confidence-based weighted voting. This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries.
arXiv Detail & Related papers (2024-10-28T20:42:46Z)
FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data [52.55123685248105]
Cardiovascular diseases (CVDs) are currently the leading cause of death worldwide, highlighting the critical need for early diagnosis and treatment. Machine learning (ML) methods can help diagnose CVDs early, but their performance relies on access to substantial data with high quality. This paper presents the first real-world FL benchmark for cardiovascular disease detection, named FedCVD.
arXiv Detail & Related papers (2024-10-28T02:24:01Z)
Controllable Synthetic Clinical Note Generation with Privacy Guarantees [7.1366477372157995]
In this paper, we introduce a novel method to "clone" datasets containing Personal Health Information (PHI). Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets.
arXiv Detail & Related papers (2024-09-12T07:38:34Z)
Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction [4.042918413611158]
Social determinants of health (SDOH) play an important role in health outcomes. Recent open data initiatives present an opportunity to construct a more comprehensive view of SDOH. Large language models (LLMs) have shown promise at automatically annotating structured data.
arXiv Detail & Related papers (2024-07-12T21:14:06Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
Evaluating the Impact of Social Determinants on Health Prediction in the Intensive Care Unit [10.764842579064636]
Social determinants of health (SDOH) play a crucial role in a person's health and well-being. Most risk prediction models based on electronic health records do not incorporate a comprehensive set of SDOH features. Our work links a publicly available EHR database, MIMIC-IV, to well-documented SDOH features.
arXiv Detail & Related papers (2023-05-22T01:27:51Z)
Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM). Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain. We annotated a corpus of clinical documents according to 12 types of identifying entities. We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
Leveraging Natural Language Processing to Augment Structured Social Determinants of Health Data in the Electronic Health Record [1.7812428873698403]
Social determinants of health (SDOH) impact health outcomes. Clinical notes often contain more comprehensive SDOH information. We developed a novel SDOH extractor using a deep learning entity and relation extraction architecture.
arXiv Detail & Related papers (2022-12-14T22:51:49Z)
Automatically Identifying Semantic Bias in Crowdsourced Natural Language Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets. Interventions and additional rounds of labeling can be performed to ameliorate the semantic bias of the hypothesis distribution of a dataset.
arXiv Detail & Related papers (2021-12-16T22:49:01Z)
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP [88.65488361532158]
Dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks. Data augmentation methods have been explored as a means of improving data efficiency in NLP. We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.