SDOH-NLI: a Dataset for Inferring Social Determinants of Health from
Clinical Notes
- URL: http://arxiv.org/abs/2310.18431v1
- Date: Fri, 27 Oct 2023 19:09:30 GMT
- Title: SDOH-NLI: a Dataset for Inferring Social Determinants of Health from
Clinical Notes
- Authors: Adam D. Lelkes, Eric Loreaux, Tal Schuster, Ming-Jun Chen, Alvin
Rajkomar
- Abstract summary: Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes.
Progress on using NLP methods for this task has been hindered by the lack of high-quality publicly available labeled data.
This paper introduces a new dataset, SDOH-NLI, that is based on publicly available notes and which we release publicly.
- Score: 13.991819517682574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social and behavioral determinants of health (SDOH) play a significant role
in shaping health outcomes, and extracting these determinants from clinical
notes is a first step to help healthcare providers systematically identify
opportunities to provide appropriate care and address disparities. Progress on
using NLP methods for this task has been hindered by the lack of high-quality
publicly available labeled data, largely due to the privacy and regulatory
constraints on the use of real patients' information. This paper introduces a
new dataset, SDOH-NLI, that is based on publicly available notes and which we
release publicly. We formulate SDOH extraction as a natural language inference
(NLI) task, and provide binary textual entailment labels obtained from human
raters for a cross product of a set of social history snippets as premises and
SDOH factors as hypotheses. Our dataset differs from standard NLI benchmarks in
that our premises and hypotheses are obtained independently. We evaluate both
"off-the-shelf" entailment models as well as models fine-tuned on our data, and
highlight the ways in which our dataset appears more challenging than commonly
used NLI datasets.
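The cross-product construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `NLIExample` type, the `build_pairs` helper, and the example premises and hypotheses are all hypothetical and are not taken from the dataset. The sketch only shows how independently obtained premises and hypotheses combine into one candidate pair per (snippet, factor) combination, each awaiting a binary entailment label from a human rater.

```python
from itertools import product
from typing import List, NamedTuple, Optional


class NLIExample(NamedTuple):
    """One premise/hypothesis pair (hypothetical schema, not the released format)."""
    premise: str                     # social history snippet from a note
    hypothesis: str                  # SDOH factor phrased as a statement
    entailed: Optional[bool] = None  # binary label; None until a rater assigns it


def build_pairs(premises: List[str], hypotheses: List[str]) -> List[NLIExample]:
    """Pair every premise with every hypothesis (the cross product)."""
    return [NLIExample(p, h) for p, h in product(premises, hypotheses)]


# Invented examples for illustration only.
premises = [
    "Lives alone. Denies tobacco use.",
    "Retired teacher, drinks wine occasionally.",
]
hypotheses = [
    "The person smokes.",
    "The person lives alone.",
    "The person is retired.",
]

pairs = build_pairs(premises, hypotheses)
for pair in pairs:
    print(pair.premise, "|", pair.hypothesis, "|", pair.entailed)
```

Because premises and hypotheses are paired exhaustively rather than written for each other, most pairs in such a construction would be expected to be non-entailed, which is one way the setup departs from benchmarks where hypotheses are authored with a specific premise in view.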
Related papers
- LLM-Forest for Health Tabular Data Imputation [37.14344322899091]
Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation.
We propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot learning LLM "trees" with confidence-based weighted voting.
This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries.
arXiv Detail & Related papers (2024-10-28T20:42:46Z)
- FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data [52.55123685248105]
Cardiovascular diseases (CVDs) are currently the leading cause of death worldwide, highlighting the critical need for early diagnosis and treatment.
Machine learning (ML) methods can help diagnose CVDs early, but their performance relies on access to substantial data with high quality.
This paper presents the first real-world FL benchmark for cardiovascular disease detection, named FedCVD.
arXiv Detail & Related papers (2024-10-28T02:24:01Z)
- Controllable Synthetic Clinical Note Generation with Privacy Guarantees [7.1366477372157995]
In this paper, we introduce a novel method to "clone" datasets containing Personal Health Information (PHI).
Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy.
We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets.
arXiv Detail & Related papers (2024-09-12T07:38:34Z)
- Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction [4.042918413611158]
Social determinants of health (SDOH) play an important role in health outcomes.
Recent open data initiatives present an opportunity to construct a more comprehensive view of SDOH.
Large language models (LLMs) have shown promise at automatically annotating structured data.
arXiv Detail & Related papers (2024-07-12T21:14:06Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Evaluating the Impact of Social Determinants on Health Prediction in the
Intensive Care Unit [10.764842579064636]
Social determinants of health (SDOH) play a crucial role in a person's health and well-being.
Most risk prediction models based on electronic health records do not incorporate a comprehensive set of SDOH features.
Our work links a publicly available EHR database, MIMIC-IV, to well-documented SDOH features.
arXiv Detail & Related papers (2023-05-22T01:27:51Z)
- Large Language Models for Healthcare Data Augmentation: An Example on
Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
- Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- Leveraging Natural Language Processing to Augment Structured Social
Determinants of Health Data in the Electronic Health Record [1.7812428873698403]
Social determinants of health (SDOH) impact health outcomes.
Clinical notes often contain more comprehensive SDOH information.
We developed a novel SDOH extractor using a deep learning entity and relation extraction architecture.
arXiv Detail & Related papers (2022-12-14T22:51:49Z)
- Automatically Identifying Semantic Bias in Crowdsourced Natural Language
Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can be performed to ameliorate the semantic bias of the hypothesis distribution of a dataset.
arXiv Detail & Related papers (2021-12-16T22:49:01Z)
- An Empirical Survey of Data Augmentation for Limited Data Learning in
NLP [88.65488361532158]
Dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks.
Data augmentation methods have been explored as a means of improving data efficiency in NLP.
We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.