Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models
- URL: http://arxiv.org/abs/2507.03433v1
- Date: Fri, 04 Jul 2025 09:41:33 GMT
- Title: Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models
- Authors: Adrien Bazoge, Pacôme Constant dit Beaufils, Mohammed Hmitouch, Romain Bourcier, Emmanuel Morin, Richard Dufour, Béatrice Daille, Pierre-Antoine Gourraud, Matilde Karakachoff
- Abstract summary: Social determinants of health (SDoH) influence health outcomes, shaping disease progression, treatment adherence, and health disparities. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France.
- Score: 5.070772241416699
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on an English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
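As an illustration of the first evaluation level (identifying SDoH categories and their associated values), the following is a minimal sketch of how a per-category F1 score could be computed. It assumes exact matching on (category, value) pairs per note; the abstract does not state the paper's matching criteria, and the function name `per_category_f1`, the category labels, and the values below are hypothetical.

```python
from collections import defaultdict


def per_category_f1(gold, pred):
    """Compute F1 per SDoH category.

    gold, pred: lists with one entry per note; each entry is a set of
    (category, value) pairs. Exact pair matching is an assumption here.
    """
    tp = defaultdict(int)  # predicted pair also present in gold
    fp = defaultdict(int)  # predicted pair absent from gold
    fn = defaultdict(int)  # gold pair the model missed
    for g, p in zip(gold, pred):
        for cat, _ in g & p:
            tp[cat] += 1
        for cat, _ in p - g:
            fp[cat] += 1
        for cat, _ in g - p:
            fn[cat] += 1
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero for
        # a category that appears in at least one counter.
        scores[cat] = 2 * tp[cat] / (2 * tp[cat] + fp[cat] + fn[cat])
    return scores


# Hypothetical annotations for two notes.
gold = [{("tobacco", "current")}, {("tobacco", "past"), ("alcohol", "none")}]
pred = [{("tobacco", "current")}, {("tobacco", "current"), ("alcohol", "none")}]
print(per_category_f1(gold, pred))
```

In this toy example, a wrong value for a correctly detected category counts as both a false positive and a false negative, which is one common convention for category-plus-value scoring.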
Related papers
- Extracting Patient History from Clinical Text: A Comparative Study of Clinical Large Language Models [3.1277841304339065]
This study evaluates the performance of clinical large language models (cLLMs) in recognizing medical history entities (MHEs). We annotated 1,449 MHEs across 61 outpatient-related clinical notes from the MTSamples repository. The cLLMs showed potential in reducing the time required for extracting MHEs by over 20%.
arXiv Detail & Related papers (2025-03-30T02:00:56Z)
- SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH) [43.79125048893811]
We introduce SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method to extract social determinants of health from medical notes.
It achieved tenfold and twentyfold reductions in time and cost respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92.
This study highlights the potential of leveraging LLMs to revolutionize medical note classification, demonstrating their capability to achieve highly accurate classifications with significantly reduced time and cost.
arXiv Detail & Related papers (2024-07-24T09:57:51Z)
- Extracting Social Determinants of Health from Pediatric Patient Notes Using Large Language Models: Novel Corpus and Methods [17.83326146480516]
Social determinants of health (SDoH) play a critical role in shaping health outcomes.
We present a novel annotated corpus, the Pediatric Social History Corpus (PedSHAC).
We evaluate the automatic extraction of detailed SDoH representations using fine-tuned and in-context learning methods.
arXiv Detail & Related papers (2024-03-31T23:37:18Z)
- Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
Sociodemographic prompting is a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give.
We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.
arXiv Detail & Related papers (2023-09-13T15:42:06Z)
- Large Language Models to Identify Social Determinants of Health in Electronic Health Records [2.168737004368243]
Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health records (EHRs).
This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented.
800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated.
arXiv Detail & Related papers (2023-08-11T19:18:35Z)
- Clinical Deterioration Prediction in Brazilian Hospitals Based on Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD).
The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- A Study of Social and Behavioral Determinants of Health in Lung Cancer Patients Using Transformers-based Natural Language Processing Models [23.68697811086486]
Social and behavioral determinants of health (SBDoH) have important roles in shaping people's health.
Few studies have examined SBDoH factors in clinical outcomes due to the lack of structured SBDoH information in current electronic health record systems.
Natural language processing (NLP) is thus the key technology to extract such information from unstructured clinical text.
arXiv Detail & Related papers (2021-08-10T22:11:31Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)