Related papers: DM-Bench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management

URL: http://arxiv.org/abs/2510.00038v2
Date: Thu, 02 Oct 2025 19:56:21 GMT
Title: DM-Bench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management
Authors: Maria Ana Cardei, Josephine Lamp, Mark Derdzinski, Karan Bhatia,
Abstract summary: We present DM-Bench, the first benchmark designed to evaluate large language model (LLM) performance across real-world decision-making tasks. Our benchmark encompasses 7 distinct task categories, reflecting the breadth of real-world questions individuals with diabetes ask. By establishing this benchmark, we aim to advance the reliability, safety, effectiveness and practical utility of AI solutions in diabetes care.
Score: 1.8289322713164966
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present DM-Bench, the first benchmark designed to evaluate large language model (LLM) performance across real-world decision-making tasks faced by individuals managing diabetes in their daily lives. Unlike prior health benchmarks that are either generic, clinician-facing or focused on clinical tasks (e.g., diagnosis, triage), DM-Bench introduces a comprehensive evaluation framework tailored to the unique challenges of prototyping patient-facing AI solutions in diabetes, glucose management, metabolic health and related domains. Our benchmark encompasses 7 distinct task categories, reflecting the breadth of real-world questions individuals with diabetes ask, including basic glucose interpretation, educational queries, behavioral associations, advanced decision making and long term planning. Towards this end, we compile a rich dataset comprising one month of time-series data encompassing glucose traces and metrics from continuous glucose monitors (CGMs) and behavioral logs (e.g., eating and activity patterns) from 15,000 individuals across three different diabetes populations (type 1, type 2, pre-diabetes/general health and wellness). Using this data, we generate a total of 360,600 personalized, contextual questions across the 7 tasks. We evaluate model performance on these tasks across 5 metrics: accuracy, groundedness, safety, clarity and actionability. Our analysis of 8 recent LLMs reveals substantial variability across tasks and metrics; no single model consistently outperforms others across all dimensions. By establishing this benchmark, we aim to advance the reliability, safety, effectiveness and practical utility of AI solutions in diabetes care.

Related papers

Glucose-ML: A collection of longitudinal diabetes datasets for development of robust AI solutions [0.0]
Glucose-ML is a collection of 10 publicly available diabetes datasets. The Glucose-ML collection comprises over 300,000 days of continuous glucose monitor (CGM) data. We conduct a case study for the task of blood glucose prediction.
arXiv Detail & Related papers (2025-07-18T16:53:05Z)
Insulin Resistance Prediction From Wearables and Routine Blood Biomarkers [44.735350554750916]
We remotely recruited the largest dataset to date across the US to study insulin resistance. Deep neural network models were developed to predict insulin resistance based on readily available digital and blood biomarkers. The model showed 93% sensitivity and 95% adjusted specificity in obese and sedentary participants.
arXiv Detail & Related papers (2025-04-30T16:10:20Z)
GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks [0.12564343689544843]
Continuous glucose monitors (CGM) are small medical devices that measure blood glucose levels at regular intervals. Forecasting of glucose trajectories based on CGM data holds the potential to substantially improve diabetes management.
arXiv Detail & Related papers (2024-10-08T08:01:09Z)
Chronic Disease Diagnoses Using Behavioral Data [42.96592744768303]
We aim to diagnose hyperglycemia (diabetes), hyperlipidemia, and hypertension (collectively known as 3H) using our own collected behavioral data.
arXiv Detail & Related papers (2024-10-04T12:52:49Z)
Diabetica: Adapting Large Language Model to Enhance Multiple Medical Tasks in Diabetes Care and Management [25.56300105245314]
Large language models (LLMs) have shown promise in various healthcare scenarios, but their effectiveness across diverse diabetes tasks remains unproven. Our study introduced a framework to train and validate diabetes-specific LLMs.
arXiv Detail & Related papers (2024-09-20T03:47:54Z)
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
From Glucose Patterns to Health Outcomes: A Generalizable Foundation Model for Continuous Glucose Monitor Data Analysis [47.23780364438969]
We present GluFormer, a generative foundation model for CGM data that learns nuanced glycemic patterns and translates them into predictive representations of metabolic health. GluFormer generalizes to 19 external cohorts spanning different ethnicities and ages, 5 countries, 8 CGM devices, and diverse pathophysiological states. In a longitudinal study of 580 adults with CGM data and 12-year follow-up, GluFormer identifies individuals at elevated risk of developing diabetes more effectively than blood HbA1C%.
arXiv Detail & Related papers (2024-08-20T13:19:06Z)
Privacy Preserved Blood Glucose Level Cross-Prediction: An Asynchronous Decentralized Federated Learning Approach [13.363740869325646]
Newly diagnosed Type 1 Diabetes (T1D) patients often struggle to obtain effective Blood Glucose (BG) prediction models. We propose "GluADFL", blood Glucose prediction by Asynchronous Decentralized Federated Learning.
arXiv Detail & Related papers (2024-06-21T17:57:39Z)
Multimodal Pretraining of Medical Time Series and Notes [45.89025874396911]
Deep learning models show promise in extracting meaningful patterns, but they require extensive labeled data. We propose a novel approach employing self-supervised pretraining, focusing on the alignment of clinical measurements and notes. In downstream tasks, including in-hospital mortality prediction and phenotyping, our model outperforms baselines in settings where only a fraction of the data is labeled.
arXiv Detail & Related papers (2023-12-11T21:53:40Z)
Remote Medication Status Prediction for Individuals with Parkinson's Disease using Time-series Data from Smartphones [75.23250968928578]
We present a method for predicting the medication status of Parkinson's disease patients using the public mPower dataset. The proposed method shows promising results in predicting three medication statuses objectively.
arXiv Detail & Related papers (2022-07-26T02:08:08Z)
Short Term Blood Glucose Prediction based on Continuous Glucose Monitoring Data [53.01543207478818]
This study explores the use of Continuous Glucose Monitoring (CGM) data as input for digital decision support tools. We investigate how Recurrent Neural Networks (RNNs) can be used for Short Term Blood Glucose (STBG) prediction.
arXiv Detail & Related papers (2020-02-06T16:39:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
