Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology
- URL: http://arxiv.org/abs/2504.00061v1
- Date: Mon, 31 Mar 2025 14:09:53 GMT
- Title: Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology
- Authors: Dou Liu, Ying Long, Sophia Zuoqiu, Tian Tang, Rong Yin,
- Abstract summary: Effective physician-patient communications are critical but consume a lot of time and, therefore, cause clinic to become inefficient.<n>Recent advancements in Large Language Models (LLMs) offer a potential solution for automating medical history-taking and improving diagnostic accuracy.<n>An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini.<n>Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy.
- Score: 4.48731404829722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective physician-patient communications in pre-diagnostic environments, and most specifically in complex and sensitive medical areas such as infertility, are critical but consume a lot of time and, therefore, cause clinic workflows to become inefficient. Recent advancements in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs in those tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p > 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's $\alpha$ = 0.562), suggesting variability in classification reliability. Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. In future studies, expert validation for accuracy and dependability in a clinical setting, AI model fine-tuning, and larger datasets with a mix of cases of infertility have to be prioritized.
Related papers
- Can Reasoning LLMs Enhance Clinical Document Classification? [7.026393789313748]
Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task.
This study evaluates the performance and consistency of eight LLMs; four reasoning (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat)
Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%)
arXiv Detail & Related papers (2025-04-10T18:00:27Z) - Urinary Tract Infection Detection in Digital Remote Monitoring: Strategies for Managing Participant-Specific Prediction Complexity [43.108040967674185]
Urinary tract infections (UTIs) are a significant health concern, particularly for people living with dementia (PLWD)
This study builds on previous work that utilised machine learning (ML) to detect UTIs in PLWD.
arXiv Detail & Related papers (2025-02-18T12:01:55Z) - Unlocking Historical Clinical Trial Data with ALIGN: A Compositional Large Language Model System for Medical Coding [44.01429184037945]
We introduce ALIGN, a novel compositional LLM-based system for automated, zero-shot medical coding.<n>We evaluate ALIGN on harmonizing medication terms into Anatomical Therapeutic Chemical (ATC) and medical history terms into Medical Dictionary for Regulatory Activities (MedDRA) codes.
arXiv Detail & Related papers (2024-11-20T09:59:12Z) - Towards Equitable ASD Diagnostics: A Comparative Study of Machine and Deep Learning Models Using Behavioral and Facial Data [2.6353853440763113]
Autism Spectrum Disorder (ASD) is often underdiagnosed in females due to gender-specific symptom differences.
This study evaluates machine learning models, particularly Random Forest and convolutional neural networks, for enhancing ASD diagnosis.
arXiv Detail & Related papers (2024-11-08T05:26:04Z) - Improving Clinical Documentation with AI: A Comparative Study of Sporo AI Scribe and GPT-4o mini [0.0]
Sporo Health's AI scribe was evaluated against OpenAI's GPT-4o Mini.
Results show that Sporo AI consistently outperformed GPT-4o Mini, achieving higher recall, precision, and overall F1 scores.
arXiv Detail & Related papers (2024-10-20T22:48:40Z) - Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety.
This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z) - Towards Accountable AI-Assisted Eye Disease Diagnosis: Workflow Design, External Validation, and Continual Learning [5.940140611616894]
AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical and diverse populations.
This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diagnosis and classification severity.
arXiv Detail & Related papers (2024-09-23T15:01:09Z) - Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options.
The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z) - DDxT: Deep Generative Transformer Models for Differential Diagnosis [51.25660111437394]
We show that a generative approach trained with simpler supervised and self-supervised learning signals can achieve superior results on the current benchmark.
The proposed Transformer-based generative network, named DDxT, autoregressively produces a set of possible pathologies, i.e., DDx, and predicts the actual pathology using a neural network.
arXiv Detail & Related papers (2023-12-02T22:57:25Z) - More Reliable AI Solution: Breast Ultrasound Diagnosis Using Multi-AI
Combination [1.3357122589980752]
Existing machines embedded in the AI system do not reach the accuracy that clinicians hope.
Super-resolution network reduces unclearness of ultrasound images caused by the device itself.
Two methods for transforming the target model into a classification model are proposed.
arXiv Detail & Related papers (2021-01-07T17:19:00Z) - Ensemble model for pre-discharge icd10 coding prediction [45.82374977939355]
We propose an ensemble model incorporating multiple clinical data sources for accurate code predictions.
We obtain multi-label classification accuracies of 0.73 and 0.58 for average precision, 0.56 and 0.35 for F1-scores and 0.71 and 0.4 accuracy in predicting principal diagnosis for inpatient and outpatient datasets respectively.
arXiv Detail & Related papers (2020-12-16T07:02:56Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.