A Locally Executable AI System for Improving Preoperative Patient Communication: A Multi-Domain Clinical Evaluation
- URL: http://arxiv.org/abs/2510.01671v1
- Date: Thu, 02 Oct 2025 04:53:11 GMT
- Title: A Locally Executable AI System for Improving Preoperative Patient Communication: A Multi-Domain Clinical Evaluation
- Authors: Motoki Sato, Yuki Matsushita, Hidekazu Takahashi, Tomoaki Kakazu, Sou Nagata, Mizuho Ohnuma, Atsushi Yoshikawa, Masayuki Yamamura
- Abstract summary: LENOHA is a safety-first, local-first system that routes inputs with a high-precision sentence-transformer classifier. It returns verbatim answers from a clinician-curated FAQ for clinical queries. Energy logging shows that the non-generative clinical path consumes ~1.0 mWh per input versus ~168 mWh per small-talk reply.
- Score: 1.9205944025326396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Patients awaiting invasive procedures often have unanswered pre-procedural questions; however, time-pressured workflows and privacy constraints limit personalized counseling. We present LENOHA (Low Energy, No Hallucination, Leave No One Behind Architecture), a safety-first, local-first system that routes inputs with a high-precision sentence-transformer classifier and returns verbatim answers from a clinician-curated FAQ for clinical queries, eliminating free-text generation in the clinical path. We evaluated two domains (tooth extraction and gastroscopy) using expert-reviewed validation sets (n=400/domain) for thresholding and independent test sets (n=200/domain). Among the four encoders, E5-large-instruct (560M) achieved an overall accuracy of 0.983 (95% CI 0.964-0.991) and an AUC of 0.996 with seven total errors, a result statistically indistinguishable from GPT-4o on this task; Gemini made no errors on this test set. Energy logging shows that the non-generative clinical path consumes ~1.0 mWh per input versus ~168 mWh per small-talk reply from a local 8B SLM, a ~170x difference, while maintaining ~0.10 s latency on a single on-prem GPU. These results indicate that the system achieves near-frontier discrimination and that generation-induced errors are structurally avoided in the clinical path by returning vetted FAQ answers verbatim, supporting privacy, sustainability, and equitable deployment in bandwidth-limited environments.
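The routing idea in the abstract (classify the input, then return a vetted FAQ answer verbatim for clinical queries) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: the `FAQ` entries, the `route` function, the 0.5 threshold, and the toy bag-of-words `embed` function, which stands in for a sentence-transformer encoder such as E5-large-instruct. The sketch also collapses classification and FAQ lookup into a single nearest-question similarity check, which the abstract does not specify.

```python
import math
import re
from collections import Counter

# Hypothetical clinician-curated FAQ (illustrative text, not from the paper).
FAQ = {
    "Can I eat before my gastroscopy?":
        "Please follow the fasting instructions given by your clinic.",
    "How long does a tooth extraction take?":
        "Your dentist will explain the expected duration at your visit.",
}


def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-transformer encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(user_input, threshold=0.5):
    """Return ("clinical", verbatim FAQ answer) or ("small_talk", None).

    The threshold is a hypothetical value; the paper tunes its own
    threshold on expert-reviewed validation sets.
    """
    query = embed(user_input)
    best_question = max(FAQ, key=lambda q: cosine(query, embed(q)))
    score = cosine(query, embed(best_question))
    if score >= threshold:
        # Clinical path: no text is generated, the stored answer is returned as-is.
        return "clinical", FAQ[best_question]
    # Non-clinical path: in the paper this is handled by a local small language model.
    return "small_talk", None


if __name__ == "__main__":
    print(route("Can I eat before the gastroscopy?"))
    print(route("Nice weather today, isn't it?"))
```

Because the clinical branch only looks up and returns stored text, it cannot produce content that a clinician has not vetted, which is the structural property the abstract emphasizes.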
Related papers
- Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems [19.880569341968023]
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. We propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics.
arXiv Detail & Related papers (2026-01-21T16:40:41Z)
- CPGPrompt: Translating Clinical Guidelines into LLM-Executable Decision Support [18.887576751340884]
We develop and validate CPGPrompt, an auto-prompting system that converts narrative clinical guidelines into large language models (LLMs). Our framework translates CPGs into structured decision trees and utilizes an LLM to dynamically navigate them for patient case evaluation. System performance was assessed on both binary specialty-referral decisions and fine-grained pathway-classification tasks.
arXiv Detail & Related papers (2026-01-07T00:05:42Z)
- A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
- Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation [5.555479009357263]
Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes; (2) we develop a retrieval-augmented evaluation pipeline; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection.
arXiv Detail & Related papers (2025-09-26T16:42:43Z)
- EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z)
- Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
- Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering [3.3260862557368926]
We present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. A self-consistency voting scheme further improves evidence recall without sacrificing precision.
arXiv Detail & Related papers (2025-06-12T14:36:18Z)
- Uncertainty-guided annotation enhances segmentation with the human-in-the-loop [5.669636524329784]
Uncertainty-Guided Annotation (UGA) introduces a human-in-the-loop approach, enabling AI to convey its uncertainties to clinicians.
UGA eases this interaction by quantifying uncertainty at the pixel level, thereby revealing the model's limitations.
To foster broader application and community contribution, we have made our code accessible.
arXiv Detail & Related papers (2024-02-16T16:41:15Z)
- Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection [70.86672569101536]
Early diagnosis of Alzheimer's disease (AD) is crucial for facilitating preventive care and delaying further progression.
This paper investigates the use of prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective function.
arXiv Detail & Related papers (2022-10-29T09:18:41Z)
- An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder [24.776445591293186]
This work proposes a transformer architecture for user-level classification of gambling addiction and depression.
We process a set of social media posts from a particular individual, to make use of the interactions between posts and eliminate label noise at the post level.
Our architecture is interpretable with modern feature attribution methods and allows for automatic dataset creation.
arXiv Detail & Related papers (2022-07-02T06:40:56Z)
- Controlling False Positive/Negative Rates for Deep-Learning-Based Prostate Cancer Detection on Multiparametric MR images [58.85481248101611]
We propose a novel PCa detection network that incorporates a lesion-level cost-sensitive loss and an additional slice-level loss based on a lesion-to-slice mapping function.
Our experiments based on 290 clinical patients conclude that 1) the lesion-level FNR was effectively reduced from 0.19 to 0.10 and the lesion-level FPR was reduced from 1.03 to 0.66 by changing the lesion-level cost.
arXiv Detail & Related papers (2021-06-04T09:51:27Z)
- DDANet: Dual Decoder Attention Network for Automatic Polyp Segmentation [0.3734402152170273]
We propose a novel architecture called "DDANet" based on a dual decoder attention network.
Experiments demonstrate that the model trained on the Kvasir-SEG dataset and tested on an unseen dataset achieves a Dice coefficient of 0.7874, mIoU of 0.7010, recall of 0.7987, and precision of 0.8577.
arXiv Detail & Related papers (2020-12-30T17:52:35Z)
- Collaborative residual learners for automatic icd10 prediction using prescribed medications [45.82374977939355]
We propose a novel collaborative residual learning based model to automatically predict ICD10 codes employing only prescriptions data.
For multi-label classification, we obtain average precision of 0.71 and 0.57 and F1-scores of 0.57 and 0.38, along with accuracy of 0.73 and 0.44 in predicting the principal diagnosis, for inpatient and outpatient datasets respectively.
arXiv Detail & Related papers (2020-12-16T07:07:27Z)
- Ensemble model for pre-discharge icd10 coding prediction [45.82374977939355]
We propose an ensemble model incorporating multiple clinical data sources for accurate code predictions.
For multi-label classification, we obtain average precision of 0.73 and 0.58 and F1-scores of 0.56 and 0.35, along with accuracy of 0.71 and 0.4 in predicting the principal diagnosis, for inpatient and outpatient datasets respectively.
arXiv Detail & Related papers (2020-12-16T07:02:56Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.