Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation
- URL: http://arxiv.org/abs/2511.15159v1
- Date: Wed, 19 Nov 2025 06:19:34 GMT
- Title: Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation
- Authors: Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
- Abstract summary: High-quality feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts.
- Score: 66.7752700084159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
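The abstract states that explicit IAT structure is injected into the GPT-4o prompt but does not reproduce the prompt itself. A minimal sketch of what such conditioning could look like, assuming the OpenAI Python SDK; the triplet schema and prompt wording here are illustrative guesses, not the authors' code:

```python
# Hypothetical IAT-conditioned feedback generation (not the authors' code).
# pip install openai; expects OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_feedback(procedure: str, task: str, iat: dict) -> str:
    """Condition GPT-4o on one normalized Instrument-Action-Target triplet."""
    system = ("You are an attending surgeon giving brief, trainer-style "
              "intraoperative feedback to a trainee.")
    user = (f"Procedure: {procedure}\nTask: {task}\n"
            f"Observed action (IAT): instrument={iat['instrument']}, "
            f"action={iat['action']}, target={iat['target']}\n"
            "Give one short, clinically grounded feedback utterance.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=0.3,
    )
    return resp.choices[0].message.content

print(generate_feedback("robot-assisted radical prostatectomy",
                        "nerve sparing",
                        {"instrument": "scissors",
                         "action": "retract",
                         "target": "neurovascular bundle"}))
```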
Related papers
- UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos [81.9180187964947]
We present UniSurg, a foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. To enable large-scale pretraining, we curate the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
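A toy illustration of the latent-motion-prediction objective (an assumed stand-in, not UniSurg's architecture): the loss compares predicted and encoded latents rather than reconstructed pixels.

```python
# Toy latent motion prediction: predict the next frame's latent, not its pixels.
# Shapes and modules are illustrative assumptions, not UniSurg's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMotionToy(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(              # frame -> latent
            nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.predictor = nn.Sequential(            # latent_t -> latent_{t+1}
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, frame_t, frame_t1):
        z_t = self.encoder(frame_t)
        with torch.no_grad():                      # target latent, no gradient
            z_t1 = self.encoder(frame_t1)
        return F.mse_loss(self.predictor(z_t), z_t1)

model = LatentMotionToy()
loss = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
loss.backward()
```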
arXiv Detail & Related papers (2026-02-05T13:18:33Z)
- Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches [5.958100741754613]
We evaluated large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas. We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79.
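A hypothetical sketch of what lesion-tagged input plus anatomy-aware prompting might look like; the tag format, hint wording, and field names are assumptions, not the paper's scheme.

```python
# Hypothetical lesion tagging + anatomy-aware prompt assembly.
def build_incidentaloma_prompt(report: str, lesions: list[dict]) -> str:
    tagged = report
    for i, les in enumerate(lesions, 1):
        # wrap each lesion mention in an index tag, e.g. <LESION_1>...</LESION_1>
        tagged = tagged.replace(
            les["mention"], f"<LESION_{i}>{les['mention']}</LESION_{i}>")
    hints = "\n".join(f"LESION_{i}: anatomy = {les['anatomy']}"
                      for i, les in enumerate(lesions, 1))
    return ("For each tagged lesion, decide whether it is an incidentaloma "
            "requiring follow-up. Use the anatomy hints when reasoning.\n"
            f"Anatomy hints:\n{hints}\n\nReport:\n{tagged}")

print(build_incidentaloma_prompt(
    "A 9 mm hypodense lesion in the left adrenal gland is noted.",
    [{"mention": "9 mm hypodense lesion", "anatomy": "adrenal"}]))
```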
arXiv Detail & Related papers (2025-12-05T08:49:57Z)
- Explainable Anatomy-Guided AI for Prostate MRI: Foundation Models and In Silico Clinical Trials for Virtual Biopsy-based Risk Assessment [3.5408411348831232]
We present a fully automated, anatomically guided deep learning pipeline for prostate cancer (PCa) risk stratification using routine MRI.<n>The pipeline integrates three key components: an nnU-Net module for segmenting the prostate gland and its zones on axial T2-weighted MRI; a classification module based on the DiceedPT Swin Transformer foundation model, fine-tuned on 3D patches with optional anatomical priors and clinical data; and a VAE-GAN framework for generating counterfactual heatmaps that localize decision-driving image regions.
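The data flow through the three stages could be glued together roughly as below; every model is a stubbed placeholder, since only the pipeline shape (T2-weighted volume → zone masks → risk score → counterfactual heatmap) is being illustrated.

```python
# Schematic pipeline glue; the nnU-Net, Swin classifier, and VAE-GAN are
# replaced by trivial stubs so that only the data flow is shown.
import numpy as np

def segment_zones(t2w: np.ndarray) -> np.ndarray:           # nnU-Net stand-in
    return np.zeros_like(t2w, dtype=np.int8)

def classify_risk(t2w, zones, clinical: dict) -> float:     # Swin stand-in
    return 0.5

def counterfactual_heatmap(t2w: np.ndarray) -> np.ndarray:  # VAE-GAN stand-in
    return np.zeros_like(t2w, dtype=np.float32)

t2w = np.random.rand(24, 256, 256).astype(np.float32)  # axial T2-weighted MRI
zones = segment_zones(t2w)                             # 1) anatomical priors
risk = classify_risk(t2w, zones, {"psa": 6.1})         # 2) PCa risk score
heatmap = counterfactual_heatmap(t2w)                  # 3) explanation
```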
arXiv Detail & Related papers (2025-05-23T14:40:09Z)
- Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment [65.70317151363204]
This work introduces the first framework for reconstructing surgical dialogue from unstructured real-world recordings. In surgical training, the formative verbal feedback that trainers provide to trainees during live surgeries is crucial for ensuring safety, correcting behavior immediately, and facilitating long-term skill acquisition. Our framework integrates voice activity detection, speaker diarization, and automated speech recognition, with a novel enhancement that removes hallucinations.
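The summary does not say which components are used; a sketch of such a pipeline, assuming pyannote.audio for diarization and openai-whisper for ASR (the paper's hallucination-removal enhancement is not shown):

```python
# Assumed components: pyannote.audio (diarization) + openai-whisper (ASR).
# pip install pyannote.audio openai-whisper; pyannote needs a HF access token.
import whisper
from pyannote.audio import Pipeline

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
asr = whisper.load_model("small")

def reconstruct_dialogue(wav_path: str) -> list[tuple[str, str]]:
    """Return (speaker, utterance) pairs for one OR recording."""
    audio = whisper.load_audio(wav_path)  # 16 kHz mono float32
    sr = 16000
    turns = []
    for segment, _, speaker in diarizer(wav_path).itertracks(yield_label=True):
        clip = audio[int(segment.start * sr):int(segment.end * sr)]
        text = asr.transcribe(clip)["text"].strip()
        if text:  # a real system would also filter ASR hallucinations here
            turns.append((speaker, text))
    return turns
```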
arXiv Detail & Related papers (2024-12-01T10:35:12Z)
- Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment [66.6041949490137]
We propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness.
Our findings show that both transcribed feedback and surgical video are individually predictive of trainee behavior changes.
Our results demonstrate the potential of multi-modal learning to advance the automated assessment of surgical feedback.
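A toy late-fusion baseline in this spirit (dimensions and architecture are assumptions; the paper's self-supervised objectives are not reproduced):

```python
# Toy late fusion of text and video embeddings for effectiveness prediction.
import torch
import torch.nn as nn

class FeedbackEffectivenessToy(nn.Module):
    def __init__(self, text_dim=768, video_dim=512, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, 1))

    def forward(self, text_emb, video_emb):
        fused = torch.cat([self.text_proj(text_emb),
                           self.video_proj(video_emb)], dim=-1)
        # probability that the feedback changes trainee behavior
        return torch.sigmoid(self.head(fused)).squeeze(-1)

model = FeedbackEffectivenessToy()
p = model(torch.randn(4, 768), torch.randn(4, 512))  # batch of 4 clips
```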
arXiv Detail & Related papers (2024-11-17T00:13:00Z)
- Evaluating the Application of ChatGPT in Outpatient Triage Guidance: A Comparative Study [11.37622565068147]
The integration of Artificial Intelligence in healthcare presents transformative potential for enhancing operational efficiency and health outcomes.
Large Language Models (LLMs), such as ChatGPT, have shown their capabilities in supporting medical decision-making.
This study specifically aims to evaluate the consistency of responses provided by ChatGPT in outpatient guidance.
arXiv Detail & Related papers (2024-04-27T04:12:02Z)
- Deep Multimodal Fusion for Surgical Feedback Classification [70.53297887843802]
We leverage a clinically validated five-category classification of surgical feedback.
We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities.
The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale.
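Because several categories can apply to one utterance, the setup is multi-label rather than multi-class; a minimal sketch with an assumed fused-embedding size:

```python
# Multi-label head over an assumed fused text/audio/video embedding.
import torch
import torch.nn as nn

NUM_CATEGORIES = 5   # the clinically validated feedback taxonomy
FUSED_DIM = 512      # assumed fusion output size

head = nn.Linear(FUSED_DIM, NUM_CATEGORIES)
criterion = nn.BCEWithLogitsLoss()   # independent sigmoid per category

fused = torch.randn(8, FUSED_DIM)    # batch of 8 feedback utterances
labels = torch.randint(0, 2, (8, NUM_CATEGORIES)).float()
logits = head(fused)
loss = criterion(logits, labels)
preds = torch.sigmoid(logits) > 0.5  # multi-hot category predictions
```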
arXiv Detail & Related papers (2023-12-06T01:59:47Z)
- Self-distilled Masked Attention guided masked image modeling with noise Regularized Teacher (SMART) for medical image analysis [6.712251433139412]
Pretraining vision transformers (ViT) with attention-guided masked image modeling (MIM) has been shown to increase downstream accuracy for natural image analysis.
We developed a co-distilled Swin transformer that uses a noisy, momentum-updated teacher to guide selective masking for MIM.
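A minimal sketch of the attention-guided masking idea (assumed mechanics: mask the patches the teacher attends to most, so the student must model salient regions):

```python
# Illustrative attention-guided patch masking for MIM.
import torch

def attention_guided_mask(attn: torch.Tensor, mask_ratio: float = 0.5):
    """attn: (batch, num_patches) teacher CLS-to-patch attention weights."""
    n_mask = int(attn.shape[1] * mask_ratio)
    idx = attn.topk(n_mask, dim=1).indices          # most-attended patches
    mask = torch.zeros_like(attn, dtype=torch.bool)
    mask.scatter_(1, idx, True)                     # True = patch masked out
    return mask

mask = attention_guided_mask(torch.rand(2, 196))    # 14 x 14 patch grid
```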
arXiv Detail & Related papers (2023-10-02T13:53:55Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
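Inference could look like ordinary causal-LM generation with transformers; the checkpoint path below is a placeholder, not a confirmed release location:

```python
# Sketch of querying a LLaMA-based evaluator; the model path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/instructscore-checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/instructscore-checkpoint")

prompt = ("Evaluate the candidate against the reference. List each error, "
          "its severity, and an overall score.\n"
          "Reference: Retract the tissue gently with the left grasper.\n"
          "Candidate: Pull hard on the tissue with the scissors.\n")
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))  # score + diagnostic report
```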
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- Improving Large Language Models for Clinical Named Entity Recognition via Prompt Engineering [20.534197056683695]
This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks.
We developed a task-specific prompt framework that includes baseline prompts, annotation guideline-based prompts, error analysis-based instructions, and annotated samples.
We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
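A hypothetical assembly of that layered prompt (baseline instruction, guidelines, error-analysis notes, annotated samples); the entity types and wording are illustrative, not the study's materials:

```python
# Hypothetical layered NER prompt builder.
def build_ner_prompt(note: str, guidelines: str, examples: list[dict],
                     error_notes: str = "") -> str:
    shots = "\n".join(f"Note: {ex['note']}\nEntities: {ex['entities']}"
                      for ex in examples)
    return ("Extract clinical named entities (problem, treatment, test).\n"
            f"Annotation guidelines:\n{guidelines}\n"
            f"Common errors to avoid:\n{error_notes}\n"
            f"Examples:\n{shots}\n"
            f"Note: {note}\nEntities:")

print(build_ner_prompt(
    note="Patient started metformin for type 2 diabetes.",
    guidelines="Tag drugs as treatment; diagnoses as problem.",
    examples=[{"note": "CT showed a 2 cm nodule.",
               "entities": "[test: CT] [problem: 2 cm nodule]"}],
    error_notes="Do not tag negated findings."))
```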
arXiv Detail & Related papers (2023-03-29T02:46:18Z)
- Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora [70.46867541361982]
We consider TRILL, a general non-semantic speech representation trained with a self-supervised criterion based on triplet loss.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
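A minimal triplet-loss training step in the spirit of TRILL (the encoder and window sampling are illustrative stand-ins, not TRILL's architecture):

```python
# Triplet loss on embeddings of audio windows: anchor/positive are temporally
# close, the negative is distant (random tensors stand in for real features).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = encoder(torch.randn(16, 64))
positive = encoder(torch.randn(16, 64))   # window near the anchor
negative = encoder(torch.randn(16, 64))   # window far from the anchor
loss = triplet(anchor, positive, negative)
loss.backward()
```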
arXiv Detail & Related papers (2021-09-23T00:43:32Z)