Deep Multimodal Fusion for Surgical Feedback Classification
- URL: http://arxiv.org/abs/2312.03231v1
- Date: Wed, 6 Dec 2023 01:59:47 GMT
- Title: Deep Multimodal Fusion for Surgical Feedback Classification
- Authors: Rafal Kocielnik, Elyssa Y. Wong, Timothy N. Chu, Lydia Lin, De-An
Huang, Jiayun Wang, Anima Anandkumar, Andrew J. Hung
- Abstract summary: We leverage a clinically-validated five-category classification of surgical feedback.
We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities.
The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale.
- Score: 70.53297887843802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantification of real-time informal feedback delivered by an experienced
surgeon to a trainee during surgery is important for skill improvements in
surgical training. Such feedback in the live operating room is inherently
multimodal, consisting of verbal conversations (e.g., questions and answers) as
well as non-verbal elements (e.g., through visual cues like pointing to
anatomic elements). In this work, we leverage a clinically-validated
five-category classification of surgical feedback: "Anatomic", "Technical",
"Procedural", "Praise" and "Visual Aid". We then develop a multi-label machine
learning model to classify these five categories of surgical feedback from
inputs of text, audio, and video modalities. The ultimate goal of our work is
to help automate the annotation of real-time contextual surgical feedback at
scale. Our automated classification of surgical feedback achieves AUCs ranging
from 71.5 to 77.6 with the fusion improving performance by 3.1%. We also show
that high-quality manual transcriptions of feedback audio from experts improve
AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future
improvements. Empirically, we find that the Staged training strategy, which
first pre-trains each modality separately and then trains them jointly, is
more effective than training all modalities together from the start. We also present
intuitive findings on the importance of modalities for different feedback
categories. This work offers an important first look at the feasibility of
automated classification of real-world live surgical feedback based on text,
audio, and video modalities.
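The abstract above describes a multi-label classifier over text, audio, and video with fusion of the modalities and a Staged training schedule (per-modality pre-training followed by joint training). Below is a minimal, illustrative sketch of that general recipe in PyTorch; the embedding dimensions, layer widths, optimizers, and epoch counts are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: late fusion of pre-computed text, audio, and video
# embeddings for multi-label surgical-feedback classification, trained in two
# stages (per-modality pre-training, then joint training). Embedding sizes,
# widths, optimizers, and epoch counts are assumptions, not the paper's setup.
import torch
import torch.nn as nn

FEEDBACK_CLASSES = ["Anatomic", "Technical", "Procedural", "Praise", "Visual Aid"]
N_LABELS = len(FEEDBACK_CLASSES)

class ModalityHead(nn.Module):
    """Encodes one modality's embedding and predicts the five labels on its own."""
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, N_LABELS)

    def forward(self, x):
        h = self.encoder(x)
        return h, self.classifier(h)

class FusionClassifier(nn.Module):
    """Concatenates per-modality representations and predicts the labels jointly."""
    def __init__(self, dims=(768, 512, 1024), hidden: int = 256):
        super().__init__()
        self.heads = nn.ModuleList([ModalityHead(d, hidden) for d in dims])
        self.fusion = nn.Linear(hidden * len(dims), N_LABELS)

    def forward(self, text_emb, audio_emb, video_emb):
        feats = [head(x)[0] for head, x in zip(self.heads, (text_emb, audio_emb, video_emb))]
        return self.fusion(torch.cat(feats, dim=-1))

def staged_training(model, per_modality_loaders, joint_loader, epochs=(5, 5)):
    """Stage 1: train each ModalityHead alone. Stage 2: train the fused model."""
    bce = nn.BCEWithLogitsLoss()  # standard multi-label objective
    for head, loader in zip(model.heads, per_modality_loaders):
        opt = torch.optim.Adam(head.parameters(), lr=1e-4)
        for _ in range(epochs[0]):
            for x, y in loader:          # x: one modality's embedding, y: 5-dim multi-hot label
                opt.zero_grad()
                bce(head(x)[1], y).backward()
                opt.step()
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(epochs[1]):
        for (t, a, v), y in joint_loader:  # all three modalities per sample
            opt.zero_grad()
            bce(model(t, a, v), y).backward()
            opt.step()
```

A staged schedule like this lets each modality encoder reach a reasonable unimodal solution before the fusion layer has to reconcile them, which matches the abstract's finding that staged training outperforms training all modalities together from the start.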
Related papers
- EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding.
Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z) - Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment [65.70317151363204]
This work introduces the first framework for reconstructing surgical dialogue from unstructured real-world recordings.
In surgical training, the formative verbal feedback that trainers provide to trainees during live surgeries is crucial for ensuring safety, correcting behavior immediately, and facilitating long-term skill acquisition.
Our framework integrates voice activity detection, speaker diarization, and automated speech recognition, with a novel enhancement that removes hallucinations (a minimal sketch of such a pipeline appears after this list).
arXiv Detail & Related papers (2024-12-01T10:35:12Z) - Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment [66.6041949490137]
We propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness.
Our findings show that both transcribed feedback and surgical video are individually predictive of trainee behavior changes.
Our results demonstrate the potential of multi-modal learning to advance the automated assessment of surgical feedback.
arXiv Detail & Related papers (2024-11-17T00:13:00Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - Surgment: Segmentation-enabled Semantic Search and Creation of Visual
Question and Feedback to Support Video-Based Surgery Learning [4.509082876666929]
Surgment is a system that helps expert surgeons create exercises with feedback based on surgery recordings.
The segmentation pipeline enables functionalities to create visual questions and feedback desired by surgeons.
In an evaluation study with 11 surgeons, participants applauded the search-by-sketch approach for identifying frames of interest and found the resulting image-based questions and feedback to be of high educational value.
arXiv Detail & Related papers (2024-02-27T21:42:23Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics.
We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals.
We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - Quantification of Robotic Surgeries with Vision-Based Deep Learning [45.165919577877695]
We propose a unified deep learning framework, entitled Roboformer, which operates exclusively on videos recorded during surgery.
We validated our framework on four video-based datasets of two commonly-encountered types of steps within minimally-invasive robotic surgeries.
arXiv Detail & Related papers (2022-05-06T06:08:35Z)
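Several of the related papers above rely on transcriptions of operating-room audio, and the Automating Feedback Analysis entry names the concrete stages: voice activity detection, speaker diarization, and automatic speech recognition with hallucination removal. The sketch below shows one way such a pipeline can be assembled, assuming the open-source `openai-whisper` and `pyannote.audio` packages; the model names, the overlap-based speaker attribution, and the simple repetition-based hallucination filter are illustrative assumptions, not that paper's method.

```python
# Illustrative sketch only: diarization + ASR for feedback transcription.
# Assumes the `openai-whisper` and `pyannote.audio` packages; model names and
# the crude hallucination filter are assumptions, not any paper's method.
import whisper
from pyannote.audio import Pipeline

def transcribe_with_speakers(audio_path: str, hf_token: str):
    # Speaker diarization (includes its own voice-activity detection).
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = diarizer(audio_path)

    # Automatic speech recognition with segment-level timestamps.
    asr = whisper.load_model("base")
    result = asr.transcribe(audio_path)

    utterances = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        # Crude hallucination filter: drop segments that are one phrase
        # repeated many times (a common ASR failure mode on silence/noise).
        words = text.lower().split()
        if words and len(set(words)) / len(words) < 0.3:
            continue
        # Attribute the segment to the speaker whose turns overlap it most.
        speaker, best_overlap = None, 0.0
        for turn, _, label in diarization.itertracks(yield_label=True):
            overlap = min(turn.end, seg["end"]) - max(turn.start, seg["start"])
            if overlap > best_overlap:
                speaker, best_overlap = label, overlap
        utterances.append({"speaker": speaker, "start": seg["start"],
                           "end": seg["end"], "text": text})
    return utterances
```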