Related papers: Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes

URL: http://arxiv.org/abs/2510.24332v1
Date: Tue, 28 Oct 2025 11:55:45 GMT
Title: Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
Authors: Jonas Hein, Lazaros Vlachopoulos, Maurits Geert Laurent Olthof, Bastian Sigrist, Philipp Fürnstahl, Matthias Seibold
Abstract summary: This work aims to enhance surgical scene representations by integrating 3D acoustic information. We propose a novel framework for generating 4D audio-visual representations of surgical scenes. The proposed framework enables richer contextual understanding and provides a foundation for future intelligent surgical systems.
Score: 0.5146940511526402
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
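The abstract describes projecting acoustic localization information onto dynamic point clouds but gives no implementation details. As a minimal sketch of one way such a projection could work, the Python below assumes the microphone array yields a 2D beamforming map that can be treated like a pinhole image. The function names, the intrinsics `K`, and the transform `T_acoustic_from_world` are hypothetical and are not taken from the paper.

```python
import numpy as np


def project_acoustic_map(points, acoustic_map, K, T_acoustic_from_world):
    """Assign each 3D point the intensity of the acoustic-map pixel it projects to.

    points: (N, 3) point cloud in the world frame.
    acoustic_map: (H, W) beamforming map, treated as a pinhole image (assumption).
    K: (3, 3) pinhole intrinsics of the acoustic map (hypothetical).
    T_acoustic_from_world: (4, 4) rigid transform, world frame to acoustic frame.
    Returns an (N,) array; NaN marks points behind the array or outside the map.
    """
    n = points.shape[0]
    homogeneous = np.hstack([points, np.ones((n, 1))])
    in_acoustic = (T_acoustic_from_world @ homogeneous.T).T[:, :3]

    intensities = np.full(n, np.nan)
    in_front = in_acoustic[:, 2] > 1e-9  # keep points in front of the array

    # Perspective projection onto the acoustic map's pixel grid.
    uvw = (K @ in_acoustic[in_front].T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    height, width = acoustic_map.shape
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    # Map the surviving projections back to indices of the original points.
    idx = np.flatnonzero(in_front)[inside]
    intensities[idx] = acoustic_map[v[inside], u[inside]]
    return intensities


def localize_source(points, intensities):
    """Return the 3D point with the highest projected acoustic intensity."""
    return points[np.nanargmax(intensities)]
```

Calling `project_acoustic_map` once per RGB-D frame would give a per-point intensity for each time step, and `localize_source` picks the loudest visible point. A real system would additionally need array-to-camera calibration and occlusion handling, which this sketch omits.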

Related papers

Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion [54.359489807885616]
SurgRef is a motion-guided framework that grounds free-form language expressions in how instruments move rather than what they look like. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense temporal masks and rich motion expressions.
arXiv Detail & Related papers (2026-01-18T02:14:08Z)
SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation [4.97436124491469]
We introduce a speech-guided collaborative perception framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set vision foundation models (VFMs). A key component of this framework is a collaborative perception agent, which generates top candidates of VFM-generated segmentation. Instruments themselves serve as interactive pointers to label additional elements of the surgical scene.
arXiv Detail & Related papers (2025-09-12T23:36:52Z)
SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting [45.16104996137126]
We present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction for both texture and semantic features. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods.
arXiv Detail & Related papers (2025-06-29T15:55:01Z)
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [67.8359850515282]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z)
Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation. A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose. The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
BASED: Bundle-Adjusting Surgical Endoscopic Dynamic Video Reconstruction using Neural Radiance Fields [5.773068487121897]
Reconstruction of deformable scenes from endoscopic videos is important for many applications. Our work adopts the Neural Radiance Fields (NeRF) approach to learning 3D implicit representations of scenes. We demonstrate this approach on endoscopic surgical scenes from robotic surgery.
arXiv Detail & Related papers (2023-09-27T00:20:36Z)
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception [15.927060244702686]
We present an efficient reconstruction pipeline for highly dynamic surgical scenes that runs at 28 fps. Specifically, we design a transformer-based stereoscopic depth perception for efficient depth estimation. We evaluate the proposed pipeline on two datasets, the public Hamlyn Centre Endoscopic Video dataset and our in-house DaVinci robotic surgery dataset.
arXiv Detail & Related papers (2021-07-01T05:57:41Z)
Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views. We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims at providing a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.