Specialized Foundation Models for Intelligent Operating Rooms
- URL: http://arxiv.org/abs/2505.12890v2
- Date: Fri, 04 Jul 2025 12:31:31 GMT
- Title: Specialized Foundation Models for Intelligent Operating Rooms
- Authors: Ege Özsoy, Chantal Pellegrini, David Bani-Harouni, Kun Yuan, Matthias Keicher, Nassir Navab
- Abstract summary: We introduce ORQA, a multimodal foundation model unifying visual, auditory, and structured data for holistic surgical understanding. We benchmark ORQA against generalist vision-language models, including ChatGPT and Gemini, and show that while they struggle to perceive surgical scenes, ORQA delivers substantially stronger, consistent performance. This work establishes a foundation for the next wave of intelligent surgical solutions, enabling surgical teams and medical technology providers to create smarter and safer operating rooms.
- Score: 45.775571784374726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical procedures unfold in complex environments demanding coordination between surgical teams, tools, imaging and, increasingly, intelligent robotic systems. Ensuring safety and efficiency in ORs of the future requires intelligent systems, like surgical robots, smart instruments and digital copilots, capable of understanding the complex activities and hazards of surgeries. Yet existing computational approaches lack the breadth and generalization needed for comprehensive OR understanding. We introduce ORQA, a multimodal foundation model unifying visual, auditory, and structured data for holistic surgical understanding. ORQA's question-answering framework empowers diverse tasks, serving as an intelligence core for a broad spectrum of surgical technologies. We benchmark ORQA against generalist vision-language models, including ChatGPT and Gemini, and show that while they struggle to perceive surgical scenes, ORQA delivers substantially stronger, consistent performance. Recognizing the extensive range of deployment settings across clinical practice, we design and release a family of smaller ORQA models tailored to different computational requirements. This work establishes a foundation for the next wave of intelligent surgical solutions, enabling surgical teams and medical technology providers to create smarter and safer operating rooms.
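To make the question-answering framing described in the abstract more concrete, the following is a minimal illustrative sketch of how a multimodal OR query might be posed to such a model. It is a hypothetical calling pattern only: the class names, constructor arguments, and modality keys are assumptions for illustration and are not taken from the paper or its released code.

```python
# Illustrative sketch only: MultimodalORQA, ORQuery, and the modality keys below
# are hypothetical placeholders, not the authors' published API.
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ORQuery:
    """A single operating-room question bundled with whatever signals are available."""
    question: str
    inputs: Dict[str, Any] = field(default_factory=dict)  # e.g. "video", "audio", "tracking"


class MultimodalORQA:
    """Toy stand-in for a unified OR question-answering model."""

    def __init__(self, size: str = "base"):
        # The paper describes a family of differently sized models; "size" mimics
        # choosing a variant that fits the deployment's compute budget.
        self.size = size

    def answer(self, query: ORQuery) -> str:
        # A real model would fuse the modalities and decode an answer; here we
        # only report what was received, to show the calling pattern.
        modalities = ", ".join(sorted(query.inputs)) or "no signals"
        return f"[{self.size} model] answer to '{query.question}' using {modalities}"


if __name__ == "__main__":
    model = MultimodalORQA(size="small")
    q = ORQuery(
        question="Which surgical phase is currently underway?",
        inputs={"video": "frames.npy", "audio": "room_mic.wav", "tracking": {"tool": "scalpel"}},
    )
    print(model.answer(q))
```

The point of the sketch is the single question-answering entry point over heterogeneous inputs, which is how the abstract positions ORQA as an "intelligence core" for different surgical technologies.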
Related papers
- Beyond Rigid AI: Towards Natural Human-Machine Symbiosis for Interoperative Surgical Assistance [6.832434059337678]
This work introduces a novel Perception Agent that enables a more natural human-machine interaction in real-time surgical assistance. Our agent offers the flexibility to segment both known and unseen elements in the surgical scene through intuitive interaction.
arXiv Detail & Related papers (2025-07-30T20:42:24Z) - SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement [8.337819078911405]
SurgVisAgent is an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). It dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks. We construct a benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models.
arXiv Detail & Related papers (2025-07-03T03:00:26Z) - Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z) - UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities [53.76854299076118]
UniversalRAG is a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. We propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over various modality-specific and unified baselines.
arXiv Detail & Related papers (2025-04-29T13:18:58Z) - Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems [132.77459963706437]
This book provides a comprehensive overview, framing intelligent agents within modular, brain-inspired architectures. It explores self-enhancement and adaptive evolution mechanisms, showing how agents autonomously refine their capabilities. It also examines the collective intelligence emerging from agent interactions, cooperation, and societal structures.
arXiv Detail & Related papers (2025-03-31T18:00:29Z) - SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot [3.487327636814225]
SurgBox is an agent-driven sandbox framework to enhance the cognitive capabilities of surgeons in immersive surgical simulations. In particular, we devise Surgery Copilot, an AI-driven assistant to actively coordinate the surgical information stream and support clinical decision-making.
arXiv Detail & Related papers (2024-12-06T17:07:27Z) - Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical, Introspective Multi-Agent Framework for Open-Domain Question Answering [0.0]
In the chemical and process industries, Process Flow Diagrams (PFDs) and Piping and Instrumentation Diagrams (P&IDs) are critical for design, construction, and maintenance.
Recent advancements in Generative AI have shown promise in understanding and interpreting process diagrams for Visual Question Answering (VQA).
We propose a secure, on-premises enterprise solution using a hierarchical, multi-agent Retrieval Augmented Generation (RAG) framework.
arXiv Detail & Related papers (2024-08-24T19:34:04Z) - GP-VLS: A general-purpose vision language model for surgery [0.5249805590164902]
GP-VLS is a general-purpose vision language model for surgery.
It integrates medical and surgical knowledge with visual scene understanding.
We show GP-VLS significantly outperforms open- and closed-source models on surgical vision-language tasks.
arXiv Detail & Related papers (2024-07-27T17:27:05Z) - ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - BEACON: A Bayesian Optimization Strategy for Novelty Search in Expensive Black-Box Systems [1.204357447396532]
Novelty search (NS) refers to a class of exploration algorithms that automatically uncover diverse system behaviors through simulations or experiments. We propose a sample-efficient NS method inspired by Bayesian optimization principles. We show that BEACON comprehensively outperforms existing baselines by finding substantially larger sets of diverse behaviors under limited sampling budgets.
arXiv Detail & Related papers (2024-06-05T20:23:52Z) - VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons [29.783300422432763]
We propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention.
We devise a Surgical-Calling Tuning strategy to enable the VS-Assistant to understand surgical intentions.
arXiv Detail & Related papers (2024-05-14T02:05:36Z) - ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling [41.30327565949726]
We introduce ORacle, an advanced vision-language model designed for holistic OR domain modeling.
It incorporates multi-view and temporal capabilities and can leverage external knowledge during inference, enabling it to adapt to previously unseen surgical scenarios.
In rigorous testing on scene graph generation and downstream tasks on the 4D-OR dataset, ORacle not only demonstrates state-of-the-art performance but does so while requiring less data than existing models.
arXiv Detail & Related papers (2024-04-10T14:24:10Z) - Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks.
We propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z) - Robotic Navigation Autonomy for Subretinal Injection via Intelligent Real-Time Virtual iOCT Volume Slicing [88.99939660183881]
We propose a framework for autonomous robotic navigation for subretinal injection.
Our method consists of an instrument pose estimation method, an online registration between the robotic and the iOCT system, and trajectory planning tailored for navigation to an injection target.
Our experiments on ex-vivo porcine eyes demonstrate the precision and repeatability of the method.
arXiv Detail & Related papers (2023-01-17T21:41:21Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks [73.63892022944198]
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z) - Multi-objective Asynchronous Successive Halving [10.632606255280649]
We propose algorithms that extend asynchronous successive halving (ASHA) to the multi-objective (MO) setting.
Our empirical analysis shows that MO ASHA enables MO HPO at scale.
Our algorithms establish new baselines for future research in the area.
arXiv Detail & Related papers (2021-06-23T19:39:31Z) - Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims at providing a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z) - Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)