Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins
- URL: http://arxiv.org/abs/2503.21054v1
- Date: Wed, 26 Mar 2025 23:59:32 GMT
- Title: Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins
- Authors: Yiqing Shen, Chenjia Li, Bohan Liu, Cheng-Yi Li, Tito Porras, Mathias Unberath,
- Abstract summary: Analyzing operating room (OR) to derive quantitative insights into OR efficiency is important for hospitals.<n> Reasoning segmentation (RS) based on foundation models offers flexibility by enabling automated analysis of OR improvement from OR video feeds.<n>We present ORDiRS (Operating Room Digital twin representation for Reasoning), an LLM-free RS framework that reformulates RS into a "reason-retrieval-synthesize" paradigm.
- Score: 7.34430213311229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and do not offer the flexibility necessary to accommodate the OR workflow analysis needs of various OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Due to the reliance on large language model (LLM) fine-tuning, current RS approaches struggle with reasoning about semantic/spatial relationships and show limited generalization to OR video due to variations in visual characteristics and domain-specific terminology. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a "reason-retrieval-synthesize" paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that our ORDiRS achieves a cIoU improvement of 6.12%-9.74% compared to the existing state-of-the-arts.
Related papers
- Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off [29.48293757752123]
We present a progressive training pipeline that integrates Perception andReasoning capabilities.<n>We identify Temporal Drift in long-context audio, where extended reasoning desynchronizes the model from acoustic timestamps.<n>This report details the architecture, the data-efficient training recipe, and a diagnostic analysis of the trade-offs between robust perception and structured reasoning.
arXiv Detail & Related papers (2026-02-27T06:56:50Z) - Relatron: Automating Relational Machine Learning over Relational Databases [50.94254514286021]
We present a study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks.<n>Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and accuracy is an unreliable guide for choice architecture.
arXiv Detail & Related papers (2026-02-26T02:45:22Z) - Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z) - Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method [96.63801368613177]
We present a new task that elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning.<n>We present a new dataset with 8,641 videos, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly understanding.<n>Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making.
arXiv Detail & Related papers (2026-01-15T08:09:04Z) - Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning.<n>We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL.<n>We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - Beyond Relational: Semantic-Aware Multi-Modal Analytics with LLM-Native Query Optimization [35.60979104539273]
Nirvana is a multi-modal data analytics framework that incorporates programmable semantic operators.<n>Nirvana is able to reduce end-to-end runtime by 10%--85% and reduces system processing costs by 76% on average.
arXiv Detail & Related papers (2025-11-25T01:41:49Z) - CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text.<n>CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z) - Exploratory Semantic Reliability Analysis of Wind Turbine Maintenance Logs using Large Language Models [0.0]
This paper addresses the gap in leveraging modern large language models (LLMs) for more complex reasoning tasks.<n>We introduce an exploratory framework that uses LLMs to move beyond classification and perform semantic analysis.<n>The results demonstrate that LLMs can function as powerful "reliability co-pilots," moving beyond labelling to synthesise textual information and actionable, expert-level hypotheses.
arXiv Detail & Related papers (2025-09-26T14:00:20Z) - I2I-STRADA -- Information to Insights via Structured Reasoning Agent for Data Analysis [0.0]
Real-world data analysis requires a consistent cognitive workflow.<n>We introduce I2I-STRADA, an agentic architecture designed to formalize this reasoning process.
arXiv Detail & Related papers (2025-07-23T18:58:42Z) - Leveraging Knowledge Graphs and LLM Reasoning to Identify Operational Bottlenecks for Warehouse Planning Assistance [1.2749527861829046]
Our framework integrates Knowledge Graphs (KGs) and Large Language Model (LLM)-based agents.<n>It transforms raw DES data into a semantically rich KG, capturing relationships between simulation events and entities.<n>An LLM-based agent uses iterative reasoning, generating interdependent sub-questions. For each sub-question, it creates Cypher queries for KG interaction, extracts information, and self-reflects to correct errors.
arXiv Detail & Related papers (2025-07-23T07:18:55Z) - Efficient Medical VIE via Reinforcement Learning [10.713109515157475]
Visual Information Extraction (VIE) converts unstructured document images into structured formats like, structured formats like, critical for medical applications like report analysis and online consultations.<n>Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct generation.<n>We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples.
arXiv Detail & Related papers (2025-06-16T11:10:25Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - Privacy-Preserving Operating Room Workflow Analysis using Digital Twins [38.744671293771695]
We propose a two-stage pipeline for privacy-preserving operating room (OR) video analysis and event detection.
In the first stage, we leverage vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) of the OR from conventional RGB videos.
In the second stage, we employ the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps for OR event detection.
arXiv Detail & Related papers (2025-04-17T00:46:06Z) - SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence [16.584722724845182]
Integration of Vision-Language Models in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies.<n>We present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery.
arXiv Detail & Related papers (2025-03-13T11:23:13Z) - New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding.<n>We introduce a new REC dataset with two key features. First, it is designed with controllable difficulty levels, requiring fine-grained reasoning across object categories, attributes, and relationships.<n>Second, it incorporates negative text and images generated through fine-grained editing, explicitly testing a model's ability to reject non-existent targets.
arXiv Detail & Related papers (2025-02-27T13:58:44Z) - A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition [71.61103962200666]
Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora.
Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates.
We introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER.
arXiv Detail & Related papers (2025-02-25T23:30:43Z) - Turning Conversations into Workflows: A Framework to Extract and Evaluate Dialog Workflows for Service AI Agents [65.36060818857109]
We present a novel framework for extracting and evaluating dialog from historical interactions.
Our extraction process consists of two key stages: (1) a retrieval step to select relevant conversations based on key procedural elements, and (2) a structured workflow generation process using a question-answer-based chain-of-thought (QA-CoT) prompting.
arXiv Detail & Related papers (2025-02-24T16:55:15Z) - Research on the Application of Spark Streaming Real-Time Data Analysis System and large language model Intelligent Agents [1.4582633500696451]
This study explores the integration of Agent AI with LangGraph to enhance real-time data analysis systems in big data environments.<n>The proposed framework overcomes limitations of static, inefficient stateful computations, and lack of human intervention.<n>System architecture incorporates Apache Spark Streaming, Kafka, and LangGraph to create a high-performance sentiment analysis system.
arXiv Detail & Related papers (2024-12-10T05:51:11Z) - DiSCo Meets LLMs: A Unified Approach for Sparse Retrieval and Contextual Distillation in Conversational Search [19.694957365385896]
Conversational Search (CS) is the task of retrieving relevant documents from a corpus within a conversational context.
Current methods address this by distilling embeddings from human-rewritten queries to learn the context modeling task.
We propose a new distillation method, as a relaxation of the previous objective, unifying retrieval and context modeling.
arXiv Detail & Related papers (2024-10-18T17:03:17Z) - Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical, Introspective Multi-Agent Framework for Open-Domain Question Answering [0.0]
In the chemical and process industries, Process Flow Diagrams (PFDs) and Piping and Instrumentation Diagrams (P&IDs) are critical for design, construction, and maintenance.
Recent advancements in Generative AI have shown promise in understanding and interpreting process diagrams for Visual Question Answering (VQA)
We propose a secure, on-premises enterprise solution using a hierarchical, multi-agent Retrieval Augmented Generation (RAG) framework.
arXiv Detail & Related papers (2024-08-24T19:34:04Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Source-Free Collaborative Domain Adaptation via Multi-Perspective
Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state MRI functional (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
But acquiring source data is challenging due to concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z) - Analytical Engines With Context-Rich Processing: Towards Efficient
Next-Generation Analytics [12.317930859033149]
We envision an analytical engine co-optimized with components that enable context-rich analysis.
We aim for a holistic pipeline cost- and rule-based optimization across relational and model-based operators.
arXiv Detail & Related papers (2022-12-14T21:46:33Z) - Learning Contextualized Document Representations for Healthcare Answer
Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.