Fugu-MT Paper Translation (Summary): SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

Paper summary: SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

arxiv url: http://arxiv.org/abs/2605.21132v1
Date: Wed, 20 May 2026 13:04:39 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.684718
Title: SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
Title (reference translation): SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
Authors: Jingyi He, Yue Zhou, Long Bai, Kun Yuan, Nassir Navab, Yuan Bi,
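Abstract summary: SurgOnAir is a vision-language model that processes frames sequentially without access to future frames and generates narration tokens as visual input arrives. The model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow.
Reference score (independently computed attention metric): 44.963317589774284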
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.
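Abstract (reference translation): Understanding surgical workflow in real time is the foundation of intelligent surgical embodiment, in which AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes such as fine instrument movements and evolving tissue states. However, existing methods remain offline or operate at coarse temporal scales, generating descriptions only after a clip has been processed, which prevents immediate reaction. SurgOnAir is a streaming vision-language model that processes frames sequentially without access to future frames and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built on the curated hierarchical dataset SurgOnAir-11k, which provides action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. In addition, special transition tokens are generated to explicitly mark state changes. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior, hierarchy-aware narrations. The code and dataset will be made public.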
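The streaming behaviour described in the abstract (frames consumed one at a time with no look-ahead, and special tokens marking state changes at the action, step, and phase levels) can be illustrated with a toy sketch. This is not the authors' implementation: per-frame label dictionaries stand in for the output of the vision-language model, and the class `StreamingNarrator`, the function `narrate`, and the token strings such as `<phase_transition>` are hypothetical names chosen for illustration. The sketch also assumes that the first frame counts as a transition at every level.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Tuple

# Hierarchy levels named in the abstract, ordered coarse to fine.
LEVELS = ("phase", "step", "action")


@dataclass
class StreamingNarrator:
    """Toy stand-in for a streaming model: it keeps only past state."""

    # Last label seen per level (hypothetical state, not from the paper).
    last: Dict[str, str] = field(default_factory=dict)

    def step(self, frame_labels: Dict[str, str]) -> List[str]:
        """Consume one frame's labels and return the tokens emitted for it."""
        tokens: List[str] = []
        for level in LEVELS:
            label = frame_labels[level]
            if self.last.get(level) != label:
                # Hypothetical transition token marking a state change.
                tokens.append(f"<{level}_transition>")
                tokens.append(label)
                self.last[level] = label
        return tokens


def narrate(frames: Iterable[Dict[str, str]]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (frame_index, tokens) as each frame arrives, without look-ahead."""
    narrator = StreamingNarrator()
    for index, labels in enumerate(frames):
        yield index, narrator.step(labels)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    demo = [
        {"phase": "dissection", "step": "expose", "action": "grasp"},
        {"phase": "dissection", "step": "expose", "action": "grasp"},
        {"phase": "dissection", "step": "expose", "action": "cut"},
    ]
    for index, tokens in narrate(demo):
        print(index, tokens)
```

Because `narrate` is a generator, tokens for a frame are available as soon as that frame has been consumed, which mirrors the frame-to-token idea only at the level of control flow; a real system would replace the label dictionaries with model predictions.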


Related papers