Fugu-MT Paper Translation (Abstract): SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

Paper overview: SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

arxiv url: http://arxiv.org/abs/2603.13024v1
Date: Fri, 13 Mar 2026 14:32:41 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.120086
Title: SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation
Title (reference translation): SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation
Authors: Sampath Rapuri, Lalithkumar Seenivasan, Dominik Schneider, Roger Soberanis-Mukul, Yufan He, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Pengfei Guo, Daguang Xu, Mathias Unberath
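Abstract summary: A surgical world model capable of generating realistic surgical action videos can address fundamental challenges in surgical AI and simulation. Current video generation methods require expensive annotations or complex structured intermediates as conditioning signals at inference. Surgical Action World (SAW) is a step toward surgical action modeling through video diffusion conditioned on four lightweight signals.
Reference score (independently computed attention measure): 13.94653131033701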
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation -- from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) -- a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.
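Abstract (reference translation): A surgical world model that can generate realistic surgical action videos with precise control over tool-tissue interactions could address fundamental challenges in surgical AI and simulation, ranging from data scarcity and rare-event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, which form the core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference time, which limits their scalability. Other approaches show limited temporal consistency in complex laparoscopic scenes and lack sufficient realism. We propose Surgical Action World (SAW), a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, a tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion as trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, using a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. We further demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition on real test data (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%), and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectories points toward a visually faithful simulation engine.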
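The abstract reports per-action F1-scores (for example, clipping and cutting). As a reading aid, the following is a minimal sketch of how a per-class F1-score is conventionally computed from predicted and true action labels. It is not the authors' evaluation code: the function name `f1_for_class` and the toy labels are hypothetical, and the paper's exact evaluation protocol is not described on this page. The sketch also shows why a class with no correct predictions scores exactly 0, which is consistent with a rare action starting at 0.00%.

```python
def f1_for_class(y_true, y_pred, label):
    """Return the F1-score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    if tp == 0:
        # No correct prediction for this class: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, invented for illustration only (not data from the paper).
y_true = ["grasp", "clip", "clip", "cut", "grasp", "clip"]
y_pred = ["grasp", "clip", "grasp", "grasp", "grasp", "clip"]

for action in ("clip", "cut", "grasp"):
    print(action, round(f1_for_class(y_true, y_pred, action), 4))
```

In this toy example the rare "cut" class is never predicted, so its F1-score is 0 even though overall accuracy is moderate. That is the situation in which adding more training examples of a rare action is typically expected to help.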

Related papers