Fugu-MT Paper Translation (Abstract): Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

Paper overview: Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

arxiv url: http://arxiv.org/abs/2605.11208v2
Date: Sat, 16 May 2026 12:46:40 GMT
Status: Translation complete
System update date: 2026-05-19 23:51:08.268976
Title: Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
Authors: Kedi Sun, Chaohui Dang, Yue Feng, James Glasbey, Theodoros N. Arvanitis, Le Zhang
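Abstract summary: We propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a lightweight temporal adapter. Experiments show that the proposed approach achieves consistent gains over MLLM baselines and the best overall performance.
Reference score (independently calculated attention score): 7.606404030331724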
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
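The abstract describes compressing a long frame sequence into a few prefix tokens through short-to-long-range temporal aggregation with cross-level gated fusion. The following is a minimal sketch of that general idea only, not the authors' implementation: the function names, the pooling radii, the evenly spaced anchors, and the fixed sigmoid gate are all assumptions made for illustration. A real adapter would learn its gate, and the text-conditioned dual cross-attention mentioned in the abstract is omitted entirely.

```python
import math


def window_mean(frames, center, radius):
    """Average frame features in [center - radius, center + radius], clipped to the sequence."""
    lo = max(0, center - radius)
    hi = min(len(frames), center + radius + 1)
    dim = len(frames[0])
    return [sum(frames[t][d] for t in range(lo, hi)) / (hi - lo) for d in range(dim)]


def level_tokens(frames, num_tokens, radius):
    """Compress T frames into num_tokens tokens, each pooled around an evenly spaced anchor."""
    total = len(frames)
    anchors = [min(total - 1, int((k + 0.5) * total / num_tokens)) for k in range(num_tokens)]
    return [window_mean(frames, a, radius) for a in anchors]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(fine, coarse, scale=1.0):
    """Blend two pyramid levels token by token with a per-dimension gate.

    The gate here is a fixed function of the difference between levels
    (an illustrative stand-in for a learned gating layer).
    """
    fused = []
    for fine_tok, coarse_tok in zip(fine, coarse):
        tok = []
        for f, c in zip(fine_tok, coarse_tok):
            g = sigmoid(scale * (f - c))
            tok.append(g * f + (1.0 - g) * c)  # convex combination of the two levels
        fused.append(tok)
    return fused


def aggregate(frames, num_tokens=4, radii=(1, 2, 4)):
    """Fuse short-range tokens with progressively longer-range ones."""
    fused = level_tokens(frames, num_tokens, radii[0])
    for radius in radii[1:]:
        fused = gated_fuse(fused, level_tokens(frames, num_tokens, radius))
    return fused


# 16 toy "frames" with 2 features each, compressed to 4 prefix tokens.
frames = [[float(t), float(t % 2)] for t in range(16)]
prefix = aggregate(frames)
print(len(prefix), len(prefix[0]))  # prints "4 2"
```

Because each fusion step is a convex combination, every output value stays within the range of the input features, and the number of output tokens is independent of the number of input frames.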
