Fugu-MT Paper Translation (Abstract): CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

Paper overview: CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

arxiv url: http://arxiv.org/abs/2603.24539v1
Date: Wed, 25 Mar 2026 17:14:36 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.404313
Title: CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition
Title (reference translation): CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition
Authors: Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
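Abstract summary: CliPPER is a video-language pretraining framework trained on surgical lecture videos. The method is designed for fine-grained temporal video-text recognition. The model establishes a new state of the art across multiple public surgical benchmarks.
Reference score (attention score computed independently by this site): 46.36937077851682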
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.
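Abstract (reference translation): Video-language foundation models have proven highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce and complex downstream tasks often require precise temporal understanding. To address this, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a video-language pretraining framework trained on surgical lecture videos. The method is designed for fine-grained temporal video-text recognition and introduces several new pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose the Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which exploit temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. We also introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, the model establishes a new state of the art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions are available at https://github.com/CAMMA-public/CliPPER.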
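The abstract names a video-text contrastive objective and zero-shot recognition but does not spell out either. As a rough illustration only, the sketch below shows a generic symmetric contrastive (InfoNCE-style) loss over paired video and text embeddings, plus nearest-text zero-shot labeling. It is not the paper's VTC_CTX, COP, cycle-consistency, or FTM objective. The function names, the temperature value of 0.07, and the toy label names are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the class index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's loss).

    Pair i, (video_embs[i], text_embs[i]), is treated as the positive;
    every other pairing in the batch is treated as a negative.
    """
    n = len(video_embs)
    sim = [[cosine_similarity(v, t) / temperature for t in text_embs]
           for v in video_embs]
    # Video-to-text direction: each row should peak on its own column.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each column should peak on its own row.
    t2v = sum(cross_entropy_row([sim[j][i] for j in range(n)], i)
              for i in range(n)) / n
    return 0.5 * (v2t + t2v)


def zero_shot_label(video_emb, label_embs):
    """Return the label whose text embedding is closest to the video embedding."""
    return max(label_embs,
               key=lambda name: cosine_similarity(video_emb, label_embs[name]))


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    # Correctly paired embeddings should give a lower loss than mismatched ones.
    print(symmetric_contrastive_loss(videos, aligned_texts))
    print(symmetric_contrastive_loss(videos, swapped_texts))
    # Hypothetical phase labels with toy text embeddings.
    labels = {"dissection": [1.0, 0.0], "clipping": [0.0, 1.0]}
    print(zero_shot_label([0.9, 0.1], labels))
```

In practice the embeddings would come from trained video and text encoders. The toy two-dimensional vectors here only show that correctly paired embeddings yield a lower loss than mismatched ones.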

Related papers

From Phase Grounding to Intelligent Surgical Narratives [4.047840018793636]
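Video surgery timelines are an important part of tool-assisted surgery because they let surgeons focus on the key parts of a procedure. Current approaches rely on surgeons filling in post-operative (OP) reports, which are often vague, and manually annotating surgical videos. The proposed method aims to generate surgical timelines and narratives automatically, directly from surgical video.
Paper reference translation (metadata) (2026-03-05T22:44:24Z)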
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models [9.896951371033229]
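VideoPerceiver is a video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding. The authors construct training videos by extracting event-action keywords from captions, identifying the corresponding key frames, and replacing those key frames with adjacent frames. VideoPerceiver significantly outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare-event captioning benchmarks.
Paper reference translation (metadata) (2025-11-24T06:57:26Z)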
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
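Video-language pretrained models have shown remarkable success in guiding video question-answering tasks. Because video sequences are long, training large video-based models is considerably more costly than training image-based models. This motivates leveraging knowledge from image-based pretraining, despite the evident gap between the image and video domains.
Paper reference translation (metadata) (2023-08-16T15:00:50Z)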
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
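Recent advances in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. This work puts forward the idea that surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
Paper reference translation (metadata) (2023-07-27T22:38:12Z)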
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
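Video moment retrieval (VMR) depends on the correlation between vision and text, but existing methods rely on separate pretrained feature extractors for visual and textual understanding. This paper proposes a generic method, Visual-Dynamic Injection (VDI), to strengthen the model's understanding of video moments.
Paper reference translation (metadata) (2023-02-28T19:29:05Z)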
Temporal Perceiving Video-Language Pre-training [112.1790287726804]
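This work introduces a new text-video localization pretext task that enables fine-grained temporal and semantic alignment. Specifically, text-video localization consists of moment retrieval, which predicts the start and end boundaries in a video given a text description. The proposed method links fine-grained frame representations with word representations and implicitly distinguishes the representations of different instances within a single modality.
Paper reference translation (metadata) (2023-01-18T12:15:47Z)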
Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
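Video-and-language pretraining has shown promising improvements on a variety of downstream tasks. The paper proposes Align and Prompt, an efficient and effective video-and-language pretraining framework with improved cross-modal alignment. The code and pretrained models will be released.
Paper reference translation (metadata) (2021-12-17T15:55:53Z)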

The related-papers list is generated automatically from the titles and abstracts of papers on this site.
