Fugu-MT Paper Translation (Summary): Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

Paper summary: Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

arxiv url: http://arxiv.org/abs/2603.25778v1
Date: Thu, 26 Mar 2026 14:06:48 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.210345
Title: Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Title (reference translation): Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Authors: Yuan Zhang, Sihao Dou, Kai Hu, Shuhua Deng, Chunhong Cao, Fen Xiao, Xieping Gao
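Abstract summary: The authors propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, then perceives their evolution across frames to model contextual semantics. Experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks.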
Reference score (attention level, computed independently by this site): 18.349979396713646
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at https://github.com/MLMIP/FPRL.
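Abstract (reference translation): Endoscopic video analysis is essential for early gastrointestinal screening, but high-quality annotations are scarce. Self-supervised video pre-training is promising, yet existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics that are critical to clinical decision-making. To address this challenge, the authors propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL adopts a hierarchical semantic modeling mechanism that explicitly distinguishes the two types of semantics and learns them collaboratively. Specifically, it begins by capturing static semantics through teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and lets the model concentrate on lesion-related local semantics. Contextual semantics are then derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and model structured inter-frame evolution, reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness for endoscopic video representation learning. The code is available at https://github.com/MLMIP/FPRL.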
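Illustrative sketch (not the authors' implementation): the abstract names multi-view sparse sampling and teacher-prior adaptive masking (TPAM) but gives no algorithmic details. The Python below shows one plausible reading under loudly stated assumptions: each view takes one random frame per equal-length segment of the video, and patches are chosen for masking with probability proportional to a teacher attention score. The function names `sparse_sample_views` and `attention_guided_mask`, the segment-based sampling, and the direction of the masking bias are all hypothetical; the repository linked in the abstract is the reference for what FPRL actually does.

```python
import random


def sparse_sample_views(num_frames, frames_per_view, num_views, rng):
    """Return `num_views` lists of frame indices, one random frame per segment.

    Assumption: "sparse sampling" is modelled as segment-based sampling.
    """
    if frames_per_view > num_frames:
        raise ValueError("frames_per_view must not exceed num_frames")
    # Boundaries that split [0, num_frames) into frames_per_view non-empty segments.
    bounds = [i * num_frames // frames_per_view for i in range(frames_per_view + 1)]
    return [
        [rng.randrange(bounds[i], bounds[i + 1]) for i in range(frames_per_view)]
        for _ in range(num_views)
    ]


def attention_guided_mask(teacher_attention, mask_ratio, rng):
    """Pick patch indices to mask, favouring patches with high teacher attention.

    Assumption: high-attention patches are masked more often. The paper may
    use a different rule. Sampling is weighted and without replacement
    (Efraimidis-Spirakis keys).
    """
    num_masked = round(mask_ratio * len(teacher_attention))
    keyed = []
    for idx, weight in enumerate(teacher_attention):
        # Zero-weight patches get a key below every positive-weight key,
        # so they are chosen only after all positive-weight patches.
        key = rng.random() ** (1.0 / weight) if weight > 0 else -1.0
        keyed.append((key, idx))
    keyed.sort(reverse=True)
    return sorted(idx for _, idx in keyed[:num_masked])


rng = random.Random(0)
# Two sparse views of a 64-frame clip, 8 frames each.
views = sparse_sample_views(num_frames=64, frames_per_view=8, num_views=2, rng=rng)
# Toy teacher attention over 8 patches; only four patches have non-zero attention.
attention = [0.0, 0.0, 0.4, 0.3, 0.0, 0.2, 0.1, 0.0]
masked = attention_guided_mask(attention, mask_ratio=0.5, rng=rng)
print(views)
print(masked)  # [2, 3, 5, 6]: exactly the four non-zero-attention patches
```

In this toy run the mask ratio of 0.5 selects four of eight patches, and because only four patches carry non-zero attention, the mask covers exactly those. With a denser attention map the selection is random but biased toward high-attention patches.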

Related papers