Fugu-MT Paper Translation (Abstract): VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

Paper summary: VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

arxiv url: http://arxiv.org/abs/2603.07222v1
Date: Sat, 07 Mar 2026 14:05:26 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:14.098398
Title: VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization
Title (reference translation): VINO: Video-driven invariance for non-contextual objects via structural prior guidance
Authors: Seul-Ki Yeom, Marcel Simon, Eunbin Lee, Tae-Ho Kim
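Abstract summary: Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts, namely background textures and co-occurrence statistics. This paper proposes VINO, a teacher-student framework that learns robust image encoders from dense video. VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.
Reference score (independently computed attention score): 1.4518460893038065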
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts (background textures and co-occurrence statistics). While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views (not as semantic pseudo-labels), VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.
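The asymmetric views described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the zero fill value, the cross-entropy form of the loss, and the temperatures (0.04 and 0.1) are all assumptions made for illustration, since the abstract does not specify them. Images and instance masks are plain nested lists here, whereas a real pipeline would operate on tensors.

```python
import math

def foreground_union_view(image, instance_masks, fill=0.0):
    """Teacher input (assumed form): keep pixels covered by any instance mask,
    suppress everything else (the background)."""
    return [
        [pixel if any(mask[y][x] for mask in instance_masks) else fill
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]

def object_conditioned_view(image, instance_masks, keep, fill=0.0):
    """Student input (assumed form): keep the background and instance `keep`,
    blank out pixels that belong only to competing instances."""
    def is_competitor(y, x):
        if instance_masks[keep][y][x]:
            return False
        return any(mask[y][x] for i, mask in enumerate(instance_masks) if i != keep)
    return [
        [fill if is_competitor(y, x) else pixel for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]

def log_softmax(logits, temperature):
    """Numerically stable log-softmax with a temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(value - peak) for value in scaled))
    return [value - peak - log_total for value in scaled]

def distillation_loss(teacher_logits, student_logits, t_teacher=0.04, t_student=0.1):
    """Cross-entropy between teacher targets and student predictions.
    The loss form and temperatures are illustrative assumptions."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    student_log_probs = log_softmax(student_logits, t_student)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

# Toy 2x3 grayscale image with two instance masks (hypothetical data).
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mask_a = [[1, 0, 0], [1, 0, 0]]
mask_b = [[0, 0, 1], [0, 0, 0]]
teacher_view = foreground_union_view(image, [mask_a, mask_b])
student_view = object_conditioned_view(image, [mask_a, mask_b], keep=0)
```

In this sketch the teacher never sees background pixels, while the student sees the background but only one object at a time, so the student can match the teacher's target only by relying on the object itself.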
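Abstract (reference translation): Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts, namely background textures and co-occurrence statistics. Video provides rich temporal variation, but dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, which encourages representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video. Using a class-agnostic structural prior only to generate views, rather than as semantic pseudo-labels, VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain the surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable and pushes the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Attention visualization and unsupervised object discovery on PASCAL VOC demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.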
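For readers unfamiliar with the CorLoc figure quoted in the abstract, the sketch below shows how the metric is conventionally computed in unsupervised object discovery: the percentage of images whose single predicted box overlaps at least one ground-truth box with IoU of 0.5 or more. The 0.5 threshold and one-box-per-image setup are the common convention and are assumed here; the abstract does not state the exact evaluation protocol, and the boxes below are made-up toy data.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images where the predicted box matches any ground-truth box.

    predictions: one box per image.
    ground_truths: a list of boxes per image.
    """
    hits = sum(
        1
        for pred, gt_boxes in zip(predictions, ground_truths)
        if any(iou(pred, gt) >= threshold for gt in gt_boxes)
    )
    return 100.0 * hits / len(predictions)

# Toy data: the first prediction matches its ground truth, the second misses.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truths = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
score = corloc(predictions, ground_truths)
```

Under this convention a score of 34.8 would mean that roughly one image in three has its main object localized correctly.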
