Fugu-MT paper translation (abstract): SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Paper overview: SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

arxiv url: http://arxiv.org/abs/2606.20140v1
Date: Thu, 18 Jun 2026 12:03:04 GMT
Status: translation complete
Last updated in system: 2026-06-19 18:23:39.838087
Title: SA-VIS: Sparse frame Annotations for training Video Instance Segmentation
Title (reference translation): SA-VIS: Sparse Frame Annotations for Training Video Instance Segmentation
Authors: Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool
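Abstract summary: Recent online video instance segmentation (VIS) methods have achieved impressive results and have become the preferred approach for segmenting instances in videos. However, the VIS training setup is expensive, requiring not only compute but also dense annotations. We argue that effectively modeling instances and their evolution in videos does not require densely annotated frames. This simple, low-compute module provides tremendous learning capability by using sparse video frame labels for end-to-end training.
Reference score (independently computed attention score): 58.2561806729996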
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup of VIS is expensive in the sense of compute as well as dense annotations required. In order to solve these major flaws, we argue that the effective modeling of the instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS) and an over 1% improvement in AP on the state-of-the-art in a limited annotations scenario.
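Abstract (reference translation): Recent online video instance segmentation (VIS) methods have achieved impressive results and have become the preferred approach for segmenting instances in videos. Despite the resurgence of impressive single-image models, online (or semi-online) VIS methods outperform single-image models (for example, those based on SAM) by using long sequences of densely annotated frames during training. However, such a VIS training setup is expensive, both in compute and in the dense annotations it requires. To address these major flaws, we argue that effectively modeling instances and their evolution in videos does not require densely annotated frames. We therefore propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder across multiple frames. This simple, low-compute module provides tremendous learning capability by using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific Instance Queries, Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, the simple design, which avoids complexity, effectively closes the accuracy gap between training on sparsely annotated and densely annotated video sequences. This amounts to a drop of only 0.4% in SA-VIS performance when annotations are used for just 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS), and an improvement of more than 1% in AP over the state of the art in a limited-annotation scenario.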
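The sparse-annotation setting in the abstract (labels for only 1/5 of the frames) can be illustrated with a minimal Python sketch. This is not the authors' method and it does not implement PFP: the abstract does not say how annotated frames are chosen, so the uniform stride used here is an assumption, and the names `sparsify_annotations` and `supervised_loss` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Optional, Sequence


def sparsify_annotations(
    frame_labels: Sequence[str], keep_every: int = 5, offset: int = 0
) -> List[Optional[str]]:
    """Keep the label of every `keep_every`-th frame and drop the rest.

    Assumption: annotated frames are picked at a uniform stride. The
    abstract only states that 1/5 of the images are annotated.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    return [
        label if i >= offset and (i - offset) % keep_every == 0 else None
        for i, label in enumerate(frame_labels)
    ]


def supervised_loss(
    per_frame_losses: Sequence[float], sparse_labels: Sequence[Optional[str]]
) -> float:
    """Average the loss over frames that still carry a label.

    Unlabeled frames contribute no supervision; a model could still read
    their features as temporal context.
    """
    labeled = [
        loss
        for loss, label in zip(per_frame_losses, sparse_labels)
        if label is not None
    ]
    if not labeled:
        return 0.0
    return sum(labeled) / len(labeled)


# A 10-frame clip with placeholder labels; only frames 0 and 5 stay annotated.
labels = [f"mask_{i}" for i in range(10)]
sparse = sparsify_annotations(labels, keep_every=5)
num_labeled = sum(1 for s in sparse if s is not None)

# Dummy per-frame losses: 1.0 for the first five frames, 3.0 for the rest.
loss = supervised_loss([1.0] * 5 + [3.0] * 5, sparse)
```

With `keep_every=5`, two of the ten frames keep their labels, matching the 1/5 ratio quoted in the abstract, and only those two frames enter the averaged loss.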


Related papers