Fugu-MT Paper Translation (Abstract): Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection

Paper overview: Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection

arxiv url: http://arxiv.org/abs/2511.13784v1
Date: Sun, 16 Nov 2025 09:59:57 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:52.718246
Title: Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection
Authors: Yogesh Kumar, Anand Mishra
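Abstract summary: Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples. The proposed method achieves AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5% (VidVRD) in the 5-shot setting.
Reference score (independently computed attention measure): 5.263065070942166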
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals, which are often computationally expensive and require task-specific training. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in a few-shot setting. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Our approach achieves AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5% (VidVRD) in the 5-shot setting. Further results demonstrate improvements in 1-shot, 3-shot, and 10-shot configurations. We make the code public at: https://github.com/yogesh-iitj/fs-video-vit
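Illustrative sketch: the abstract describes a filtering mechanism that propagates only high-confidence object features across frames. The Python below is a minimal illustration of that general idea, not the authors' implementation, which may differ substantially (see the linked repository). The names `ObjectFeature` and `propagate_high_confidence`, the `threshold` cutoff, and the `max_memory` cap are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectFeature:
    """One detected object's feature vector with its detection confidence."""
    vector: List[float]
    confidence: float


def propagate_high_confidence(
    frames: List[List[ObjectFeature]],
    threshold: float = 0.5,  # hypothetical cutoff, not a value from the paper
    max_memory: int = 4,     # hypothetical cap on how many features are carried
) -> List[List[ObjectFeature]]:
    """Return, for each frame, the features carried over from earlier frames.

    Only detections with confidence >= threshold enter the memory, so
    low-confidence (likely noisy) features never reach later frames.
    """
    memory: List[ObjectFeature] = []
    carried: List[List[ObjectFeature]] = []
    for detections in frames:
        # Context available to this frame: whatever survived earlier filtering.
        carried.append(list(memory))
        # Filter step: keep only the confident detections of the current frame.
        confident = [f for f in detections if f.confidence >= threshold]
        # Merge with the memory and keep the most confident entries only.
        merged = sorted(memory + confident, key=lambda f: f.confidence, reverse=True)
        memory = merged[:max_memory]
    return carried


if __name__ == "__main__":
    demo_frames = [
        [ObjectFeature([0.1, 0.9], 0.92), ObjectFeature([0.4, 0.4], 0.20)],
        [ObjectFeature([0.2, 0.8], 0.35)],
        [ObjectFeature([0.3, 0.7], 0.81)],
    ]
    # Number of features carried into each frame.
    print([len(c) for c in propagate_high_confidence(demo_frames)])  # [0, 1, 1]
```

In this toy run the 0.20 and 0.35 detections are never propagated, which is the noise-reduction effect the abstract attributes to selective propagation. A real detector would fuse the carried features into the current frame's representation rather than simply list them.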
