Fugu-MT Paper Translation (Abstract): Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

Paper summary: Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

arxiv url: http://arxiv.org/abs/2604.12346v1
Date: Tue, 14 Apr 2026 06:32:24 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.288077
Title: Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization
Authors: Zanyi Wang, Fan Li, Dengyang Jiang, Liuzhuozheng Li, Yunhua Zhong, Guang Dai, Mengmeng Wang
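Abstract summary: This paper introduces ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters. ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks.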
Reference score (independently computed attention score): 24.301393950423897
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.
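The frozen-base-plus-adapter idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the adapter design, so the code below assumes a generic residual bottleneck adapter with a zero-initialised up-projection; the class names, dimensions, and the single-layer "base model" are hypothetical stand-ins, not the ST-GD implementation.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class FrozenLinear:
    """Hypothetical stand-in for one pre-trained layer; its weights are never updated."""

    def __init__(self, dim, rng):
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.W, x)

    def num_params(self):
        return len(self.W) * len(self.W[0])


class BottleneckAdapter:
    """Assumed adapter form: h + up(relu(down(h))), the only trainable part."""

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
        # Zero-initialised up-projection: the adapter starts as an identity map,
        # so the frozen model's behaviour is unchanged before any training.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.down, h)]  # ReLU in the bottleneck
        delta = matvec(self.up, z)
        return [hi + di for hi, di in zip(h, delta)]  # residual connection

    def num_trainable(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])


rng = random.Random(0)
base = FrozenLinear(8, rng)            # frozen: excluded from training
adapter = BottleneckAdapter(8, 2, rng)  # trainable: 2 * (8 * 2) weights

x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
print(adapter(base(x)) == base(x))                # True at initialisation
print(adapter.num_trainable(), base.num_params())  # 32 64
```

In this toy setting the adapter adds 32 trainable weights next to 64 frozen ones. In a real model the frozen backbone is far larger than the adapters, which is what keeps the trainable budget small (the abstract reports about 10M trainable parameters for ST-GD).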

Related papers