Fugu-MT Paper Translation (Summary): SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Paper summary: SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

arxiv url: http://arxiv.org/abs/2603.12382v1
Date: Thu, 12 Mar 2026 18:59:57 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.732346
Title: SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
Authors: Mohamad Alansari, Naufal Suryanto, Divya Velayudhan, Sajid Javed, Naoufel Werghi, Muzammal Naseer
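Abstract summary: SPARROW is a pixel-grounded video MLLM that unifies spatial precision and temporal stability through two key components. It is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs. It delivers consistent gains across six benchmarks, improving by up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG.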
Reference score (independently computed attention metric): 39.73103140338364
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW


Related papers