Fugu-MT Paper Translation (Summary): M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

Paper summary: M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

arxiv url: http://arxiv.org/abs/2603.17813v1
Date: Wed, 18 Mar 2026 15:06:22 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.772781
Title: M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking
Title (reference translation): M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking
Authors: Qiangqiang Wu, Tianyu Yang, Bo Fang, Jia Wan, Matias Di Martino, Guillermo Sapiro, Antoni B. Chan
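Abstract summary: Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. This paper proposes Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking.
Reference score (independently computed attention metric): 57.6064636075148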
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.
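Abstract (reference translation): Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) such as DINOv2 through offline fine-tuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. This work therefore proposes Mask-to-Point (M2P) learning, which uses rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, a local structure consistency loss uses Procrustes analysis to model the cohesive motion of points within a local structure, yielding more reliable point-to-point matching learning. Second, a mask label consistency (MLC) loss enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization that stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. The weakly-supervised M2P models train efficiently on only 3.6K VOS training videos and significantly outperform the baseline VFMs. Notably, M2P achieves performance gains of 12.8% and 14.6% over DINOv2-B/14 and DINOv3-B/16, respectively, on the TAP-Vid-DAVIS benchmark. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.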
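
The abstract mentions Procrustes analysis as the tool for modeling the cohesive motion of points in a local structure. As an illustration only, the sketch below shows classical Procrustes alignment on two sets of corresponding 2D points: it fits the least-squares similarity transform (scale, rotation, translation) and reports how far each point deviates from that shared motion. This is not the authors' loss. The function names `procrustes_align` and `local_structure_residual`, the use of 2D pixel coordinates, and the choice of a similarity transform are all assumptions made for this example.

```python
import numpy as np


def procrustes_align(src, dst):
    """Fit scale s, rotation R, translation t so that s * R @ p + t ~= q.

    src, dst: (N, 2) arrays of corresponding points (hypothetical inputs,
    e.g. a small neighborhood of points in two video frames).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)

    # Center both point sets so that only rotation and scale remain.
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # SVD of the cross-covariance gives the optimal rotation
    # (orthogonal Procrustes / Kabsch solution).
    H = src_c.T @ dst_c
    U, S, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, d])
    R = Vt.T @ D @ U.T

    # Least-squares optimal isotropic scale and translation.
    scale = (S * np.diag(D)).sum() / (src_c ** 2).sum()
    t = mu_dst - scale * (R @ mu_src)
    return scale, R, t


def local_structure_residual(src, dst):
    """Per-point distance between dst and the Procrustes-aligned src.

    Small residuals mean the points move together as one coherent structure.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    scale, R, t = procrustes_align(src, dst)
    pred = scale * src @ R.T + t
    return np.linalg.norm(pred - dst, axis=1)
```

For points that really do share one similarity transform, the residuals are zero up to floating-point error. A point whose match drifts away from its neighbors shows up as a large residual, which is the kind of signal a local consistency term could penalize.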

Related papers