Fugu-MT Paper Translation (Abstract): Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

Paper overview: Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

arxiv url: http://arxiv.org/abs/2310.06992v1
Date: Tue, 10 Oct 2023 20:25:30 GMT
Status: Translation complete
Last updated in system: 2023-10-13 01:17:28.739616
Title: Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models
Title (reference translation): Zero-shot open-vocabulary tracking using large pre-trained models
Authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki
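Abstract summary: Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild. Can such large-scale pre-trained static image models be re-purposed for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos.
Reference score (independently computed attention score): 28.304047711166056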
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This raises the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box regression module of the visual detector, and prompt an open-world segmenter with the refined box to segment the objects. We decide the termination of an object track based on the objectness score of the propagated boxes, as well as forward-backward optical flow consistency. We re-identify objects across occlusions using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms previous state-of-the-art in UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.
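Abstract (reference translation): Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been the dominant paradigm for tracking objects of specific categories. Recently, large-scale pre-trained models have made promising progress in detecting and segmenting objects and parts in 2D static images in the wild. Can these large-scale pre-trained static image models be re-purposed for open-vocabulary video tracking? In this paper, we adapt an open-vocabulary detector, a segmenter, and a dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box regression module of the visual detector, and prompt an open-world segmenter with the refined box to segment the objects. We decide when to terminate an object track based on the objectness score of the propagated boxes and on forward-backward optical flow consistency. We re-identify objects across occlusions using deep feature matching. We show that our method achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms the previous state of the art on UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.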
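The flow-based box propagation and the forward-backward consistency test described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the flow layout (`flow[y][x] = (dx, dy)`), the function names, the median-shift motion model, and the thresholds `max_fb_error` and `min_consistent` are all assumptions made for illustration. The objectness check, box refinement, segmentation, and re-identification steps are omitted.

```python
from statistics import median

# Assumed layout for this sketch: flow[y][x] = (dx, dy) in pixels.

def sample_flow(flow, x, y):
    """Nearest-neighbour flow lookup, clamped to the image bounds."""
    h, w = len(flow), len(flow[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return flow[yi][xi]

def fb_error(fwd, bwd, x, y):
    """Forward-backward error: follow the forward flow, then the backward
    flow from the landing point. A consistent pixel returns to its start,
    so the two vectors should roughly cancel."""
    dx, dy = sample_flow(fwd, x, y)
    bx, by = sample_flow(bwd, x + dx, y + dy)
    return ((dx + bx) ** 2 + (dy + by) ** 2) ** 0.5

def propagate_box(box, fwd, bwd, max_fb_error=1.0, min_consistent=0.5):
    """Shift box = (x0, y0, x1, y1) by the median flow of its consistent
    pixels. Returns (new_box, alive); alive is False when too few pixels
    pass the forward-backward test (hypothetical termination rule)."""
    x0, y0, x1, y1 = box
    dxs, dys, total = [], [], 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            total += 1
            if fb_error(fwd, bwd, x, y) <= max_fb_error:
                dx, dy = sample_flow(fwd, x, y)
                dxs.append(dx)
                dys.append(dy)
    if total == 0 or len(dxs) / total < min_consistent:
        return box, False  # track would be terminated here
    mx, my = median(dxs), median(dys)
    return (x0 + mx, y0 + my, x1 + mx, y1 + my), True
```

In a full pipeline of the kind the abstract describes, the propagated box would then be passed to the detector's box regression module for refinement and used as a prompt for the segmenter; the median shift is only one simple choice of motion model.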


Related papers