Fugu-MT Paper Translation (Abstract): Unifying Tracking and Image-Video Object Detection

Paper overview: Unifying Tracking and Image-Video Object Detection

arxiv url: http://arxiv.org/abs/2211.11077v2
Date: Sun, 19 Nov 2023 23:45:09 GMT
Status: Translation complete
Last updated in system: 2023-11-22 20:51:44.163208
Title: Unifying Tracking and Image-Video Object Detection
Title (reference translation): Unification of Tracking and Video Object Detection
Authors: Peirong Liu, Rui Wang, Pengchuan Zhang, Omid Poursaeed, Yipin Zhou, Xuefei Cao, Sreya Dutta Roy, Ashish Shah, Ser-Nam Lim
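Abstract summary: TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model. To handle discrepancies and semantic overlaps among category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
Reference score (independently computed attention score): 54.91658924277527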
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Object detection (OD) has been one of the most fundamental tasks in computer vision. Recent developments in deep learning have pushed the performance of image OD to new heights through learning-based, data-driven approaches. Video OD, on the other hand, remains less explored, mostly because its data annotation is much more expensive. At the same time, multi-object tracking (MOT), which requires reasoning about track identities and spatio-temporal trajectories, is similar in spirit to video OD. However, most MOT datasets are class-specific (e.g., person-annotated only), which constrains a model's flexibility to perform tracking on other objects. We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model. To handle the discrepancies and semantic overlaps of category labels across datasets, TrIVD formulates detection/tracking as grounding and reasons about object categories via visual-text alignments. The unified formulation enables cross-dataset, multi-task training, and thus equips TrIVD with the ability to leverage frame-level features, video-level spatio-temporal relations, and track identity associations. With such joint training, we can now extend the knowledge from OD data, which comes with much richer object category annotations, to MOT and achieve zero-shot tracking capability. Experiments demonstrate that multi-task co-trained TrIVD outperforms single-task baselines across all image/video OD and MOT tasks. We further set the first baseline on the new task of zero-shot tracking.
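Abstract (reference translation): Object detection (OD) is one of the most fundamental tasks in computer vision. Recent advances in deep learning have pushed the performance of image OD to new heights through learning-based, data-driven approaches. Video OD, by contrast, has been explored less, because it needs more expensive data annotation. Multi-object tracking (MOT), which requires reasoning about track identities and spatio-temporal trajectories, is likewise similar in spirit to video OD. However, most MOT datasets are class-specific (e.g., person-annotated only), which limits a model's flexibility to track other objects. This paper proposes TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT in one end-to-end model. To handle the discrepancies and semantic overlaps of category labels across datasets, TrIVD formulates detection/tracking as grounding and reasons about object categories via visual-text alignment. The unified formulation enables cross-dataset, multi-task training, which lets TrIVD use frame-level features, video-level spatio-temporal relations, and track identity associations. With this joint training, knowledge from OD data, which has richer object category annotations, can be extended to MOT to achieve zero-shot tracking. Experiments show that the multi-task co-trained TrIVD outperforms single-task baselines on all image/video OD and MOT tasks. The paper also sets the first baseline for the new task of zero-shot tracking.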
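
The abstract says that category labels from different datasets are handled by treating detection as grounding, with categories decided through visual-text alignment. The sketch below illustrates only that general idea and is not TrIVD's implementation: the function names (cosine, ground_region), the three-dimensional toy embeddings, and the label set are all assumptions made up for illustration. A real system would take both kinds of features from learned image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ground_region(region_feature, text_embeddings):
    """Score one region feature against every category-name embedding.

    Returns the best-matching label and the full score dictionary.
    Hypothetical helper, not part of any published API.
    """
    scores = {
        label: cosine(region_feature, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy text embeddings (invented values). "person" might come from an OD
# dataset and "pedestrian" from an MOT dataset; because both live in one
# text space, overlapping labels end up close together instead of being
# unrelated class indices.
TEXT_EMBEDDINGS = {
    "person": [0.9, 0.1, 0.0],
    "pedestrian": [0.85, 0.2, 0.05],
    "car": [0.0, 0.1, 0.95],
}

# A toy visual feature for one detected region (invented values).
region = [0.1, 0.0, 1.0]
label, scores = ground_region(region, TEXT_EMBEDDINGS)
print(label, scores)
```

Because the category set is just a dictionary of name embeddings, scoring a region against a label that never appeared in the tracking data only requires adding that label's text embedding. This is one way to read the zero-shot tracking claim, under the assumptions stated above.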
