Fugu-MT paper translation (abstract): A Unified Model for Tracking and Image-Video Detection Has More Power

Paper overview: A Unified Model for Tracking and Image-Video Detection Has More Power

arxiv url: http://arxiv.org/abs/2211.11077v1
Date: Sun, 20 Nov 2022 20:30:28 GMT
Status: translation complete
Last updated in system: 2022-11-22 20:32:44.491467
Title: A Unified Model for Tracking and Image-Video Detection Has More Power
Title (reference translation): A Unified Model for Tracking and Image-Video Detection
Authors: Peirong Liu, Rui Wang, Pengchuan Zhang, Omid Poursaeed, Yipin Zhou, Xuefei Cao, Sreya Dutta Roy, Ashish Shah, Ser-Nam Lim
Abstract summary: TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model. TrIVD achieves state-of-the-art performance across all image/video OD and MOT tasks.
Reference score (independently computed attention score): 37.070549984457145
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Object detection (OD) has been one of the most fundamental tasks in computer vision. Recent developments in deep learning have pushed the performance of image OD to new heights through learning-based, data-driven approaches. Video OD, on the other hand, remains less explored, mostly because its data annotation is much more expensive. At the same time, multi-object tracking (MOT), which requires reasoning about track identities and spatio-temporal trajectories, is similar in spirit to video OD. However, most MOT datasets are class-specific (e.g., person-annotated only), which constrains a model's flexibility to perform tracking on other objects. We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model. To handle the discrepancies and semantic overlaps across datasets, TrIVD formulates detection/tracking as grounding and reasons about object categories via visual-text alignments. The unified formulation enables cross-dataset, multi-task training, and thus equips TrIVD with the ability to leverage frame-level features, video-level spatio-temporal relations, and track identity associations. With such joint training, we can now extend the knowledge from OD data, which comes with much richer object category annotations, to MOT and achieve zero-shot tracking capability. Experiments demonstrate that TrIVD achieves state-of-the-art performance across all image/video OD and MOT tasks.
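Abstract (reference translation): Object detection (OD) is one of the most fundamental tasks in computer vision. Recent advances in deep learning have pushed the performance of image OD to new heights through learning-based, data-driven approaches. Video OD, by contrast, has been explored less, because its data annotation is more expensive. At the same time, multi-object tracking (MOT), which requires reasoning about track identities and spatio-temporal trajectories, shares a similar spirit with video OD. However, most MOT datasets are class-specific (for example, person-annotated only), which limits a model's flexibility to track other objects. This paper proposes TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT in one end-to-end model. To handle discrepancies and semantic overlaps across datasets, TrIVD formulates detection/tracking as grounding and reasons about object categories via visual-text alignment. The unified formulation enables cross-dataset, multi-task training, which lets TrIVD leverage frame-level features, video-level spatio-temporal relations, and track identity associations. With such joint training, knowledge from OD data, which comes with richer object category annotations, can be extended to MOT to achieve zero-shot tracking capability. Experiments show that TrIVD achieves state-of-the-art performance on all image/video OD and MOT tasks.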
