Fugu-MT paper translation (abstract): Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

Paper abstract: Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

arxiv url: http://arxiv.org/abs/2312.17686v2
Date: Thu, 23 May 2024 15:52:11 GMT
Status: translation complete
Last updated in system: 2024-05-25 11:46:15.499329
Title: Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization
Title (reference translation): Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-Stage Action Localization
Authors: Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos
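Abstract summary: Action localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S trained with RoI align on pre-computed bounding boxes.
Reference score (independently computed attention score): 27.472705540825316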
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution, and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand, single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, compromising performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur high complexity. In this paper, we observe that a direct bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can do both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our Bipartite-Matching Vision Transformer model, BMViT, achieves +3 mAP on AVA2.2 w.r.t. the two-stage MViTv2-S counterpart. Code is available at https://github.com/IoannaNti/BMViT
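Abstract (reference translation): Action localization is a challenging problem that combines detection and recognition tasks. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. Single-stage methods, on the other hand, target both tasks by having part of the network (generally the backbone) share the majority of the workload, trading performance for speed. These methods are built by adding a DETR head with learnable queries which, after cross- and self-attention, are sent to corresponding MLPs to detect a person's bounding box and action. However, DETR-like architectures are difficult to train and can incur high complexity. In this paper, we observe that a direct bipartite matching loss can be applied to the output tokens of a vision transformer. This yields a backbone + MLP architecture that performs both tasks without an extra encoder-decoder head or learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the training pipeline, our Bipartite-Matching Vision Transformer model, BMViT, achieves +3 mAP on AVA2.2 relative to the two-stage MViTv2-S counterpart. Code is available at https://github.com/IoannaNti/BMViT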
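
As a rough illustration of the bipartite matching step the abstract describes, the sketch below assigns ground-truth person boxes to output tokens one-to-one by minimizing a total cost. The cost terms (a weighted L1 box distance minus a per-token person probability), the `box_weight` value, all function names, and the toy numbers are assumptions made for this example; they are not taken from the paper or the BMViT repository. The search is brute force over permutations, which is only practical for tiny inputs; an efficient assignment solver such as the Hungarian algorithm would normally take its place.

```python
from itertools import permutations


def l1_box_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two (x1, y1, x2, y2) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_targets_to_tokens(token_boxes, token_person_probs, target_boxes,
                            box_weight=5.0):
    """Return {target_index: token_index} for the minimum-cost one-to-one matching.

    Hypothetical cost for pairing a token with a target (an assumption of this
    sketch): box_weight * L1(token box, target box) - token person probability.
    """
    num_tokens = len(token_boxes)
    num_targets = len(target_boxes)
    if num_targets > num_tokens:
        raise ValueError("need at least as many tokens as targets")

    def pair_cost(token_idx, target_idx):
        box_term = l1_box_distance(token_boxes[token_idx], target_boxes[target_idx])
        return box_weight * box_term - token_person_probs[token_idx]

    best_cost = float("inf")
    best_assignment = ()
    # Each permutation picks a distinct token for every target, in target order.
    for token_ids in permutations(range(num_tokens), num_targets):
        total = sum(pair_cost(tok, tgt) for tgt, tok in enumerate(token_ids))
        if total < best_cost:
            best_cost, best_assignment = total, token_ids
    return {tgt: tok for tgt, tok in enumerate(best_assignment)}


# Toy example: four output tokens, two annotated people (normalized coordinates).
token_boxes = [
    (0.10, 0.10, 0.40, 0.90),
    (0.50, 0.20, 0.90, 0.80),
    (0.00, 0.00, 1.00, 1.00),
    (0.45, 0.15, 0.85, 0.85),
]
token_person_probs = [0.9, 0.8, 0.1, 0.3]
target_boxes = [
    (0.50, 0.20, 0.90, 0.80),
    (0.10, 0.10, 0.40, 0.90),
]

matching = match_targets_to_tokens(token_boxes, token_person_probs, target_boxes)
# Tokens left without a target; in a set-prediction loss these would typically
# be supervised as "no person" (again an assumption, not a detail from the abstract).
unmatched = sorted(set(range(len(token_boxes))) - set(matching.values()))

print(matching)   # {0: 1, 1: 0}
print(unmatched)  # [2, 3]
```

In this toy setup, matched tokens would receive the box and action losses for their assigned person, which is what lets plain backbone tokens stand in for the learnable queries of a DETR-style head.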

Related papers list