Fugu-MT Paper Translation (Abstract): An End-to-End Framework for Video Multi-Person Pose Estimation

Paper overview: An End-to-End Framework for Video Multi-Person Pose Estimation

arxiv url: http://arxiv.org/abs/2509.01095v1
Date: Mon, 01 Sep 2025 03:34:57 GMT
Status: Translation complete
System update date: 2025-09-04 15:17:03.539417
Title: An End-to-End Framework for Video Multi-Person Pose Estimation
Title (reference translation): An End-to-End Framework for Video Multi-Person Pose Estimation
Authors: Zhihong Wei,
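Abstract summary: This paper proposes VEPE, a simple and flexible framework for end-to-end pose estimation in video. The proposed approach outperforms most two-stage models and improves inference efficiency by 300%.
Reference score (independently computed attention metric): 3.090225730976977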
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video-based human pose estimation models aim to address scenarios that cannot be effectively solved by static image models such as motion blur, out-of-focus and occlusion. Most existing approaches consist of two stages: detecting human instances in each image frame and then using a temporal model for single-person pose estimation. This approach separates the spatial and temporal dimensions and cannot capture the global spatio-temporal context between spatial instances for end-to-end optimization. In addition, it relies on separate detectors and complex post-processing such as RoI cropping and NMS, which reduces the inference efficiency of the video scene. To address the above problems, we propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. The framework utilizes three crucial spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). These components are designed to effectively utilize temporal context for optimizing human body pose estimation. Furthermore, to reduce the mismatch problem during the cross-frame pose query matching process, we propose an instance consistency mechanism, which aims to enhance the consistency and discrepancy of the cross-frame instance query and realize the instance tracking function, which in turn accurately guides the pose query to perform cross-frame matching. Extensive experiments on the Posetrack dataset show that our approach outperforms most two-stage models and improves inference efficiency by 300%.
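Abstract (reference translation): Video-based human pose estimation models aim to handle scenarios that static image models cannot solve effectively, such as motion blur, out-of-focus frames, and occlusion. Most existing approaches consist of two stages: detecting human instances in each image frame, then applying a temporal model for single-person pose estimation. This design separates the spatial and temporal dimensions and cannot capture the global spatio-temporal context between spatial instances for end-to-end optimization. It also relies on separate detectors and complex post-processing such as RoI cropping and NMS, which reduces inference efficiency on video. To address these problems, the authors propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. The framework uses three key spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). These components are designed to make effective use of temporal context when optimizing human pose estimation. In addition, to reduce mismatches during cross-frame pose query matching, the authors propose an instance consistency mechanism that enhances the consistency and discrepancy of cross-frame instance queries and realizes an instance tracking function, which in turn accurately guides the pose queries in cross-frame matching. Extensive experiments on the Posetrack dataset show that the approach outperforms most two-stage models and improves inference efficiency by 300%.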
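The abstract does not say how cross-frame instance queries are compared, so the following is only a minimal sketch of the general idea of matching instances across frames, not VEPE's actual instance consistency mechanism. It assumes each person instance in a frame is represented by a query embedding (a plain list of floats here) and pairs queries greedily by cosine similarity. The function names `cosine` and `match_instances`, the `threshold` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_instances(prev_queries, curr_queries, threshold=0.5):
    """Greedily pair current-frame queries with previous-frame queries.

    Hypothetical illustration: returns {curr_index: prev_index}. Each query is
    used at most once; pairs scoring below `threshold` are left unmatched.
    """
    # Score every (current, previous) pair, then accept from most to least similar.
    scored = sorted(
        (
            (cosine(c, p), ci, pi)
            for ci, c in enumerate(curr_queries)
            for pi, p in enumerate(prev_queries)
        ),
        reverse=True,
    )
    matches, used_prev = {}, set()
    for score, ci, pi in scored:
        if score < threshold:
            break  # all remaining pairs are even less similar
        if ci in matches or pi in used_prev:
            continue  # one of the two queries is already paired
        matches[ci] = pi
        used_prev.add(pi)
    return matches


# Toy example: the two people swap positions in the query list between frames.
prev_frame = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]
curr_frame = [[0.1, 0.9, 0.1], [0.9, 0.1, 0.1]]
print(match_instances(prev_frame, curr_frame))
```

In this toy case, current query 0 pairs with previous query 1 and current query 1 with previous query 0, so identity follows the embedding rather than the list position. A learned model would typically train the embeddings so that queries for the same person stay similar across frames and queries for different people stay apart, which is one reading of the "consistency and discrepancy" wording in the abstract.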

Related papers