Combining detection and tracking for human pose estimation in videos
- URL: http://arxiv.org/abs/2003.13743v1
- Date: Mon, 30 Mar 2020 18:45:31 GMT
- Title: Combining detection and tracking for human pose estimation in videos
- Authors: Manchen Wang, Joseph Tighe, Davide Modolo
- Abstract summary: We propose a novel top-down approach that tackles the problem of multi-person human pose estimation and tracking in videos.
Our method is not limited by the performance of its person detector and can predict the poses of person instances that the detector fails to localize.
Our approach achieves state-of-the-art results on joint detection and tracking on the PoseTrack 2017 and 2018 datasets, outperforming all top-down and bottom-up approaches.
- Score: 18.851860324105637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel top-down approach that tackles the problem of multi-person
human pose estimation and tracking in videos. In contrast to existing top-down
approaches, our method is not limited by the performance of its person detector
and can predict the poses of person instances that the detector fails to localize. It achieves this
capability by propagating known person locations forward and backward in time
and searching for poses in those regions. Our approach consists of three
components: (i) a Clip Tracking Network that performs body joint detection and
tracking simultaneously on small video clips; (ii) a Video Tracking Pipeline
that merges the fixed-length tracklets produced by the Clip Tracking Network into
arbitrary-length tracks; and (iii) a Spatial-Temporal Merging procedure that
refines the joint locations based on spatial and temporal smoothing terms.
Thanks to the precision of our Clip Tracking Network and our merging procedure,
our approach produces very accurate joint predictions and can fix common
mistakes on hard scenarios like heavily entangled people. Our approach achieves
state-of-the-art results on both joint detection and tracking, on both the
PoseTrack 2017 and 2018 datasets, and against all top-down and bottom-up
approaches.
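The merging steps described above can be illustrated with a minimal sketch: chain fixed-length tracklets whose joint predictions agree on shared frames, then smooth the joint locations over time. This is not the authors' implementation; the greedy linking rule, averaging in overlaps, and moving-average smoothing are all simplifying assumptions standing in for the paper's actual merging and smoothing terms.

```python
import numpy as np

def tracklet_agreement(t1, t2):
    """Mean per-joint distance over the frames both tracklets cover."""
    shared = sorted(set(t1) & set(t2))
    if not shared:
        return np.inf
    return float(np.mean([np.linalg.norm(t1[f] - t2[f], axis=-1).mean()
                          for f in shared]))

def merge_tracklets(tracklets, max_dist=10.0):
    """Greedily chain fixed-length tracklets into longer tracks.

    Each tracklet maps frame index -> (num_joints, 2) array of joint locations.
    Tracklets that agree on their overlapping frames are merged, averaging
    the joint estimates where the tracklets overlap.
    """
    tracks = []
    for t in tracklets:
        for track in tracks:
            if tracklet_agreement(track, t) < max_dist:
                for f, joints in t.items():
                    if f in track:
                        track[f] = (track[f] + joints) / 2  # average in overlap
                    else:
                        track[f] = joints
                break
        else:
            tracks.append(dict(t))
    return tracks

def temporal_smooth(track, window=3):
    """Moving-average smoothing of joint locations along a track."""
    frames = sorted(track)
    arr = np.stack([track[f] for f in frames])  # (T, num_joints, 2)
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), 0, arr)
    return {f: smoothed[i] for i, f in enumerate(frames)}
```

Two clip-level tracklets that overlap on one frame would be chained into a single track spanning all their frames, after which `temporal_smooth` suppresses frame-to-frame jitter in the joint coordinates.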
Related papers
- Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction [3.069335774032178]
We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons.
arXiv Detail & Related papers (2023-12-28T16:30:05Z)
- Improving Multi-Person Pose Tracking with A Confidence Network [37.84514614455588]
We develop a novel keypoint confidence network and a tracking pipeline to improve human detection and pose estimation.
Specifically, the keypoint confidence network is designed to determine whether each keypoint is occluded.
In the tracking pipeline, we propose the Bbox-revision module to reduce missing detection and the ID-retrieve module to correct lost trajectories.
arXiv Detail & Related papers (2023-10-29T06:36:27Z)
- Multi-view Tracking Using Weakly Supervised Human Motion Prediction [60.972708589814125]
We argue that an even more effective approach is to predict people's motion over time and to infer their presence in individual frames from these predictions.
This makes it possible to enforce consistency both over time and across views within a single temporal frame.
We validate our approach on the PETS2009 and WILDTRACK datasets and demonstrate that it outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T17:58:23Z)
- Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet [24.852728097115744]
Multi-person pose understanding from RGB involves three complex tasks: pose estimation, tracking and motion forecasting.
Most existing works either focus on a single task or employ multi-stage approaches that solve the tasks separately.
We propose Snipper, a unified framework to perform multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single stage.
arXiv Detail & Related papers (2022-07-09T18:42:14Z)
- Dual networks based 3D Multi-Person Pose Estimation from Monocular Video [42.01876518017639]
Multi-person 3D pose estimation is more challenging than single pose estimation.
Existing top-down and bottom-up approaches to pose estimation suffer from detection errors.
We propose the integration of top-down and bottom-up approaches to exploit their strengths.
arXiv Detail & Related papers (2022-05-02T08:53:38Z)
- Cross-Camera Trajectories Help Person Retrieval in a Camera Network [124.65912458467643]
Existing methods often rely on purely visual matching or consider temporal constraints but ignore the spatial information of the camera network.
We propose a pedestrian retrieval framework based on cross-camera generation, which integrates both temporal and spatial information.
To verify the effectiveness of our method, we construct the first cross-camera pedestrian trajectory dataset.
arXiv Detail & Related papers (2022-04-27T13:10:48Z)
- On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi-supervised and weakly supervised learning.
We propose to impose multi-view geometric constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available.
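The triangulation this summary refers to can be illustrated with the standard two-view Direct Linear Transform (DLT); this is a generic sketch of that classical method, not the paper's differentiable training-time implementation, and the function name and calibration setup are assumptions.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two calibrated views via the DLT.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: corresponding 2D image points (x, y) in each view.
    Each correspondence x = P X (up to scale) contributes two linear
    constraints on the homogeneous 3D point X.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

In the self-supervised setting described above, the triangulated point would serve as a pseudo-label for the per-view 2D predictions, which requires the operation to be differentiable; the SVD-based solve shown here is the non-differentiable textbook variant.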
arXiv Detail & Related papers (2022-03-29T19:11:54Z)
- Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences.
A temporal assessment network is proposed which is able to capture the temporal coherence of target locations.
A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z)
- Rank-based verification for long-term face tracking in crowded scenes [0.0]
We present a long-term, multi-face tracking architecture conceived for working in crowded contexts.
Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking.
arXiv Detail & Related papers (2021-07-28T11:15:04Z)
- Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking [98.91894395941766]
We propose a novel online approach to learning the pose dynamics, which are independent of pose detections in the current frame.
Specifically, we derive this prediction of dynamics through a graph neural network (GNN) that explicitly accounts for both spatial-temporal and visual information.
Experiments on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
arXiv Detail & Related papers (2021-06-07T16:36:50Z)
- From Planes to Corners: Multi-Purpose Primitive Detection in Unorganized 3D Point Clouds [59.98665358527686]
We propose a new method for segmentation-free joint estimation of orthogonal planes.
Such unified scene exploration allows for multitudes of applications such as semantic plane detection or local and global scan alignment.
Our experiments demonstrate the validity of our approach in numerous scenarios from wall detection to 6D tracking.
arXiv Detail & Related papers (2020-01-21T06:51:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.