Combining detection and tracking for human pose estimation in videos
- URL: http://arxiv.org/abs/2003.13743v1
- Date: Mon, 30 Mar 2020 18:45:31 GMT
- Title: Combining detection and tracking for human pose estimation in videos
- Authors: Manchen Wang, Joseph Tighe, Davide Modolo
- Abstract summary: We propose a novel top-down approach that tackles the problem of multi-person human pose estimation and tracking in videos.
Our method is not limited by the performance of its person detector and can predict the poses of person instances that the detector fails to localize.
Our approach achieves state-of-the-art results on joint detection and tracking on the PoseTrack 2017 and 2018 datasets, outperforming all top-down and bottom-up approaches.
- Score: 18.851860324105637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel top-down approach that tackles the problem of multi-person
human pose estimation and tracking in videos. In contrast to existing top-down
approaches, our method is not limited by the performance of its person detector
and can predict the poses of person instances that the detector fails to localize. It achieves this
capability by propagating known person locations forward and backward in time
and searching for poses in those regions. Our approach consists of three
components: (i) a Clip Tracking Network that performs body joint detection and
tracking simultaneously on small video clips; (ii) a Video Tracking Pipeline
that merges the fixed-length tracklets produced by the Clip Tracking Network into
arbitrary-length tracks; and (iii) a Spatial-Temporal Merging procedure that
refines the joint locations based on spatial and temporal smoothing terms.
Thanks to the precision of our Clip Tracking Network and our merging procedure,
our approach produces very accurate joint predictions and can fix common
mistakes on hard scenarios like heavily entangled people. Our approach achieves
state-of-the-art results on both joint detection and tracking, on both the
PoseTrack 2017 and 2018 datasets, and against all top-down and bottom-up
approaches.
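The merging steps described above can be illustrated with a minimal sketch: chain fixed-length tracklets whose joint predictions agree on shared frames, then smooth the joint locations over time. This is not the authors' implementation; the greedy linking rule, averaging in overlaps, and moving-average smoothing are all simplifying assumptions standing in for the paper's actual merging and smoothing terms.

```python
import numpy as np

def tracklet_agreement(t1, t2):
    """Mean per-joint distance over the frames both tracklets cover."""
    shared = sorted(set(t1) & set(t2))
    if not shared:
        return np.inf
    return float(np.mean([np.linalg.norm(t1[f] - t2[f], axis=-1).mean()
                          for f in shared]))

def merge_tracklets(tracklets, max_dist=10.0):
    """Greedily chain fixed-length tracklets into longer tracks.

    Each tracklet maps frame index -> (num_joints, 2) array of joint locations.
    Tracklets that agree on their overlapping frames are merged, averaging
    the joint estimates where the tracklets overlap.
    """
    tracks = []
    for t in tracklets:
        for track in tracks:
            if tracklet_agreement(track, t) < max_dist:
                for f, joints in t.items():
                    if f in track:
                        track[f] = (track[f] + joints) / 2  # average in overlap
                    else:
                        track[f] = joints
                break
        else:
            tracks.append(dict(t))
    return tracks

def temporal_smooth(track, window=3):
    """Moving-average smoothing of joint locations along a track."""
    frames = sorted(track)
    arr = np.stack([track[f] for f in frames])  # (T, num_joints, 2)
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), 0, arr)
    return {f: smoothed[i] for i, f in enumerate(frames)}
```

Two clip-level tracklets that overlap on one frame would be chained into a single track spanning all their frames, after which `temporal_smooth` suppresses frame-to-frame jitter in the joint coordinates.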
Related papers
- Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction [3.069335774032178]
We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons.
arXiv Detail & Related papers (2023-12-28T16:30:05Z)
- Improving Multi-Person Pose Tracking with A Confidence Network [37.84514614455588]
We develop a novel keypoint confidence network and a tracking pipeline to improve human detection and pose estimation.
Specifically, the keypoint confidence network is designed to determine whether each keypoint is occluded.
In the tracking pipeline, we propose the Bbox-revision module to reduce missing detection and the ID-retrieve module to correct lost trajectories.
arXiv Detail & Related papers (2023-10-29T06:36:27Z)
- Multi-view Tracking Using Weakly Supervised Human Motion Prediction [60.972708589814125]
We argue that an even more effective approach is to predict people's motion over time and to infer their presence in individual frames from these predictions.
This makes it possible to enforce consistency both over time and across views within a single temporal frame.
We validate our approach on the PETS2009 and WILDTRACK datasets and demonstrate that it outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T17:58:23Z)
- Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet [24.852728097115744]
Multi-person pose understanding from RGB involves three complex tasks: pose estimation, tracking and motion forecasting.
Most existing works either focus on a single task or employ multi-stage approaches that solve the tasks separately.
We propose Snipper, a unified framework to perform multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single stage.
arXiv Detail & Related papers (2022-07-09T18:42:14Z)
- Dual networks based 3D Multi-Person Pose Estimation from Monocular Video [42.01876518017639]
Multi-person 3D pose estimation is more challenging than single pose estimation.
Existing top-down and bottom-up approaches to pose estimation suffer from detection errors.
We propose the integration of top-down and bottom-up approaches to exploit their strengths.
arXiv Detail & Related papers (2022-05-02T08:53:38Z)
- Cross-Camera Trajectories Help Person Retrieval in a Camera Network [124.65912458467643]
Existing methods often rely on purely visual matching or consider temporal constraints but ignore the spatial information of the camera network.
We propose a pedestrian retrieval framework based on cross-camera generation, which integrates both temporal and spatial information.
To verify the effectiveness of our method, we construct the first cross-camera pedestrian trajectory dataset.
arXiv Detail & Related papers (2022-04-27T13:10:48Z)
- On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi-supervised and weakly supervised learning.
We propose to impose multi-view geometric constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available.
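The triangulation this summary refers to can be illustrated with the standard two-view Direct Linear Transform (DLT); this is a generic sketch of that classical method, not the paper's differentiable training-time implementation, and the function name and calibration setup are assumptions.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two calibrated views via the DLT.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: corresponding 2D image points (x, y) in each view.
    Each correspondence x = P X (up to scale) contributes two linear
    constraints on the homogeneous 3D point X.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

In the self-supervised setting described above, the triangulated point would serve as a pseudo-label for the per-view 2D predictions, which requires the operation to be differentiable; the SVD-based solve shown here is the non-differentiable textbook variant.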
arXiv Detail & Related papers (2022-03-29T19:11:54Z)
- Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences.
A temporal assessment network is proposed which is able to capture the temporal coherence of target locations.
A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z)
- Rank-based verification for long-term face tracking in crowded scenes [0.0]
We present a long-term, multi-face tracking architecture conceived for working in crowded contexts.
Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking.
arXiv Detail & Related papers (2021-07-28T11:15:04Z)
- Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking [98.91894395941766]
We propose a novel online approach to learning the pose dynamics, which are independent of pose detections in the current frame.
Specifically, we derive this prediction of dynamics through a graph neural network (GNN) that explicitly accounts for both spatial-temporal and visual information.
Experiments on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
arXiv Detail & Related papers (2021-06-07T16:36:50Z)
- From Planes to Corners: Multi-Purpose Primitive Detection in Unorganized 3D Point Clouds [59.98665358527686]
We propose a new method for segmentation-free joint estimation of orthogonal planes.
Such unified scene exploration allows for multitudes of applications such as semantic plane detection or local and global scan alignment.
Our experiments demonstrate the validity of our approach in numerous scenarios from wall detection to 6D tracking.
arXiv Detail & Related papers (2020-01-21T06:51:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.