End-to-End Trainable Multi-Instance Pose Estimation with Transformers
- URL: http://arxiv.org/abs/2103.12115v1
- Date: Mon, 22 Mar 2021 18:19:22 GMT
- Title: End-to-End Trainable Multi-Instance Pose Estimation with Transformers
- Authors: Lucas Stoffl and Maxime Vidal and Alexander Mathis
- Abstract summary: We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
- Score: 68.93512627479197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new end-to-end trainable approach for multi-instance pose
estimation by combining a convolutional neural network with a transformer. We
cast multi-instance pose estimation from images as a direct set prediction
problem. Inspired by recent work on end-to-end trainable object detection with
transformers, we use a transformer encoder-decoder architecture together with a
bipartite matching scheme to directly regress the pose of all individuals in a
given image. Our model, called POse Estimation Transformer (POET), is trained
using a novel set-based global loss that consists of a keypoint loss, a
keypoint visibility loss, a center loss and a class loss. POET reasons about
the relations between detected humans and the full image context to directly
predict the poses in parallel. We show that POET can achieve high accuracy on
the challenging COCO keypoint detection task. To the best of our knowledge,
this model is the first end-to-end trainable multi-instance human pose
estimation method.
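The abstract describes the core training mechanism: pose candidates produced by transformer decoder queries are matched one-to-one to ground-truth instances via bipartite matching, and a composite loss (keypoint, keypoint visibility, center, class) is summed over the matched pairs. Below is a minimal illustrative sketch of that idea; the function name, cost terms, shapes, and loss weights are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only (not the POET code): bipartite matching between
# predicted and ground-truth poses, followed by a set-based loss with
# keypoint, visibility, center and class terms. Shapes and weights are
# assumptions for exposition.
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_based_pose_loss(pred_kpts, pred_vis, pred_center, pred_cls,
                        gt_kpts, gt_vis, gt_center,
                        w_kpt=1.0, w_vis=0.5, w_ctr=1.0, w_cls=1.0):
    """pred_kpts: (Q, K, 2), pred_vis: (Q, K) in (0, 1), pred_center: (Q, 2),
    pred_cls: (Q,) 'person' probability; gt_*: same layouts with N instances."""
    Q, N = pred_kpts.shape[0], gt_kpts.shape[0]
    eps = 1e-7

    # Pairwise matching cost: keypoint L1 + center L1 - class confidence.
    cost = np.zeros((Q, N))
    for q in range(Q):
        for n in range(N):
            cost[q, n] = (w_kpt * np.abs(pred_kpts[q] - gt_kpts[n]).mean()
                          + w_ctr * np.abs(pred_center[q] - gt_center[n]).mean()
                          - w_cls * pred_cls[q])

    # Optimal one-to-one assignment of decoder queries to ground-truth people.
    rows, cols = linear_sum_assignment(cost)

    loss = 0.0
    for q, n in zip(rows, cols):
        loss += w_kpt * np.abs(pred_kpts[q] - gt_kpts[n]).mean()        # keypoint loss
        loss += w_ctr * np.abs(pred_center[q] - gt_center[n]).mean()    # center loss
        loss += -w_vis * np.mean(gt_vis[n] * np.log(pred_vis[q] + eps)  # visibility loss
                                 + (1 - gt_vis[n]) * np.log(1 - pred_vis[q] + eps))
        loss += -w_cls * np.log(pred_cls[q] + eps)                      # class loss (matched)
    for q in set(range(Q)) - set(rows):                                 # unmatched queries
        loss += -w_cls * np.log(1 - pred_cls[q] + eps)                  # pushed toward 'no person'
    return loss / max(Q, 1)
```

Because each query is assigned to at most one ground-truth person, the poses can be predicted in parallel without non-maximum suppression or heatmap decoding, which is what makes the pipeline end-to-end trainable.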
Related papers
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual
Object Detection [78.2325219839805]
imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z) - BTranspose: Bottleneck Transformers for Human Pose Estimation with
Self-Supervised Pre-Training [0.304585143845864]
In this paper, we consider the recently proposed Bottleneck Transformers, which combine CNN and multi-head self attention (MHSA) layers effectively.
We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
arXiv Detail & Related papers (2022-04-21T15:45:05Z) - Poseur: Direct Human Pose Regression with Transformers [119.79232258661995]
We propose a direct, regression-based approach to 2D human pose estimation from single images.
Our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints.
Ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
arXiv Detail & Related papers (2022-01-19T04:31:57Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Applied to the recent transformer-based image recognition model ViT, the approach also shows consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Transformer Transforms Salient Object Detection and Camouflaged Object
Detection [43.79585695098729]
We investigate applying transformer networks to salient object detection (SOD).
Specifically, we adopt a dense transformer backbone for fully supervised RGB image-based SOD, RGB-D image-pair-based SOD, and weakly supervised SOD via scribble supervision.
As an extension, we also apply our fully supervised model to the task of camouflaged object detection (COD) for camouflaged object segmentation.
arXiv Detail & Related papers (2021-04-20T17:12:51Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
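At training time, DETR's direct set prediction reduces to an optimal one-to-one assignment between query predictions and ground-truth objects, after which classification and box losses are computed over the matched pairs. The following is a minimal matching sketch with random placeholder tensors; DETR's actual cost additionally includes a generalized-IoU term.

```python
# Illustrative only: one-to-one matching of query predictions to ground truth,
# the core of a set-based detection loss. Values are random placeholders.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
pred_boxes = rng.random((5, 4))    # 5 query predictions as (cx, cy, w, h)
pred_probs = rng.random(5)         # predicted probability of the target class
gt_boxes = rng.random((2, 4))      # 2 ground-truth boxes

# Matching cost: L1 box distance minus class confidence for the true class.
cost = (np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
        - pred_probs[:, None])
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # unique prediction-object pairs
```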
This list is automatically generated from the titles and abstracts of the papers on this site.