BTranspose: Bottleneck Transformers for Human Pose Estimation with
Self-Supervised Pre-Training
- URL: http://arxiv.org/abs/2204.10209v1
- Date: Thu, 21 Apr 2022 15:45:05 GMT
- Title: BTranspose: Bottleneck Transformers for Human Pose Estimation with
Self-Supervised Pre-Training
- Authors: Kaushik Balakrishnan, Devesh Upadhyay
- Abstract summary: In this paper, we consider the recently proposed Bottleneck Transformers, which effectively combine CNN and multi-head self-attention (MHSA) layers.
We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] while using fewer network parameters.
- Score: 0.304585143845864
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The task of 2D human pose estimation is challenging as the number of
keypoints is typically large (~ 17) and this necessitates the use of robust
neural network architectures and training pipelines that can capture the
relevant features from the input image. These features are then aggregated to
make accurate heatmap predictions from which the final keypoints of human body
parts can be inferred. Many papers in the literature use CNN-based
architectures for the backbone and/or combine them with a transformer, after
which the features are aggregated to make the final keypoint predictions [1].
In this paper, we consider the recently proposed Bottleneck Transformers [2],
which effectively combine CNN and multi-head self-attention (MHSA) layers; we
integrate them with a Transformer encoder and apply the resulting architecture
to 2D human pose estimation. We consider different backbone architectures and
pre-train them using the DINO self-supervised learning method [3]; this
pre-training is found to improve overall prediction accuracy. We call our model BTranspose,
and experiments show that on the COCO validation set, our model achieves an AP
of 76.4, which is competitive with other methods such as [1] while using fewer
network parameters. Furthermore, we present the dependencies of the final
predicted keypoints on both the MHSA block and the Transformer encoder layers,
providing clues about the image sub-regions the network attends to at mid and
high levels.
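
To make the architecture concrete, below is a minimal PyTorch-style sketch of the two ingredients the abstract describes: a Bottleneck Transformer block in the spirit of [2] (a ResNet bottleneck whose 3x3 convolution is replaced by multi-head self-attention over the spatial grid) and the standard argmax read-out of keypoints from heatmaps. Names and shapes are illustrative assumptions, and the relative position encodings of [2] are omitted; this is not the paper's implementation.

    import torch
    import torch.nn as nn

    class BotBlock(nn.Module):
        # ResNet-style bottleneck with MHSA in place of the 3x3 convolution.
        def __init__(self, channels: int, heads: int = 4):
            super().__init__()
            inner = channels // 4  # must be divisible by `heads`
            self.reduce = nn.Sequential(
                nn.Conv2d(channels, inner, 1, bias=False),
                nn.BatchNorm2d(inner), nn.ReLU(inplace=True))
            self.attn = nn.MultiheadAttention(inner, heads, batch_first=True)
            self.expand = nn.Sequential(
                nn.Conv2d(inner, channels, 1, bias=False),
                nn.BatchNorm2d(channels))
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            b, c, h, w = x.shape
            y = self.reduce(x)                    # (B, C/4, H, W)
            seq = y.flatten(2).transpose(1, 2)    # (B, H*W, C/4): one token per pixel
            seq, _ = self.attn(seq, seq, seq)     # global self-attention over the grid
            y = seq.transpose(1, 2).reshape(b, -1, h, w)
            return self.relu(x + self.expand(y))  # residual connection

    def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
        # Each keypoint is inferred as the argmax location of its heatmap.
        b, k, h, w = heatmaps.shape
        idx = heatmaps.flatten(2).argmax(dim=-1)         # (B, K)
        return torch.stack((idx % w, idx // w), dim=-1)  # (B, K, 2) as (x, y)

Pre-training the backbone with DINO [3] changes only the training objective (a self-distillation loss between student and teacher views of the same image), not this block structure.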
Related papers
- Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry [1.2289361708127877]
We propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry.
The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference.
arXiv Detail & Related papers (2024-09-13T12:21:25Z)
- Scalable Property Valuation Models via Graph-based Deep Learning [5.172964916120902]
We develop two novel graph neural network models that effectively identify sequences of neighboring houses with similar features.
We show that employing tailored graph neural networks significantly improves the accuracy of house price prediction.
arXiv Detail & Related papers (2024-05-10T15:54:55Z)
- Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers [50.576354045312115]
Direct image-to-graph transformation is a challenging task that solves object detection and relationship prediction in a single model.
We introduce a set of methods enabling cross-domain and cross-dimension transfer learning for image-to-graph transformers.
We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we pretrain our models on 2D satellite images before applying them to vastly different target domains in 2D and 3D.
arXiv Detail & Related papers (2024-03-11T10:48:56Z)
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z)
- ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
arXiv Detail & Related papers (2022-02-23T11:11:54Z)
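
As a rough illustration of the patch-embedding step in the ProFormer entry above, a skeleton sequence can be treated as an image-like array and cut into tokens. The shapes and layer sizes below are invented for this example and are not taken from the paper:

    import torch
    import torch.nn as nn

    # A (frames, joints) skeleton sequence with xyz coordinates as channels,
    # treated like a 3-channel image (hypothetical 64 frames, 25 joints).
    seq = torch.randn(1, 3, 64, 25)

    # Non-overlapping patches become the tokens fed to the transformer backbone.
    patch_embed = nn.Conv2d(3, 192, kernel_size=(8, 5), stride=(8, 5))
    tokens = patch_embed(seq).flatten(2).transpose(1, 2)   # (1, 40, 192)

    # A deep metric loss (e.g., triplet) would then act on a pooled embedding.
    embedding = tokens.mean(dim=1)                          # (1, 192)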
- Swin-Pose: Swin Transformer Based Human Pose Estimation [16.247836509380026]
Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks.
However, CNNs have a fixed receptive field and lack the ability of long-range perception, which is crucial to human pose estimation.
We propose a novel model based on transformer architecture, enhanced with a feature pyramid fusion structure.
arXiv Detail & Related papers (2022-01-19T02:15:26Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Locally Aware Piecewise Transformation Fields for 3D Human Mesh Registration [67.69257782645789]
We propose piecewise transformation fields that learn 3D translation vectors to map any query point in posed space to its corresponding position in rest-pose space.
We show that fitting parametric models with poses initialized by our network results in much better registration quality, especially for extreme poses.
arXiv Detail & Related papers (2021-04-16T15:16:09Z)
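
A toy version of the transformation-field idea in the entry above, with a single MLP standing in for the per-part ("piecewise") fields of the paper; all sizes are hypothetical:

    import torch
    import torch.nn as nn

    # Hypothetical field: map a 3D query point in posed space to a translation
    # vector; adding it yields the corresponding rest-pose position.
    field = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 3))

    query = torch.randn(1024, 3)     # points sampled around the posed mesh
    rest = query + field(query)      # predicted rest-pose correspondences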
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
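
The bipartite matching that POET borrows from transformer-based detection can be sketched in a few lines; the cost below (mean keypoint distance) is an illustrative stand-in for the paper's full set-based loss:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    pred = np.random.rand(5, 17, 2)   # 5 predicted poses, 17 keypoints each
    gt = np.random.rand(3, 17, 2)     # 3 annotated people in the image

    # Pairwise cost: mean Euclidean keypoint distance between every
    # (prediction, ground truth) pair; shape (5, 3).
    cost = np.linalg.norm(pred[:, None] - gt[None], axis=-1).mean(-1)

    # Hungarian matching yields the one-to-one assignment on which the
    # keypoint, visibility, center, and class losses are then computed.
    rows, cols = linear_sum_assignment(cost)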
- Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each of the evaluated tasks, across four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z)