Improved TokenPose with Sparsity
- URL: http://arxiv.org/abs/2311.09653v1
- Date: Thu, 16 Nov 2023 08:12:34 GMT
- Title: Improved TokenPose with Sparsity
- Authors: Anning Li
- Abstract summary: We introduce sparsity in both keypoint token attention and visual token attention to improve human pose estimation.
Experimental results on the MPII dataset demonstrate that our model achieves higher accuracy and confirm the feasibility of the method.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past few years, the vision transformer and its various forms have
gained significance in human pose estimation. By treating image patches as
tokens, transformers can capture global relationships effectively, estimate the
keypoint tokens by leveraging the visual tokens, and recognize the posture of
the human body. Nevertheless, global attention is computationally demanding,
which poses a challenge for scaling up transformer-based methods to
high-resolution features. In this paper, we introduce sparsity in both keypoint
token attention and visual token attention to improve human pose estimation.
Experimental results on the MPII dataset demonstrate that our model achieves
higher accuracy and confirm the feasibility of the method, setting new
state-of-the-art results. The idea can also serve as a reference for other
transformer-based models.
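The abstract does not spell out how the sparsity is imposed. A common way to sparsify both keypoint-token and visual-token attention is to keep only the top-k attention scores per query and mask the rest before the softmax; the PyTorch sketch below illustrates that idea under this assumption (the function name, shapes, and top-k scheme are illustrative, not the authors' implementation).

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_top):
    """Scaled dot-product attention that keeps only the k_top largest
    scores per query and masks the rest before the softmax.
    q, k, v: tensors of shape (batch, num_tokens, dim)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, Nq, Nk)
    topk_vals, _ = scores.topk(k_top, dim=-1)              # per-query top-k scores
    threshold = topk_vals[..., -1, None]                   # smallest retained score
    sparse_scores = scores.masked_fill(scores < threshold, float("-inf"))
    attn = F.softmax(sparse_scores, dim=-1)                # sparse attention weights
    return attn @ v                                        # (B, Nq, dim)

# Example: 16 keypoint tokens attending sparsely over 256 visual tokens.
kp = torch.randn(2, 16, 192)
vis = torch.randn(2, 256, 192)
out = topk_sparse_attention(kp, vis, vis, k_top=32)
print(out.shape)  # torch.Size([2, 16, 192])
```

Restricting each query to a small set of keys reduces the effective cost of the attention step, which is the motivation the abstract gives for scaling transformer-based methods to higher-resolution features.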
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- Enhancing Landmark Detection in Cluttered Real-World Scenarios with Vision Transformers [2.900522306460408]
This research contributes to the advancement of landmark detection in visual place recognition.
It shows the potential of leveraging vision transformers to overcome challenges posed by cluttered real-world scenarios.
arXiv Detail & Related papers (2023-08-25T21:01:01Z)
- MiVOLO: Multi-input Transformer for Age and Gender Estimation [0.0]
We present MiVOLO, a straightforward approach for age and gender estimation using the latest vision transformer.
Our method integrates both tasks into a unified dual input/output model.
We compare our model's age recognition performance with human-level accuracy and demonstrate that it significantly outperforms humans across a majority of age ranges.
arXiv Detail & Related papers (2023-07-10T14:58:10Z)
- Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics.
By comparing the transformer features of the recovered image and the target image, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One approach regards the features as vectors and computes the discrepancy in Euclidean space between the representations extracted from the recovered and target images.
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
- PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation [25.878375219234975]
We propose a token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and perform self-attention only within the selected tokens (a minimal sketch of this token-selection idea follows the list below).
We also propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates.
arXiv Detail & Related papers (2022-09-16T23:22:47Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in vision transformers.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
arXiv Detail & Related papers (2022-02-23T11:11:54Z)
- Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction [51.80191416661064]
We propose a novel vision transformer with latent variables following an informative energy-based prior for salient object detection.
Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation.
With the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image.
arXiv Detail & Related papers (2021-12-27T06:04:33Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
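As referenced in the PPT entry above, the following is a minimal sketch of pruning visual tokens before running self-attention, assuming the attention mass received from the keypoint tokens as the ranking signal; this is an illustrative simplification, not the paper's exact mask-based procedure.

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(kp_tokens, vis_tokens, keep_ratio=0.7):
    """Keep only the visual tokens that receive the most attention from the
    keypoint tokens; subsequent self-attention then runs on this subset."""
    d = kp_tokens.size(-1)
    scores = kp_tokens @ vis_tokens.transpose(-2, -1) / d ** 0.5  # (B, K, N)
    importance = F.softmax(scores, dim=-1).sum(dim=1)             # (B, N)
    n_keep = max(1, int(vis_tokens.size(1) * keep_ratio))
    idx = importance.topk(n_keep, dim=-1).indices                 # (B, n_keep)
    batch_idx = torch.arange(vis_tokens.size(0)).unsqueeze(-1)    # (B, 1)
    return vis_tokens[batch_idx, idx]                             # (B, n_keep, D)

kp = torch.randn(2, 16, 192)    # keypoint tokens
vis = torch.randn(2, 256, 192)  # visual (patch) tokens
kept = prune_visual_tokens(kp, vis, keep_ratio=0.7)
print(kept.shape)               # torch.Size([2, 179, 192])
```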
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.