Waterfall Transformer for Multi-person Pose Estimation
- URL: http://arxiv.org/abs/2411.18944v1
- Date: Thu, 28 Nov 2024 06:24:40 GMT
- Title: Waterfall Transformer for Multi-person Pose Estimation
- Authors: Navin Ranjan, Bruno Artacho, Andreas Savakis
- Abstract summary: We propose the Waterfall Transformer architecture for Pose estimation (WTPose)
WTPose is a single-pass, end-to-end trainable framework designed for multi-person pose estimation.
- Score: 3.2771631221674333
- License:
- Abstract: We propose the Waterfall Transformer architecture for Pose estimation (WTPose), a single-pass, end-to-end trainable framework designed for multi-person pose estimation. Our framework leverages a transformer-based waterfall module that generates multi-scale feature maps from various backbone stages. The module performs filtering in the cascade architecture to expand the receptive fields and to capture local and global context, thereby increasing the overall feature representation capability of the network. Our experiments on the COCO dataset demonstrate that the proposed WTPose architecture, with a modified Swin backbone and transformer-based waterfall module, outperforms other transformer architectures for multi-person pose estimation.
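The cascade idea in the abstract, where each branch refines a running context with features from successive backbone stages before the branch outputs are fused, can be sketched in a toy form. The function name, the scalar mixing weights, and the fusion by averaging below are all illustrative assumptions, not the paper's actual module:

```python
import numpy as np

def waterfall_fuse(stage_feats, weights):
    """Toy waterfall-style cascade: each branch mixes its stage's
    features into the context carried forward from the previous
    branch, and all branch outputs are averaged into one
    multi-scale representation. Illustrative only."""
    context = np.zeros_like(stage_feats[0])
    branch_outputs = []
    for feat, w in zip(stage_feats, weights):
        # Cascade step: the running context flows "down" into each branch.
        context = w * feat + (1.0 - w) * context
        branch_outputs.append(context.copy())
    return np.mean(branch_outputs, axis=0)

# Three backbone stages, already projected to a common (C, H, W) shape.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
fused = waterfall_fuse(feats, weights=[0.5, 0.5, 0.5])
print(fused.shape)  # (4, 8, 8)
```

Because later branches see the accumulated context of earlier ones, the effective receptive field grows through the cascade, which is the property the abstract attributes to the waterfall design.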
Related papers
- MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing [88.61523825903998]
Transformer networks are beginning to replace pure convolutional neural networks (CNNs) in the field of computer vision.
We propose a new Transformer variant, which applies the Taylor expansion to approximate the softmax-attention and achieves linear computational complexity.
We introduce a multi-branch architecture with multi-scale patch embedding to the proposed Transformer, which embeds features by overlapping deformable convolution of different scales.
Our model, named Multi-branch Transformer expanded by Taylor formula (MB-TaylorFormer), can embed coarse to fine features more flexibly at the patch embedding stage and capture long-distance pixel interactions with limited computational cost
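The linear-complexity trick described above can be sketched with a first-order Taylor expansion, exp(q·k) ≈ 1 + q·k, which lets the attention sums be factored so the cost is linear in sequence length. This is a minimal sketch of the general idea, not MB-TaylorFormer's exact formulation:

```python
import numpy as np

def taylor_linear_attention(Q, K, V):
    """Softmax attention with exp(q.k) replaced by its first-order
    Taylor expansion 1 + q.k. The key/value sums are precomputed
    once, so cost is O(n * d^2) instead of O(n^2 * d)."""
    n, d = Q.shape
    kv = K.T @ V                # (d, d), shared across all queries
    k_sum = K.sum(axis=0)       # (d,)
    v_sum = V.sum(axis=0)       # (d,)
    numer = v_sum + Q @ kv      # (n, d)
    denom = n + Q @ k_sum       # (n,)
    return numer / denom[:, None]

rng = np.random.default_rng(1)
Q = 0.1 * rng.standard_normal((6, 4))
K = 0.1 * rng.standard_normal((6, 4))
V = rng.standard_normal((6, 4))
out = taylor_linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

For small attention scores the approximation tracks exact softmax attention closely; the paper pairs this with correction terms and multi-scale patch embedding, which this sketch omits.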
arXiv Detail & Related papers (2023-08-27T08:10:23Z) - Swin-Pose: Swin Transformer Based Human Pose Estimation [16.247836509380026]
Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks.
CNNs have a fixed receptive field and lack the ability of long-range perception, which is crucial to human pose estimation.
We propose a novel model based on transformer architecture, enhanced with a feature pyramid fusion structure.
arXiv Detail & Related papers (2022-01-19T02:15:26Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - BAPose: Bottom-Up Pose Estimation with Disentangled Waterfall Representations [3.8073142980733]
BAPose is a novel framework that achieves state-of-the-art results for multi-person pose estimation.
Our results on the challenging COCO and CrowdPose datasets demonstrate that BAPose is an efficient and robust framework.
arXiv Detail & Related papers (2021-12-20T18:07:09Z) - UniNet: Unified Architecture Search with Convolution, Transformer, and MLP [62.401161377258234]
In this paper, we propose to jointly search the optimal combination of convolution, transformer, and MLP for building a series of all-operator network architectures.
We identify that the widely-used strided convolution or pooling based down-sampling modules become the performance bottlenecks when operators are combined to form a network.
To better handle the global context captured by transformer operators, we propose two novel context-aware down-sampling modules.
arXiv Detail & Related papers (2021-10-08T11:09:40Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z) - End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
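A set-based global loss of the kind described above needs a bipartite assignment between predictions and ground-truth instances before per-pair losses are summed. The toy below finds that assignment by brute force over permutations (DETR-style models use the Hungarian algorithm); the function name, shapes, and L2 cost are illustrative assumptions, not POET's exact interface:

```python
import itertools
import numpy as np

def match_poses(pred, target):
    """Toy bipartite matching for a set-based loss: return the
    permutation of targets that minimizes the total L2 cost of
    pairing each prediction with one target. Brute force, so only
    suitable for tiny sets."""
    cost = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(len(target))):
        c = cost[range(len(pred)), perm].sum()
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

# Two predicted "poses" (flattened keypoints) that match the targets
# in swapped order; matching recovers the correct pairing.
pred = np.array([[1.0, 1.0], [0.0, 0.0]])
target = np.array([[0.0, 0.0], [1.0, 1.0]])
perm, cost = match_poses(pred, target)
print(perm, cost)  # (1, 0) 0.0
```

Once the pairing is fixed, the keypoint, visibility, center, and class losses mentioned in the summary would be computed on the matched pairs and summed into the global loss.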
arXiv Detail & Related papers (2021-03-22T18:19:22Z) - OmniPose: A Multi-Scale Framework for Multi-Person Pose Estimation [3.8073142980733]
We propose a single-pass, end-to-end trainable framework that achieves state-of-the-art results for multi-person pose estimation.
Our results on multiple datasets demonstrate that OmniPose is a robust and efficient architecture for multi-person pose estimation.
arXiv Detail & Related papers (2021-03-18T11:30:31Z) - UniPose: Unified Human Pose Estimation in Single Images and Videos [3.04585143845864]
We propose a unified framework for human pose estimation, based on our "Waterfall" Atrous Spatial Pooling architecture.
UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage.
Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation.
arXiv Detail & Related papers (2020-01-22T15:59:42Z)
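The "Waterfall" Atrous Spatial Pooling named in the UniPose summary chains atrous (dilated) filters in a cascade rather than running them in parallel, so each branch builds on the previous one's enlarged receptive field. A minimal 1-D sketch of that idea, with an assumed smoothing kernel and sum-fusion rather than UniPose's actual module:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid-mode 1-D convolution with a dilated 3-tap kernel."""
    k = len(kernel)
    span = (k - 1) * dilation
    out = np.zeros(len(x) - span)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

def waterfall_asp(x, rates=(1, 2, 4)):
    """Toy waterfall-style atrous pooling: each branch filters the
    previous branch's output at a larger dilation rate, and all
    branch outputs are cropped to a common length and summed."""
    kernel = np.array([0.25, 0.5, 0.25])
    outputs, h = [], x
    for r in rates:
        h = dilated_conv1d(h, kernel, r)  # cascade: feeds the next branch
        outputs.append(h)
    m = min(len(o) for o in outputs)
    return sum(o[:m] for o in outputs)

signal = np.linspace(0.0, 1.0, 32)
fused = waterfall_asp(signal)
print(fused.shape)  # (18,)
```

Because the rates compound through the cascade, the last branch sees a much wider context than any single dilation rate would give in a parallel ASPP-style layout.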
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.