MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection
- URL: http://arxiv.org/abs/2412.01422v1
- Date: Mon, 02 Dec 2024 12:03:32 GMT
- Title: MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection
- Authors: Yonghao Dang, Liyuan Liu, Hui Kang, Ping Ye, Jianqin Yin
- Abstract summary: MamKPD is the first efficient yet effective Mamba-based pose estimation framework for 2D keypoint detection. By combining Mamba for global modeling across all patches, MamKPD effectively extracts instances' pose information. MamKPD-L achieves 77.3% AP on the COCO dataset at 1492 FPS on an NVIDIA RTX 4090 GPU.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time 2D keypoint detection plays an essential role in computer vision. Although CNN-based and Transformer-based methods have achieved breakthrough progress, they often fail to deliver both high accuracy and real-time speed. This paper introduces MamKPD, the first efficient yet effective Mamba-based pose estimation framework for 2D keypoint detection. The conventional Mamba module exhibits limited information interaction between patches. To address this, we propose a lightweight contextual modeling module (CMM) that uses depth-wise convolutions to model inter-patch dependencies and linear layers to distill the pose cues within each patch. Subsequently, by combining Mamba for global modeling across all patches, MamKPD effectively extracts instances' pose information. We conduct extensive experiments on human and animal pose estimation datasets to validate the effectiveness of MamKPD. Our MamKPD-L achieves 77.3% AP on the COCO dataset at 1492 FPS on an NVIDIA RTX 4090 GPU. Moreover, MamKPD achieves state-of-the-art results on the MPII dataset and competitive results on the AP-10K dataset while saving 85% of the parameters compared to ViTPose. Our project page is available at https://mamkpd.github.io/.
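The CMM pipeline described in the abstract (depth-wise convolution for inter-patch mixing, then a per-patch linear layer) can be sketched in a few lines. This is a minimal NumPy illustration under assumed shapes; the kernel size, activation, and function names are illustrative guesses, not the paper's actual implementation.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    # x: (C, H, W); kernels: (C, k, k); one kernel per channel,
    # 'same' padding, stride 1 -- each channel is convolved independently.
    C, H, W = x.shape
    k = kernels.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i+k, j:j+k] * kernels[c])
    return out

def cmm_block(x, dw_kernels, w_lin):
    """Contextual modeling sketch: a depth-wise conv mixes neighboring
    patches; a per-patch linear layer then distills cues within each patch,
    producing tokens ready for a global (e.g. Mamba-style) mixer."""
    mixed = depthwise_conv2d(x, dw_kernels)        # inter-patch mixing
    C, H, W = mixed.shape
    tokens = mixed.reshape(C, H * W).T             # (N_patches, C)
    return np.maximum(tokens @ w_lin, 0.0)         # linear + ReLU per patch

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))                 # 8 channels, 4x4 patch grid
out = cmm_block(x, rng.standard_normal((8, 3, 3)), rng.standard_normal((8, 8)))
print(out.shape)  # (16, 8): one token per patch
```

A real implementation would use a framework's grouped convolution (e.g. a depth-wise `Conv2d`) rather than explicit loops; the loop form is only to make the per-channel structure explicit.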
Related papers
- Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video [57.805927523341516]
Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion. We will publicly release the source code of Mamba-OTR to support future research.
arXiv Detail & Related papers (2025-07-22T08:23:51Z)
- MedSegMamba: 3D CNN-Mamba Hybrid Architecture for Brain Segmentation [15.514511820130474]
We develop a 3D patch-based hybrid CNN-Mamba model for subcortical brain segmentation.
Our model's performance was validated against several benchmarks.
arXiv Detail & Related papers (2024-09-12T02:19:19Z)
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
- Keypoint Aware Masked Image Modelling [0.34530027457862006]
KAMIM improves top-1 linear probing accuracy from 16.12% to 33.97%, and finetuning accuracy from 76.78% to 77.3%, when tested on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs.
We also analyze the learned representations of a ViT-B trained using KAMIM and observe that they behave similarly to those from contrastive learning, with longer attention distances and homogeneous self-attention across layers.
arXiv Detail & Related papers (2024-07-18T19:41:46Z)
- Vision Mamba for Classification of Breast Ultrasound Images [9.90112908284836]
Mamba-based models, VMamba and Vim, are a recent family of vision encoders that offer promising performance improvements in many computer vision tasks.
This paper compares Mamba-based models with traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) using the breast ultrasound BUSI dataset and Breast Ultrasound B dataset.
arXiv Detail & Related papers (2024-07-04T00:21:47Z)
- MaIL: Improving Imitation Learning with Mamba [30.96458274130313]
Mamba Imitation Learning (MaIL) provides an alternative to state-of-the-art (SoTA) Transformer-based policies.
Mamba's architecture enhances representation learning efficiency by focusing on key features.
MaIL consistently outperforms Transformers on all LIBERO tasks with limited data.
arXiv Detail & Related papers (2024-06-12T14:01:12Z)
- MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection [72.46396769642787]
We develop a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient infrared small target detection.
MiM-ISTD is 8x faster than the SOTA method and reduces GPU memory usage by 62.2% when testing on 2048x2048 images.
arXiv Detail & Related papers (2024-03-04T15:57:29Z)
- Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining [85.08169822181685]
This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks.
Swin-UMamba demonstrates superior performance by a large margin compared to CNNs, ViTs, and the latest Mamba-based models.
arXiv Detail & Related papers (2024-02-05T18:58:11Z)
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation [79.78017059539526]
We propose a new heatmap-free keypoint estimation method in which individual keypoints and sets of spatially related keypoints (i.e., poses) are modeled as objects within a dense single-stage anchor-based detection framework.
In experiments, we observe that KAPAO is significantly faster and more accurate than previous methods, which suffer greatly from heatmap post-processing.
Our large model, KAPAO-L, achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without test-time augmentation.
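The heatmap-free idea above (reading keypoint locations directly from dense anchor predictions, with no heatmap post-processing) can be illustrated with a minimal NumPy sketch. The grid size, offset parameterization, and the `decode_keypoints` helper are hypothetical simplifications, not KAPAO's actual design.

```python
import numpy as np

def decode_keypoints(scores, offsets, stride=8):
    # scores: (H, W) objectness per anchor cell; offsets: (H, W, 2) sub-cell
    # (dy, dx) offsets. The keypoint is read from the highest-scoring anchor
    # directly -- no heatmap refinement step.
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    dy, dx = offsets[i, j]
    return (float((j + dx) * stride), float((i + dy) * stride))  # (x, y) px

scores = np.zeros((4, 4)); scores[2, 1] = 0.9        # best anchor at row 2, col 1
offsets = np.zeros((4, 4, 2)); offsets[2, 1] = (0.5, 0.25)
print(decode_keypoints(scores, offsets))  # (10.0, 20.0)
```

In a full detector each anchor would predict a whole pose object (a set of keypoints plus a box) and overlapping candidates would be merged with NMS; this sketch keeps only the anchor-to-coordinate decoding step.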
arXiv Detail & Related papers (2021-11-16T15:36:44Z)
- Learning Delicate Local Representations for Multi-Person Pose Estimation [77.53144055780423]
We propose a novel method called Residual Steps Network (RSN).
RSN aggregates features with the same spatial size (Intra-level features) efficiently to obtain delicate local representations.
Our approach won the 1st place of COCO Keypoint Challenge 2019.
arXiv Detail & Related papers (2020-03-09T10:40:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.