Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation
- URL: http://arxiv.org/abs/2502.18214v1
- Date: Tue, 25 Feb 2025 13:58:37 GMT
- Title: Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation
- Authors: Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu,
- Abstract summary: We propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation.<n>Our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts.<n>Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting.
- Score: 24.010615027857007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints.
Related papers
- Decomposing Query-Key Feature Interactions Using Contrastive Covariances [75.38737409771085]
We study the query-key space -- the bilinear joint embedding space between queries and keys.<n>It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced.
arXiv Detail & Related papers (2026-02-04T16:50:02Z) - Instance-Adaptive Keypoint Learning with Local-to-Global Geometric Aggregation for Category-Level Object Pose Estimation [19.117822086210513]
INKL-Pose is a novel category-level object pose estimation framework.
It enables INstance-adaptive Keypoint Learning with local-to-global geometric aggregation.
Experiments on CAMERA25, REAL275, and HouseCat6D demonstrate that INKL-Pose achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-04-21T14:37:37Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - SCAPE: A Simple and Strong Category-Agnostic Pose Estimator [6.705257644513057]
Category-Agnostic Pose Estimation (CAPE) aims to localize keypoints on an object of any category given few exemplars in an in-context manner.
We introduce two key modules: a global keypoint feature perceptor to inject global semantic information into support keypoints, and a keypoint attention refiner to enhance inter-node correlation between keypoints.
SCAPE outperforms prior arts by 2.2 and 1.3 PCK under 1-shot and 5-shot settings with faster inference speed and lighter model capacity.
arXiv Detail & Related papers (2024-07-18T13:02:57Z) - Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
Open-Vocabulary Keypoint Detection (OVKD) task is innovatively designed to use text prompts for identifying arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM)
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z) - Pose for Everything: Towards Category-Agnostic Pose Estimation [93.07415325374761]
Category-Agnostic Pose Estimation (CAPE) aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition.
A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images.
We also introduce Multi-category Pose (MP-100) dataset, which is a 2D pose dataset of 100 object categories containing over 20K instances and is well-designed for developing CAPE algorithms.
arXiv Detail & Related papers (2022-07-21T09:40:54Z) - CLAMP: Prompt-based Contrastive Learning for Connecting Language and
Animal Pose [70.59906971581192]
We introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose effectively.
The CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training.
Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings.
arXiv Detail & Related papers (2022-06-23T14:51:42Z) - Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression [81.05772887221333]
We study the dense keypoint regression framework that is previously inferior to the keypoint detection and grouping framework.
We present a simple yet effective approach, named disentangled keypoint regression (DEKR)
We empirically show that the proposed direct regression method outperforms keypoint detection and grouping methods.
arXiv Detail & Related papers (2021-04-06T05:54:46Z) - TransPose: Towards Explainable Human Pose Estimation by Transformer [17.39838556906491]
We construct an explainable model named TransPose based on Transformer architecture and low-level convolutional blocks.
Given an image, the attention layers built in Transformer can capture long-range spatial relationships between keypoints.
Experiments show that TransPose can accurately predict the positions of keypoints.
arXiv Detail & Related papers (2020-12-28T12:33:52Z) - A Trainable Optimal Transport Embedding for Feature Aggregation and its
Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.