Related papers: DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation

URL: http://arxiv.org/abs/2508.08783v1
Date: Tue, 12 Aug 2025 09:37:09 GMT
Title: DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation
Authors: Tianyu Xiong, Dayi Tan, Wei Tian
Abstract summary: We introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models.
Score: 1.1708207558288541
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Animal pose estimation is a fundamental task in computer vision, with growing importance in ecological monitoring, behavioral analysis, and intelligent livestock management. Compared to human pose estimation, animal pose estimation is more challenging due to high interspecies morphological diversity, complex body structures, and limited annotated data. In this work, we introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models. To enhance semantic guidance during keypoint generation, we leverage large language models (LLMs) to extract both global anatomical priors and local keypoint-wise semantics based on species-specific prompts. These textual priors are encoded and fused with image features via cross-attention modules to provide biologically meaningful constraints throughout the denoising process. Additionally, a diffusion-based keypoint decoder is designed to progressively refine pose predictions, improving robustness to occlusion and annotation sparsity. Extensive experiments on public animal pose datasets demonstrate the effectiveness and generalization capability of our method, especially under challenging scenarios with diverse species, cluttered backgrounds, and incomplete keypoints.

Related papers

Denoised Diffusion for Object-Focused Image Augmentation [0.6109833303919141]
We propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios.
arXiv Detail & Related papers (2025-10-10T03:03:40Z)
A Review on Coarse to Fine-Grained Animal Action Recognition [23.001797172183345]
This review explores the field of animal action recognition, focusing on coarse-grained (CG) and fine-grained (FG) techniques. It examines the current state of research in animal behaviour recognition and elucidates the unique challenges associated with recognising subtle animal actions in outdoor environments. The review outlines future directions for advancing fine-grained action recognition, aiming to improve accuracy and generalisability in behaviour analysis across species.
arXiv Detail & Related papers (2025-06-01T23:31:25Z)
Categorical Keypoint Positional Embedding for Robust Animal Re-Identification [22.979350771097966]
Animal re-identification (ReID) has become an indispensable tool in ecological research. Unlike human ReID, animal ReID faces significant challenges due to the high variability in animal poses, diverse environmental conditions, and the inability to directly apply pre-trained models to animal data. This work introduces an innovative keypoint propagation mechanism, which utilizes a single annotated pre-trained diffusion model.
arXiv Detail & Related papers (2024-12-01T14:09:00Z)
Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models. We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions [57.871692507044344]
Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images. Current models are typically trained and tested on clean data, potentially overlooking the corruption during real-world deployment. We introduce PoseBench, a benchmark designed to evaluate the robustness of pose estimation models against real-world corruption.
arXiv Detail & Related papers (2024-06-20T14:40:17Z)
Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z)
Denoising Diffusion Semantic Segmentation with Mask Prior Modeling [61.73352242029671]
We propose to improve the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a denoising diffusion generative model. We evaluate the proposed prior modeling with several off-the-shelf segmentors, and our experimental results on ADE20K and Cityscapes demonstrate that our approach achieves competitive quantitative performance.
arXiv Detail & Related papers (2023-06-02T17:47:01Z)
CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose [70.59906971581192]
We introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose effectively. The CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training. Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings.
arXiv Detail & Related papers (2022-06-23T14:51:42Z)
SemiMultiPose: A Semi-supervised Multi-animal Pose Estimation Framework [10.523555645910255]
Multi-animal pose estimation is essential for studying animals' social behaviors in neuroscience and neuroethology. We propose a novel semi-supervised architecture for multi-animal pose estimation, leveraging the pervasive structures in unlabeled frames in behavior videos. The resulting algorithm will provide superior multi-animal pose estimation results on three animal experiments.
arXiv Detail & Related papers (2022-04-14T16:06:55Z)
SuperAnimal pretrained pose estimation models for behavioral analysis [42.206265576708255]
Quantification of behavior is critical in applications ranging from neuroscience and veterinary medicine to animal conservation efforts. We present a series of technical innovations that enable a new method, collectively called SuperAnimal, to develop unified foundation models.
arXiv Detail & Related papers (2022-03-14T18:46:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.