AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer
- URL: http://arxiv.org/abs/2412.00837v2
- Date: Sat, 05 Jul 2025 05:05:35 GMT
- Title: AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer
- Authors: Jin Lyu, Tianyi Zhu, Yi Gu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang, Liang An
- Abstract summary: This paper presents AniMer to estimate animal pose and shape using a family-aware Transformer. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal-family supervised contrastive learning scheme. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels.
- Score: 29.97192007630272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and limited multi-species datasets leave this problem underexplored. To this end, this paper presents AniMer to estimate animal pose and shape using a family-aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal-family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, with either 3D or 2D labels. To improve the diversity of 3D-labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family-aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on the out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate the effectiveness of our network design and CtrlAni3D in enhancing the performance of AniMer for in-the-wild applications. The project page of AniMer is https://luoxue-star.github.io/AniMer_project_page/.
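The abstract does not give the form of the family-supervised contrastive objective, but schemes of this kind typically follow the supervised contrastive (SupCon) formulation: embeddings of images from the same animal family are pulled together, all others pushed apart. A minimal NumPy sketch under that assumption (the function name, temperature, and family IDs are illustrative, not from the paper):

```python
import numpy as np

def family_supcon_loss(features, family_labels, temperature=0.1):
    """SupCon-style loss over animal-family labels (illustrative sketch).

    features: (N, D) embeddings from the backbone (normalized below).
    family_labels: (N,) integer family IDs, e.g. felidae=0, canidae=1.
    """
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature       # pairwise similarities
    logits_mask = 1.0 - np.eye(len(features))       # exclude self-pairs
    # Positives: samples sharing a family label, excluding self.
    pos_mask = (family_labels[:, None] == family_labels[None, :]) * logits_mask
    # Log-softmax of each similarity over all other samples.
    exp_sim = np.exp(sim) * logits_mask
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    # Average log-likelihood of positives per anchor (anchors without
    # positives are skipped).
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # four hypothetical families
loss = family_supcon_loss(feats, labels)
print(float(loss) > 0)
```

The loss is strictly positive because each positive's log-softmax term is bounded above by zero; minimizing it concentrates each family's embeddings, which is the discriminative shape understanding the abstract describes.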
Related papers
- AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer [26.738709781346678]
We introduce AniMer+, an extended version of our scalable AniMer framework. A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT). We produce two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds.
arXiv Detail & Related papers (2025-08-01T03:53:03Z)
- Generative Zoo [41.65977386204797]
We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding ground-truth pose and shape parameters.
We train a 3D pose and shape regressor on GenZoo, which achieves state-of-the-art performance on a real-world animal pose and shape estimation benchmark.
arXiv Detail & Related papers (2024-12-11T04:57:53Z)
- Multispecies Animal Re-ID Using a Large Community-Curated Dataset [0.19418036471925312]
We construct a dataset that includes 49 species, 37K individual animals, and 225K images, using this data to train a single embedding network for all species.
Our model consistently outperforms models trained separately on each species, achieving an average gain of 12.5% in top-1 accuracy.
The model is already in production use for 60+ species in a large-scale wildlife monitoring system.
arXiv Detail & Related papers (2024-12-07T09:56:33Z)
- Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z)
- Learning the 3D Fauna of the Web [70.01196719128912]
We develop 3D-Fauna, an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly.
One crucial bottleneck of modeling animals is the limited availability of training data.
We show that prior category-specific attempts fail to generalize to rare species with limited training images.
arXiv Detail & Related papers (2024-01-04T18:32:48Z)
- Virtual Pets: Animatable Animal Generation in 3D Scenes [84.0990909455833]
We introduce Virtual Pet, a novel pipeline to model realistic and diverse motions for target animal species within a 3D environment.
We leverage monocular internet videos and extract deformable NeRF representations for the foreground and static NeRF representations for the background.
We develop a reconstruction strategy, encompassing species-level shared template learning and per-video fine-tuning.
arXiv Detail & Related papers (2023-12-21T18:59:30Z)
- Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats [80.12253291709673]
We propose a novel affine-combining autoencoder (ACAE) method to perform dimensionality reduction on the number of landmarks.
Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model.
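The affine-combining idea behind the ACAE can be sketched concretely: landmarks are mapped to a smaller latent set, and back, through weight matrices whose rows sum to one, which makes the mapping equivariant to rotations and translations of the pose. A minimal NumPy illustration with hypothetical random weights (the real ACAE learns them; names here are not from the paper):

```python
import numpy as np

def acae_roundtrip(keypoints, W_enc, W_dec):
    """One affine-combining autoencoder pass (illustrative sketch).

    keypoints: (J, 3) landmarks in one skeleton format.
    W_enc: (L, J) affine weights (rows sum to 1) giving L latent points.
    W_dec: (J, L) affine weights mapping latent points back to J landmarks.
    """
    latent = W_enc @ keypoints   # (L, 3) latent landmarks
    recon = W_dec @ latent       # (J, 3) reconstructed landmarks
    return latent, recon

J, L = 6, 3
rng = np.random.default_rng(1)
# Hypothetical weights: random positives normalized to sum to 1 per row.
W_enc = rng.random((L, J)); W_enc /= W_enc.sum(axis=1, keepdims=True)
W_dec = rng.random((J, L)); W_dec /= W_dec.sum(axis=1, keepdims=True)
pose = rng.normal(size=(J, 3))

latent, recon = acae_roundtrip(pose, W_enc, W_dec)

# Translation equivariance: shifting the pose shifts the latent points
# identically, because each row of W_enc sums to 1.
t = np.array([1.0, -2.0, 0.5])
latent_t, _ = acae_roundtrip(pose + t, W_enc, W_dec)
print(np.allclose(latent_t, latent + t))  # True
```

This equivariance is what lets a single low-dimensional latent skeleton bridge the differing landmark conventions of many datasets.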
arXiv Detail & Related papers (2022-12-29T22:22:49Z)
- Prior-Aware Synthetic Data to the Rescue: Animal Pose Estimation with Very Limited Real Data [18.06492246414256]
We present a data efficient strategy for pose estimation in quadrupeds that requires only a small amount of real images from the target animal.
We confirm that fine-tuning a backbone network pretrained on generic image datasets such as ImageNet mitigates the high demand for target animal pose data.
We introduce a prior-aware synthetic animal data generation pipeline called PASyn to augment the animal pose data essential for robust pose estimation.
arXiv Detail & Related papers (2022-08-30T01:17:50Z)
- Towards Individual Grevy's Zebra Identification via Deep 3D Fitting and Metric Learning [2.004276260443012]
This paper combines deep learning techniques for species detection, 3D model fitting, and metric learning in one pipeline to perform individual animal identification.
We show in a small study on the SMALST dataset that the use of 3D model fitting can indeed benefit performance.
Back-projected textures from 3D fitted models improve identification accuracy from 48.0% to 56.8% compared to 2D bounding box approaches.
arXiv Detail & Related papers (2022-06-05T20:44:54Z)
- Coarse-to-fine Animal Pose and Shape Estimation [67.39635503744395]
We propose a coarse-to-fine approach to reconstruct 3D animal mesh from a single image.
The coarse estimation stage first estimates the pose, shape and translation parameters of the SMAL model.
The estimated meshes are then used as a starting point by a graph convolutional network (GCN) to predict a per-vertex deformation in the refinement stage.
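The refinement stage can be pictured as a graph convolution over the mesh: each vertex averages its neighbors' features and maps them to a 3D offset added to the coarse SMAL mesh. A minimal NumPy sketch of one such step, with hypothetical shapes and randomly initialized weights (the actual GCN architecture is not specified here):

```python
import numpy as np

def gcn_refine(vertices, features, adjacency, W):
    """One graph-convolution refinement step (illustrative sketch).

    vertices:  (V, 3) coarse mesh vertices.
    features:  (V, F) per-vertex features (e.g. image features + coordinates).
    adjacency: (V, V) mesh adjacency with self-loops.
    W:         (F, 3) weights mapping aggregated features to offsets.
    """
    # Row-normalize so each vertex averages features over its neighborhood.
    deg = adjacency.sum(axis=1, keepdims=True)
    aggregated = (adjacency / deg) @ features
    offsets = aggregated @ W               # predicted per-vertex deformation
    return vertices + offsets

V, F = 5, 4
rng = np.random.default_rng(2)
verts = rng.normal(size=(V, 3))
feats = rng.normal(size=(V, F))
# Toy chain-mesh adjacency with self-loops.
adj = np.eye(V) + np.diag(np.ones(V - 1), 1) + np.diag(np.ones(V - 1), -1)
W = rng.normal(size=(F, 3)) * 0.01         # small weights: refinement only
refined = gcn_refine(verts, feats, adj, W)
print(refined.shape)  # (5, 3)
```

Keeping the predicted offsets small reflects the coarse-to-fine design: the parametric SMAL fit carries the pose, and the GCN only corrects residual surface detail.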
arXiv Detail & Related papers (2021-11-16T01:27:20Z)
- AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild [51.35013619649463]
We present an extensive dataset of free-running cheetahs in the wild, called AcinoSet.
The dataset contains 119,490 frames of multi-view synchronized high-speed video footage, camera calibration files and 7,588 human-annotated frames.
The resulting 3D trajectories, human-checked 3D ground truth, and an interactive tool to inspect the data are also provided.
arXiv Detail & Related papers (2021-03-24T15:54:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.