Web-Scale Collection of Video Data for 4D Animal Reconstruction
- URL: http://arxiv.org/abs/2511.01169v1
- Date: Mon, 03 Nov 2025 02:40:06 GMT
- Title: Web-Scale Collection of Video Data for 4D Animal Reconstruction
- Authors: Brian Nlong Zhao, Jiajun Wu, Shangzhe Wu
- Abstract summary: We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips. Using this pipeline, we amass 30K videos (2M frames)--an order of magnitude more than prior works. We present Animal-in-Motion, a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions.
- Score: 26.179284343904897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited--offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)--an order of magnitude more than prior works. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower--revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.
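The abstract only names the mining pipeline; as a rough illustration of what "processing YouTube videos into object-centric clips" can involve, the sketch below runs an off-the-shelf COCO detector over video frames and keeps square, subject-centered crops whenever a quadruped class is confidently detected. This is a minimal, hypothetical sketch, not the authors' implementation: the detector choice, class set, thresholds, padding, and the function name `object_centric_crops` are all assumptions.

```python
# Hypothetical sketch (not the authors' pipeline): detect a quadruped in
# each frame with an off-the-shelf COCO detector and yield a square,
# subject-centered crop. Thresholds and padding are illustrative.
import cv2
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

# COCO category IDs for common quadrupeds (cat, dog, horse, ..., giraffe).
QUADRUPED_IDS = {17, 18, 19, 20, 21, 22, 23, 24, 25}

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def object_centric_crops(video_path, score_thresh=0.8, pad=1.2):
    """Yield square crops centered on the best-scoring quadruped per frame."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            det = model([to_tensor(rgb)])[0]
        hits = [(s.item(), b) for s, l, b in
                zip(det["scores"], det["labels"], det["boxes"])
                if l.item() in QUADRUPED_IDS and s.item() >= score_thresh]
        if not hits:
            continue  # a full pipeline would cut the clip here instead
        _, box = max(hits, key=lambda t: t[0])
        x0, y0, x1, y1 = box.tolist()
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        half = max(x1 - x0, y1 - y0) * pad / 2  # padded square window
        h, w = frame.shape[:2]
        ya, yb = int(max(cy - half, 0)), int(min(cy + half, h))
        xa, xb = int(max(cx - half, 0)), int(min(cx + half, w))
        yield frame[ya:yb, xa:xb]
    cap.release()
```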
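Likewise, the "sequence-level optimization" behind the 4D baseline is only named, not detailed. A common way to realize the idea is to refine per-frame parameters jointly under a temporal smoothness prior, which suppresses frame-to-frame jitter in otherwise independent per-frame fits. The sketch below assumes a differentiable per-frame data term (the `reprojection_loss` callable is a hypothetical stand-in for whatever the underlying model-free method exposes); the weights and step count are illustrative.

```python
# Hedged sketch of sequence-level optimization: jointly refine per-frame
# parameters with a temporal smoothness prior. `reprojection_loss` is a
# hypothetical stand-in for the method's per-frame data term.
import torch

def optimize_sequence(init_params, reprojection_loss,
                      steps=200, lr=1e-2, smooth_weight=10.0):
    params = torch.nn.Parameter(init_params.clone())  # (T, D): T frames
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        data = reprojection_loss(params)                   # fit each frame
        smooth = (params[1:] - params[:-1]).pow(2).mean()  # penalize jumps
        (data + smooth_weight * smooth).backward()
        opt.step()
    return params.detach()
```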
Related papers
- BigMaQ: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations [38.868479054644354]
Recognition of dynamic and social behavior in animals is fundamental for advancing ethology, ecology, medicine and neuroscience. Recent progress in deep learning has enabled automated behavior recognition from video, yet accurate reconstruction of three-dimensional (3D) pose and shape has not been integrated into this process. BigMaQ establishes the first dataset that integrates dynamic 3D pose-shape representations into the learning task of animal action recognition.
arXiv Detail & Related papers (2026-02-23T14:21:15Z)
- 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos [15.063635374924209]
We propose 4D-Animal, a novel framework that reconstructs animatable 3D animals from videos without requiring sparse keypoint annotations. Our approach introduces a dense feature network that maps 2D representations to SMAL parameters, enhancing both the efficiency and stability of the fitting process.
arXiv Detail & Related papers (2025-07-14T16:24:31Z)
- E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z)
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [69.51086319339662]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z)
- Recurrence over Video Frames (RoVF) for the Re-identification of Meerkats [4.512615837610558]
We propose a method called Recurrence over Video Frames (RoVF), which uses a recurrent head based on the Perceiver architecture to iteratively construct an embedding from a video clip.
We tested this method and various models based on the DINOv2 transformer architecture on a dataset of meerkats collected at the Wellington Zoo.
Our method achieves a top-1 re-identification accuracy of 49%, which is higher than that of the best DINOv2 model (42%).
arXiv Detail & Related papers (2024-06-18T18:44:19Z)
- Virtual Pets: Animatable Animal Generation in 3D Scenes [84.0990909455833]
We introduce Virtual Pet, a novel pipeline to model realistic and diverse motions for target animal species within a 3D environment.
We leverage monocular internet videos and extract deformable NeRF representations for the foreground and static NeRF representations for the background.
We develop a reconstruction strategy, encompassing species-level shared template learning and per-video fine-tuning.
arXiv Detail & Related papers (2023-12-21T18:59:30Z)
- Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos [47.97168047776216]
We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos.
Our model learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features.
arXiv Detail & Related papers (2023-12-21T06:44:18Z)
- Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories [80.30216777363057]
We introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets.
At test time, given a small number of video frames of an unseen object, Tracker-NeRF predicts the trajectories of its 3D points and generates new views.
Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines.
arXiv Detail & Related papers (2022-11-07T22:42:42Z)
- APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking [77.87449881852062]
APT-36K is the first large-scale benchmark for animal pose estimation and tracking.
It consists of 2,400 video clips collected and filtered from 30 animal species with 15 frames for each video, resulting in 36,000 frames in total.
We benchmark several representative models on the following three tracks: (1) supervised animal pose estimation on a single frame under intra- and inter-domain transfer learning settings, (2) inter-species domain generalization test for unseen animals, and (3) animal pose estimation with animal tracking.
arXiv Detail & Related papers (2022-06-12T07:18:36Z)
- AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild [51.35013619649463]
We present an extensive dataset of free-running cheetahs in the wild, called AcinoSet.
The dataset contains 119,490 frames of multi-view synchronized high-speed video footage, camera calibration files and 7,588 human-annotated frames.
The resulting 3D trajectories, human-checked 3D ground truth, and an interactive tool to inspect the data are also provided.
arXiv Detail & Related papers (2021-03-24T15:54:11Z)
- ZooBuilder: 2D and 3D Pose Estimation for Quadrupeds Using Synthetic Data [2.3661942553209236]
We train 2D and 3D pose estimation models with synthetic data, and put in place an end-to-end pipeline called ZooBuilder.
The pipeline takes as input a video of an animal in the wild, and generates the corresponding 2D and 3D coordinates for each joint of the animal's skeleton.
arXiv Detail & Related papers (2020-09-01T07:41:20Z)