MBW: Multi-view Bootstrapping in the Wild
- URL: http://arxiv.org/abs/2210.01721v1
- Date: Tue, 4 Oct 2022 16:27:54 GMT
- Title: MBW: Multi-view Bootstrapping in the Wild
- Authors: Mosam Dabhi, Chaoyang Wang, Tim Clifford, Laszlo Attila Jeni, Ian R.
Fasel, Simon Lucey
- Abstract summary: Multi-camera systems that train fine-grained detectors have shown promise in detecting the errors such detectors make when trained from only a few examples.
The approach is based on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios.
In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates.
We are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches.
- Score: 30.038254895713276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Labeling articulated objects in unconstrained settings has a wide variety of
applications including entertainment, neuroscience, psychology, ethology, and
many fields of medicine. Large offline labeled datasets do not exist for all
but the most common articulated object categories (e.g., humans). Hand labeling
these landmarks within a video sequence is a laborious task. Learned landmark
detectors can help, but can be error-prone when trained from only a few
examples. Multi-camera systems that train fine-grained detectors have shown
significant promise in detecting such errors, allowing for self-supervised
solutions that only need a small percentage of the video sequence to be
hand-labeled. The approach, however, is based on calibrated cameras and rigid
geometry, making it expensive, difficult to manage, and impractical in
real-world scenarios. In this paper, we address these bottlenecks by combining
a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark
estimates from videos with only two or three uncalibrated, handheld cameras.
With just a few annotations (representing 1-2% of the frames), we are able to
produce 2D results comparable to state-of-the-art fully supervised methods,
along with 3D reconstructions that are impossible with other existing
approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach
demonstrates impressive results on standard human datasets, as well as tigers,
cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos
captured casually in a zoo. We release the codebase for MBW as well as this
challenging zoo dataset, consisting of image frames of tail-end distribution
categories with their corresponding 2D and 3D labels generated with minimal
human intervention.
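The bootstrapping loop described above can be made concrete with a short sketch: a detector trained on the 1-2% of hand-labeled frames predicts 2D landmarks in every view, each frame's multi-view detections are scored for geometric consistency, and only consistent frames are promoted to pseudo-labels before the detector is retrained. The sketch below is a minimal illustration under stated assumptions: the function names (`multiview_residual`, `grow_pseudo_labels`), the landmark count, the thresholds, and the low-rank factorization used in place of the paper's learned non-rigid 3D prior and deep optical flow are all hypothetical and do not reflect the released MBW codebase.

```python
import numpy as np


def multiview_residual(points_2d, rank=3):
    """Low-rank factorization residual of 2D landmarks stacked across views.

    points_2d: (n_views, K, 2) array with the same K landmarks detected in each
    uncalibrated view. A small residual suggests the detections are geometrically
    consistent; MBW scores consistency with a learned non-rigid 3D neural prior,
    for which this rank constraint is only a crude stand-in.
    """
    n_views, k, _ = points_2d.shape
    w = points_2d.transpose(0, 2, 1).reshape(2 * n_views, k)  # measurement matrix
    w = w - w.mean(axis=1, keepdims=True)                     # remove per-view translation
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    w_low = (u[:, :rank] * s[:rank]) @ vt[:rank]              # best rank-`rank` fit
    return float(np.linalg.norm(w - w_low) / np.sqrt(k))


def grow_pseudo_labels(detections, hand_labels, threshold=1.0):
    """One bootstrapping round: keep detections that pass the multi-view check.

    detections: (T, n_views, K, 2) current 2D predictions (in the paper these come
    from a detector trained on the hand labels and smoothed with deep optical flow).
    hand_labels: dict frame_idx -> (n_views, K, 2), the 1-2% manually labeled frames.
    The full method alternates this filtering step with retraining the detector
    on the grown label set.
    """
    pseudo = dict(hand_labels)
    for t in range(detections.shape[0]):
        if t not in pseudo and multiview_residual(detections[t]) < threshold:
            pseudo[t] = detections[t]
    return pseudo


if __name__ == "__main__":
    # Smoke test on random data; it only exercises the code path, and the
    # threshold above is arbitrary rather than tuned.
    rng = np.random.default_rng(0)
    T, n_views, K = 200, 3, 17
    detections = rng.normal(size=(T, n_views, K, 2))
    hand_labels = {t: detections[t] for t in (0, 100, 199)}  # ~1-2% of frames
    print(len(grow_pseudo_labels(detections, hand_labels)), "frames kept as labels")
```

In the actual pipeline the verification model and acceptance threshold matter far more than this toy suggests; the key design choice is that frames are only added as labels when two or three independent views agree, so detector errors do not compound across bootstrapping rounds.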
Related papers
- Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular
Videos in the Wild [10.849750765175754]
POTR-3D is a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE.
It generalizes robustly to diverse unseen views, recovers poses reliably under heavy occlusions, and produces more natural and smoother outputs.
arXiv Detail & Related papers (2023-09-15T06:17:22Z)
- Unsupervised Multi-view Pedestrian Detection [12.882317991955228]
We propose an Unsupervised Multi-view Pedestrian Detection approach (UMPD) to eliminate the need for annotations when learning a multi-view pedestrian detector via 2D-3D mapping.
SIS is proposed to extract unsupervised representations of multi-view images, which are converted into 2D pedestrian masks as pseudo labels.
GVD encodes multi-view 2D images into a 3D volume to predict voxel-wise density and color via 2D-to-3D geometric projection, trained by 3D-to-2D mapping.
arXiv Detail & Related papers (2023-05-21T13:27:02Z)
- Reconstructing Animatable Categories from Videos [65.14948977749269]
Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging.
We present RAC, which builds category-level 3D models from monocular videos while disentangling variation across instances from motion over time.
We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos.
arXiv Detail & Related papers (2023-05-10T17:56:21Z)
- Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories [80.30216777363057]
We introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets.
At test time, given a small number of video frames of an unseen object, Tracker-NeRF predicts the trajectories of its 3D points and generates new views.
Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines.
arXiv Detail & Related papers (2022-11-07T22:42:42Z)
- Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation [61.98690211671168]
We propose a Multi-level Attention Encoder-Decoder Network (MAED) to model multi-level attention in a unified framework.
Trained with the 3DPW training set, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm in PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z)
- AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild [51.35013619649463]
We present an extensive dataset of free-running cheetahs in the wild, called AcinoSet.
The dataset contains 119,490 frames of multi-view synchronized high-speed video footage, camera calibration files and 7,588 human-annotated frames.
The resulting 3D trajectories, human-checked 3D ground truth, and an interactive tool to inspect the data are also provided.
arXiv Detail & Related papers (2021-03-24T15:54:11Z)
- Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution [34.301501457959056]
We propose a temporal regression network with a gated convolution module to transform 2D joints to 3D.
A simple yet effective localization approach is also employed to transform the normalized pose into the global trajectory.
Our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods.
arXiv Detail & Related papers (2020-10-31T04:35:24Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Monocular, One-stage, Regression of Multiple 3D People [105.3143785498094]
We propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP).
Our method simultaneously predicts a Body Center heatmap and a Mesh map, which can jointly describe the 3D body mesh on the pixel level.
Compared with state-of-the-art methods, ROMP achieves superior performance on the challenging multi-person benchmarks.
arXiv Detail & Related papers (2020-08-27T17:21:47Z)
- Full-Body Awareness from Partial Observations [17.15829643665034]
We propose a self-training framework that adapts human 3D mesh recovery systems to consumer videos.
We show that our method substantially improves PCK and human-subject judgments compared to baselines.
arXiv Detail & Related papers (2020-08-13T17:59:11Z)