FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction
- URL: http://arxiv.org/abs/2105.01937v1
- Date: Wed, 5 May 2021 09:08:12 GMT
- Title: FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction
- Authors: Brian Gordon, Sigal Raab, Guy Azov, Raja Giryes, Daniel Cohen-Or
- Abstract summary: Multi-view algorithms strongly depend on camera parameters, in particular, the relative positions among the cameras.
We introduce FLEX, an end-to-end parameter-free multi-view model.
We demonstrate results on the Human3.6M and KTH Multi-view Football II datasets.
- Score: 70.09086274139504
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The increasing availability of video recordings made by multiple cameras has
offered new means for mitigating occlusion and depth ambiguities in pose and
motion reconstruction methods. Yet, multi-view algorithms strongly depend on
camera parameters, in particular, the relative positions among the cameras.
Such dependency becomes a hurdle once shifting to dynamic capture in
uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an
end-to-end parameter-free multi-view model. FLEX is parameter-free in the sense
that it does not require any camera parameters, intrinsic or extrinsic. Our key
idea is that the 3D angles between skeletal parts, as well
as bone lengths, are invariant to the camera position. Hence, learning 3D
rotations and bone lengths rather than locations allows predicting common
values for all camera views. Our network takes multiple video streams, learns
fused deep features through a novel multi-view fusion layer, and reconstructs a
single consistent skeleton with temporally coherent joint rotations. We
demonstrate quantitative and qualitative results on the Human3.6M and KTH
Multi-view Football II datasets. We compare our model to state-of-the-art
methods that are not parameter-free and show that in the absence of camera
parameters, we outperform them by a large margin while obtaining comparable
results when camera parameters are available. Code, trained models, video
demonstration, and additional materials will be available on our project page.
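To make the key idea concrete, here is a minimal sketch (NumPy, not the authors' released code) of forward kinematics: from per-joint rotations and bone lengths, the camera-invariant quantities FLEX learns, a single skeleton is rebuilt in a root-relative frame that every view can agree on. The joint tree, rest-pose bone direction, and function names are illustrative assumptions.

```python
# Hedged sketch: rotations + bone lengths -> joint positions, with no camera
# anywhere in the computation, which is why all views can share the targets.
import numpy as np

# Hypothetical 5-joint chain (parents listed before children): a pelvis root,
# a spine -> neck -> head chain, and one hip hanging off the pelvis.
PARENTS = [-1, 0, 1, 2, 0]            # parent index per joint; -1 marks the root
BONE_DIR = np.array([0.0, 1.0, 0.0])  # assumed rest-pose bone direction (unit vector)

def forward_kinematics(rotations, bone_lengths):
    """rotations: (J, 3, 3) local rotation per joint; bone_lengths: (J,).
    Returns (J, 3) joint positions in a root-relative, camera-free frame."""
    J = len(PARENTS)
    global_rot = [np.eye(3)] * J
    positions = np.zeros((J, 3))
    for j, p in enumerate(PARENTS):
        if p < 0:                      # root: carries orientation, stays at origin
            global_rot[j] = rotations[j]
            continue
        global_rot[j] = global_rot[p] @ rotations[j]
        positions[j] = positions[p] + global_rot[p] @ (bone_lengths[j] * BONE_DIR)
    return positions

# Identity rotations and unit bones stack the chain straight up the y-axis,
# regardless of where any camera stands.
joints = forward_kinematics(np.stack([np.eye(3)] * 5), np.ones(5))
```

Since the output never references a camera frame, rotation and bone-length predictions fused from any number of uncalibrated views can be supervised toward one shared skeleton.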
Related papers
- Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention [62.2447324481159]
Cavia is a novel framework for camera-controllable, multi-view video generation.
Our framework extends the spatial and temporal attention modules, improving both viewpoint and temporal consistency.
Cavia is the first framework of its kind, letting the user specify distinct camera motions while still obtaining consistent object motion.
arXiv Detail & Related papers (2024-10-14T17:46:32Z)
- MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image Acquisition Systems [22.494866649536018]
Neural Radiance Fields (NeRF) use multi-view images for 3D scene representation, demonstrating remarkable performance.
Most previous NeRF-based methods assume a unique camera and rarely consider multi-camera scenarios.
We propose MC-NeRF, a method that enables joint optimization of both intrinsic and extrinsic parameters alongside NeRF.
arXiv Detail & Related papers (2023-09-14T16:40:44Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- Multi-task Learning for Camera Calibration [3.274290296343038]
We present a unique method for predicting intrinsic (principal point offset and focal length) and extrinsic (baseline, pitch, and translation) properties from a pair of images.
The camera projection loss (CPL) reconstructs the 3D points through a neural network that embeds the camera model, then uses the reconstruction loss to recover the camera specifications, allowing the desired parameters to be estimated.
arXiv Detail & Related papers (2022-11-22T17:39:31Z)
- Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [10.625664582408687]
3D Human Pose Estimation (HPE) must handle several variable elements: the number of views, the length of the video sequence, and whether camera calibration is available.
We propose a unified framework named Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video lengths without calibration (a calibration-free fusion sketch appears after this list).
arXiv Detail & Related papers (2021-10-11T08:57:43Z)
- Camera Calibration through Camera Projection Loss [4.36572039512405]
We propose a novel method to predict intrinsic (focal length and principal point offset) parameters using an image pair.
Unlike existing methods, we propose a new representation that incorporates the camera model equations as a neural network within a multi-task learning framework (a minimal sketch of this projection-loss idea follows this list).
Our approach outperforms both deep learning-based and traditional methods on 7 of the 10 parameters evaluated.
arXiv Detail & Related papers (2021-10-07T14:03:10Z)
- MonoCInIS: Camera Independent Monocular 3D Object Detection using Instance Segmentation [55.96577490779591]
We show that more data does not automatically guarantee better performance; rather, methods need a degree of 'camera independence' in order to benefit from large and heterogeneous training data.
arXiv Detail & Related papers (2021-10-01T14:56:37Z)
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
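Both camera projection loss entries above describe the same core mechanism: embed the pinhole camera equations in the network's forward pass so that the 2D reprojection error can be driven back onto the camera parameters themselves. Below is a minimal, hypothetical sketch of that idea; the function names, the toy focal-length recovery, and the numerical gradient are illustrative assumptions, not the papers' implementation.

```python
# Hedged sketch of a camera projection loss (CPL): a differentiable pinhole
# projection whose reprojection error is minimized over the camera parameters.
import numpy as np

def project(points_3d, focal, principal_point):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    x = points_3d[:, 0] / points_3d[:, 2]
    y = points_3d[:, 1] / points_3d[:, 2]
    return np.stack([focal * x + principal_point[0],
                     focal * y + principal_point[1]], axis=1)

def cpl_loss(focal, principal_point, points_3d, observed_2d):
    """Mean squared reprojection error; minimizing it over the camera
    parameters is what recovers them."""
    return np.mean((project(points_3d, focal, principal_point) - observed_2d) ** 2)

# Toy check: recover an unknown focal length by gradient descent on the loss.
rng = np.random.default_rng(0)
pts = rng.uniform([-1.0, -1.0, 2.0], [1.0, 1.0, 6.0], size=(64, 3))
target = project(pts, focal=800.0, principal_point=(320.0, 240.0))

f, lr, eps = 500.0, 10.0, 1e-3   # wrong initial guess for the focal length
for _ in range(200):
    # central-difference gradient keeps the sketch dependency-free
    grad = (cpl_loss(f + eps, (320.0, 240.0), pts, target)
            - cpl_loss(f - eps, (320.0, 240.0), pts, target)) / (2 * eps)
    f -= lr * grad
print(round(f))                  # -> 800
```

In the papers this error serves as a training loss for a network that predicts the parameters; the hand-rolled descent above only demonstrates that the loss surface indeed pins them down.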
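The MTF-Transformer entry, like FLEX's multi-view fusion layer, must fuse features from a varying number of uncalibrated views. One standard way to do that is attention-style weighting over per-view features, sketched below; every name, shape, and the random toy weights are assumptions for illustration, not either paper's architecture.

```python
# Hedged sketch of calibration-free multi-view fusion: score each view's
# feature against a learned query and blend with softmax weights, so the
# computation never touches camera placement and accepts any view count.
import numpy as np

def fuse_views(view_features, query, key_proj, value_proj):
    """view_features: (V, D), one feature vector per camera view (V may vary).
    Returns a single (D,) fused feature, independent of view order and count."""
    keys = view_features @ key_proj                 # (V, D)
    values = view_features @ value_proj             # (V, D)
    scores = keys @ query / np.sqrt(len(query))     # (V,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over the views
    return weights @ values                         # confidence-weighted blend

rng = np.random.default_rng(0)
D = 16
query, Wk, Wv = rng.normal(size=D), rng.normal(size=(D, D)), rng.normal(size=(D, D))
fused_a = fuse_views(rng.normal(size=(2, D)), query, Wk, Wv)  # two views...
fused_b = fuse_views(rng.normal(size=(5, D)), query, Wk, Wv)  # ...or five, same code path
```

Because the softmax normalizes over however many views arrive, the same learned weights serve two cameras or five, which is the kind of flexibility both papers aim for without calibration.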