Pix2Point: Learning Outdoor 3D Using Sparse Point Clouds and Optimal
Transport
- URL: http://arxiv.org/abs/2107.14498v1
- Date: Fri, 30 Jul 2021 09:03:39 GMT
- Title: Pix2Point: Learning Outdoor 3D Using Sparse Point Clouds and Optimal
Transport
- Authors: Rémy Leroy, Pauline Trouvé-Peloux, Frédéric Champagnat,
Bertrand Le Saux, Marcela Carvalho
- Abstract summary: Deep learning has recently provided us with excellent results for monocular depth estimation.
We propose Pix2Point, a deep learning-based approach for monocular 3D point cloud prediction.
Our method relies on a 2D-3D hybrid neural network architecture, and a supervised end-to-end minimisation of an optimal transport divergence.
- Score: 35.10680020334443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Good quality reconstruction and comprehension of a scene rely on 3D
estimation methods. The 3D information was usually obtained from images by
stereo-photogrammetry, but deep learning has recently provided us with
excellent results for monocular depth estimation. Building up a sufficiently
large and rich training dataset to achieve these results requires onerous
processing. In this paper, we address the problem of learning outdoor 3D point
clouds from monocular data using a sparse ground-truth dataset. We propose
Pix2Point, a deep learning-based approach for monocular 3D point cloud
prediction, able to deal with complete and challenging outdoor scenes. Our
method relies on a 2D-3D hybrid neural network architecture, and a supervised
end-to-end minimisation of an optimal transport divergence between point
clouds. We show that, when trained on sparse point clouds, our simple yet
promising approach achieves better coverage of 3D outdoor scenes than efficient
monocular depth methods.
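The loss at the heart of this approach is an optimal transport divergence between the predicted and ground-truth point clouds. Below is a minimal sketch of an entropy-regularised OT (Sinkhorn) loss in PyTorch, assuming uniform point weights and batched clouds of shape (B, N, 3); the function name, tensor layout, hyper-parameters, and the choice of plain Sinkhorn rather than the authors' exact divergence are assumptions for illustration, not the paper's implementation.

```python
import math
import torch

def sinkhorn_ot_loss(pred, gt, eps=0.05, n_iters=100):
    """Entropy-regularised optimal transport cost between two point clouds.

    pred: (B, N, 3) predicted points; gt: (B, M, 3) ground-truth points.
    Uniform weights over points are assumed; log-domain Sinkhorn iterations
    are used for numerical stability. The result is differentiable in pred.
    """
    B, N, _ = pred.shape
    M = gt.shape[1]
    cost = torch.cdist(pred, gt, p=2) ** 2               # (B, N, M) squared distances
    log_a = pred.new_full((B, N), -math.log(N))          # log uniform source weights
    log_b = pred.new_full((B, M), -math.log(M))          # log uniform target weights
    f = torch.zeros(B, N, device=pred.device)            # dual potential on pred
    g = torch.zeros(B, M, device=pred.device)            # dual potential on gt
    for _ in range(n_iters):
        f = eps * log_a - eps * torch.logsumexp((g[:, None, :] - cost) / eps, dim=2)
        g = eps * log_b - eps * torch.logsumexp((f[:, :, None] - cost) / eps, dim=1)
    plan = torch.exp((f[:, :, None] + g[:, None, :] - cost) / eps)  # transport plan
    return (plan * cost).sum(dim=(1, 2)).mean()          # mean OT cost over the batch

# Usage: the loss can drive a point-cloud prediction network end to end.
pred = torch.randn(2, 1024, 3, requires_grad=True)
gt = torch.randn(2, 2048, 3)
sinkhorn_ot_loss(pred, gt).backward()
```

A smaller eps approximates the exact Wasserstein cost more closely but needs more iterations to converge; a debiased Sinkhorn divergence is a common refinement.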
Related papers
- ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images [19.02348585677397]
Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase.
The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated.
We propose a novel framework ImOV3D to leverage pseudo multimodal representation containing both images and point clouds (PC) to close the modality gap.
arXiv Detail & Related papers (2024-10-31T15:02:05Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - Weakly Supervised 3D Multi-person Pose Estimation for Large-scale Scenes
based on Monocular Camera and Single LiDAR [41.39277657279448]
We propose a monocular camera and single LiDAR-based method for 3D multi-person pose estimation in large-scale scenes.
Specifically, we design an effective fusion strategy to take advantage of multi-modal input data, including images and point clouds.
Our method exploits the inherent geometry constraints of point cloud for self-supervision and utilizes 2D keypoints on images for weak supervision.
arXiv Detail & Related papers (2022-11-30T12:50:40Z) - Towards Accurate Reconstruction of 3D Scene Shape from A Single
Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allow us to recover 3D scene shapes (see the scale-and-shift sketch after this list).
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
arXiv Detail & Related papers (2022-08-28T16:20:14Z) - Points2NeRF: Generating Neural Radiance Fields from 3D point cloud [0.0]
We propose representing 3D objects as Neural Radiance Fields (NeRFs).
We leverage a hypernetwork paradigm and train the model to take a 3D point cloud with associated color values as input.
Our method provides efficient 3D object representation and offers several advantages over the existing approaches.
arXiv Detail & Related papers (2022-06-02T20:23:33Z) - Unsupervised Learning of Fine Structure Generation for 3D Point Clouds
by 2D Projection Matching [66.98712589559028]
We propose an unsupervised approach for 3D point cloud generation with fine structures.
Our method can recover fine 3D structures from 2D silhouette images at different resolutions.
arXiv Detail & Related papers (2021-08-08T22:15:31Z) - From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object
Detection [101.20784125067559]
We propose a new architecture, namely Hallucinated Hollow-3D R-CNN, to address the problem of 3D object detection.
In our approach, we first extract multi-view features by sequentially projecting the point clouds into the perspective view and the bird's-eye view.
The 3D objects are detected via a box refinement module with a novel Hierarchical Voxel RoI Pooling operation.
arXiv Detail & Related papers (2021-07-30T02:00:06Z) - Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, we devise a principled geometry formula that couples 2D and 3D depth predictions in the monocular 3D object detection network through projective modeling (a pinhole illustration follows this list).
Our method improves the detection performance of the state-of-the-art monocular method by 2.80% on the moderate test setting, without extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z) - Exploring Deep 3D Spatial Encodings for Large-Scale 3D Scene
Understanding [19.134536179555102]
We propose an alternative approach that overcomes the limitations of CNN-based approaches by encoding the spatial features of raw 3D point clouds into undirected graph models.
The proposed method achieves accuracy on par with the state of the art, with improved training time and model stability, indicating strong potential for further research.
arXiv Detail & Related papers (2020-11-29T12:56:19Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.